How We Ranked First in the HRT (Hudson River Trading) and Partcl Macro Placement Challenge

A technical write-up of the approach behind our verified rank-1 score on the IBM macro-placement benchmarks.

run benchmark evaluations collect proxy components reject invalid or timeout cases compare against current baseline promote only full-suite improvements

The graph is from the early auto-research phase. It shows the main point: progress came from repeated full-suite measurements, not from judging placements visually.

Phase 3: Fast Runtime Proxy Evaluation

We could not call the public scorer after every possible move, so we built an incremental proxy evaluator for local repair. For each trial macro move, it updated only the affected state:

A move was kept only if the measured proxy improved. The subtle part was numerical compatibility. A placement that looked legal in float64 local geometry could still become an overlap after the official float32 scorer rounded positions. We moved legality and grid calculations toward scorer-compatible float32 behavior and added explicit clearance gaps, so “zero overlaps” locally matched zero overlaps in final scoring.

Phase 4: Multi-Start Search

The naive multi-start version was simple: generate starts, run repair, pick the best. That helped, but blind restarts waste the one-hour benchmark budget.

The useful version was multi-seam search. We generated physically different starting basins, legalized them, prescored proxy and congestion, and sent only the best few into long repair.

The seed portfolio included:

One seed-attribution experiment made this concrete. It ran 17 / 17 public IBM designs with 10 workers and evaluated 12 seed algorithms: raw legalized starts, analytical starts, blend starts, expansion starts, route-channel starts, and a few exploratory heuristic starts.

The winners were not evenly distributed:


That result was useful because it removed guesswork. Expansion, raw legalized, and route-channel basins were worth keeping. The exploratory heuristic seeds did not beat those seed classes. The traced phases also showed that exact congestion refinement was the best final traced stage for every design in that experiment.

The important step was prescoring. A seed was not allowed to consume the full budget just because it looked interesting. The harness legalized it, computed proxy components, ranked by proxy and congestion, then polished only a small queue.

The pattern looked like this:

generate seed basins
legalize hard macros
prescore proxy and congestion
keep the best few legal seeds
spend repair time on the winners
fall back if a candidate source fails

That made multi-start a seed-selection problem rather than a random-restart problem.

Phase 5: Synthetic Clearance

One of the most important seed-generation approaches came from a simple physical observation: the baseline was too packed in the wrong places.

We started adding synthetic clearance to smaller macros before legal repair. The useful configuration selected movable macros up to the 97th percentile by area, added artificial separation, and used a vectorized Jacobi-style push-apart step before hard-macro legalization.

Key settings in the final clearance approach were:

The point was not to change official macro sizes. The point was to create routing slack before exact repair. The legalizer then repaired hard-hard conflicts using pairwise push-apart, conflict-component handling, spiral search, emergency clearing, and displacement reduction.

Synthetic clearance worked when it created better candidate basins. It failed when it pushed connected objects too far apart. That is why it stayed behind exact proxy acceptance.

Phase 6: Soft-Macro Repair

Early versions treated soft macros mostly as cleanup. That was not enough.

Soft macros can move without reopening the hard-macro legality problem. That made them useful late-stage actuators for density and congestion. A hard macro move can destroy a channel or create a hard overlap. A soft-macro move can relieve routing pressure with much lower geometric risk.

The strongest version was interleaved coordinate descent:

The practical settings reflected that priority:

At this point, the flow had the pieces needed for measured repair: seed selection, exact local repair, soft-macro congestion control, and runtime allocation.

Phase 7: Congestion-Weighted Search

The official proxy was fixed:

wirelength + 0.5 * density + 0.5 * congestion

But the internal search objective did not have to rank candidate proposals with the same naive balance at every stage. Once we saw congestion dominating the remaining score, we tested congestion-weighted repair. In the late congestion-weighted repair line, a promoted setting used:

WL + density + 2.5 * congestion

That did not replace final evaluation. It changed proposal pressure. A candidate still had to survive the real proxy gates before it mattered.

This distinction was important. The internal objective could bias the search toward opening routing capacity. The official proxy still decided whether the placement improved.

The promoted congestion-weighted approach reached a full-suite average around 1.0471 with all designs valid and under the time cap. The useful change was proposal ranking, not the final scoring rule.

Phase 8: Plateau Escape

After enough exact repair, the optimizer reached plateaus: many legal candidates were evaluated, but very few were accepted. At that point, running the same move class longer was not enough.

The later GPU logs made the scale clearer:

Across the GPU repair logs, the system evaluated 11,674,931 candidates and accepted 229,169 moves, an accept rate of 1.96%. That telemetry drove the next step: change the proposal distribution, not the acceptance rule.

The plateau escape move classes were:

The evaluator stayed strict. We changed how candidates were generated, not what counted as success.

Phase 9: GPU Acceleration

Early in the contest, we assumed GPU throughput would decide how many useful candidates we could evaluate inside one hour. Development runs used NVIDIA L4 workers; the public evaluation environment listed an NVIDIA RTX 6000 Ada 48GB with AMD EPYC 9655P CPU, 16 cores, and 100GB memory. The contest Docker base was pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime.

The main change was moving candidate generation and ranking out of Python loops:

In a congestion-heavy GPU repair setting, the proposal stage ranked 80 macros x top-160 proposals, or 12,800 candidate slots, before exact accept/reject filtering. Triton experiments batched 8,192 to 16,384 proposals per pass. The GPU work did not replace scoring; it made scoring spend time on better candidates.

Phase 10: Xplace-RA Route-Aware Seeds

Xplace is a GPU-accelerated analytical placer from CUHK EDA. We used a patched Xplace checkout as a route-aware seed generator, not as the final scorer. This is different from XLA, or Accelerated Linear Algebra; XLA was not part of this flow.

In our flow, Xplace-RA meant:

The score progression looked like this:

The submit-ready selector was especially important because it stayed generic: 17 / 17 designs valid, 0 / 17 overlap failures, average wirelength 0.075941, density 0.519471, congestion 1.251706, and a 3300-second placement cap.

The route-aware build image used pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel. The Docker path cloned cuhk-eda/Xplace into /opt/Xplace, copied our patch bundle on top, built it with CUDA architecture 89, and set EVOXRAXPLACE_ROOT=/opt/Xplace. If patched Xplace was missing or failed gates, the selector used the in-house exact-repair candidate instead.

That fallback logic was part of the algorithm. A strong optional candidate source is useful only if failure does not poison the final placement.

Phase 11: Triton Candidate Ranking

Triton was used as an experimental kernel path for candidate ranking. The goal was specific: batch many proposals, rank them on GPU, and feed only promising candidates into the exact evaluator.

The first modes did not beat the baseline, so Triton stayed gated:

The lesson was simple: faster candidate generation helps only when it increases accepted improvements.

The Final System

The runtime-controlled flow looked like this:

What We Learned

Macro placement is not solved by a floorplan that only looks clean. The useful question is whether the placement leaves routing capacity where the netlist needs it.

Our main takeaways were:

The final verified leaderboard score was a rank-1 average proxy cost of 0.9507 with zero hard-macro overlaps. The technical process behind that score was straightforward in principle: propose broadly, score exactly, accept carefully, and validate across the full benchmark suite.


The team behind Archgen Submission :
Naveen Venkat, Hariharan Ayappane, and Jishnu Madhav.

Visit ArchGen.tech for more details.


Disclaimer: This article was written with the help of AI.

Back to Blog Original Substack