Worklog - Detailed Router

The Via-Site Roulette Problem

This is the short version of a router debugging loop: the board was close, but the last opens kept moving. The lesson matched our own SOTA survey: detailed routing needs deterministic candidate ordering and a real pin-access/via-site plan, not another blind cleanup sweep.

Reference

OpenROAD / TritonRoute

Observed

The same board kept changing failure shape

Single-round runs bounced between roughly four and seven failed router nets. The visible failures were concentrated around BGA and HDMI escape channels, but the specific victims changed between runs. That made cleanup sweeps look more effective than they really were.

SOTA Gap

We were missing detailed-router discipline

The SOTA survey calls this out in two places: OpenROAD/TritonRoute-style flows use global guides plus pin-access/via candidates before detailed route commit, while KiCad/Freerouting-style flows rely on deterministic costs, legal local edits, and post-route optimization rather than random late cleanup.

Tried

Hard-net-first and final dangling-via cleanup were not enough

Hard-net-first ordering sometimes improved one benchmark run, then regressed on later runs. A final dangling-signal-via filter removed via_dangling DRC markers but converted them into worse opens. Both were useful experiments, but neither was promoted to default behavior.

Closed

Power-via hole spacing and deterministic A* frontier ordering

Same-net power vias now respect mechanical drill spacing unless they are exactly the same site and can be canonicalized. The 3D A* frontier now has explicit tie-breaks, and via edges are sorted by cost, target layer, and padstack. That does not solve all BGA escapes, but it closes a real reproducibility gap.
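A minimal sketch of what explicit tie-breaks can look like, assuming integer (layer, x, y) nodes. The class and field names here are hypothetical and are not taken from the router source; the point is only that the heap key is a total order, so equal-cost nodes never fall back on insertion order.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ViaEdge:
    cost: float
    target_layer: int  # hypothetical layer index
    padstack: str      # hypothetical padstack name


def sorted_via_edges(edges):
    """Order via edges by cost, then target layer, then padstack name."""
    return sorted(edges, key=lambda e: (e.cost, e.target_layer, e.padstack))


class Frontier:
    """A* open set whose pop order does not depend on insertion order."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def push(self, f_cost, g_cost, node):
        # node is (layer, x, y) with integer fields, so tuples compare cleanly.
        # Ties on f prefer the larger g (smaller remaining estimate), then the
        # node itself; the sequence number only separates exact duplicates.
        heapq.heappush(self._heap, (f_cost, -g_cost, node, next(self._seq)))

    def pop(self):
        f_cost, neg_g, node, _ = heapq.heappop(self._heap)
        return node, f_cost, -neg_g
```

Pushing the same nodes in a different order yields the same pop sequence, which is the property the reproducibility fix depends on.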

Closed

Candidate validation learned about pads

A later run exposed a stricter detailed-router gap: rescue candidates were checked against traces and vias, but not foreign pads. That allowed a route to look legal to the router while KiCad DRC later reported a trace-pad short. Candidate validation now rejects trace-pad collisions before commit while preserving legitimate access to the route net’s own through-hole pad obstacle.

Closed

Benchmark promotion now treats shorts as fatal

The three-round rerun reproduced the dangerous pattern: R2 reached 2 shorts / 4 opens, which looks better under a naive actionable-count score than R1 at 0 shorts / 10 opens. The loop now ranks candidates by short-free status, short count, open count, then total DRC, so it restored R1 instead of promoting an electrically invalid board.
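The ranking rule can be written as a lexicographic sort key. This is a sketch under assumptions: RoundResult and its field names are hypothetical, and the total_drc values are placeholders because the notes above only record shorts and opens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundResult:
    name: str
    shorts: int
    opens: int
    total_drc: int


def promotion_key(result):
    # Lexicographic order: short-free boards first, then fewer shorts,
    # then fewer opens, then fewer total DRC markers.
    return (result.shorts > 0, result.shorts, result.opens, result.total_drc)


def naive_key(result):
    # The old "actionable count" style score that let shorts hide behind opens.
    return result.shorts + result.opens


def best_round(results, key=promotion_key):
    return min(results, key=key)


# Shorts and opens are from the rerun; total_drc is a placeholder.
r1 = RoundResult("R1", shorts=0, opens=10, total_drc=10)
r2 = RoundResult("R2", shorts=2, opens=4, total_drc=6)
```

With these inputs the naive key prefers R2 and the promotion key prefers R1, which matches the restore behavior described above.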

Tried

Via-site retries helped locally but did not move the best score

Component patching now turns illegal drill-spacing sites into temporary all-layer keepouts and retries the same bridge before asking ECO to move blockers. The next benchmark exercised that path and stayed safe, but the headline result was unchanged: R1 remained the best short-free board at 0 shorts / 10 opens, while R2 still regressed to 2 shorts / 4 opens.

Still Open

The remaining opens are pin-access and shove failures

The current best is clean on shorts, but the last opens concentrate around dense HDMI/BGA escape nets and nearby power-filter islands. The logs show repeated rejected component patches, hole-spacing dead ends, and ECO attempts that cannot move enough neighboring geometry. That points back to the SOTA gaps we have not closed yet: a real pin-access oracle, global routing guides, and recursive shove/rollback.

Closed

Detailed ECO learned recursive shove and rollback

The component patcher no longer treats a blocking track as immovable. It now tries a transactional shove: remove the target, reroute it, recursively reroute the blocker, and accept only if the full transaction reconnects cleanly. That is the first generic KiCad-PNS-style shove stack in our automated router, not a net-name special case.
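A toy illustration of the transactional idea on a single-layer grid, not the router's actual shove code. Everything here (the BFS router, the pins and routes dictionaries, the depth limit) is an assumption made for the sketch. The input routes are never mutated, so returning None is the rollback.

```python
from collections import deque


def bfs_route(start, goal, blocked, width, height):
    """Shortest 4-connected grid path avoiding blocked cells, or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in blocked and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None


def occupied(routes, pins, skip):
    """Pins and routed cells of every net except those in skip."""
    cells = set()
    for net, (a, b) in pins.items():
        if net not in skip:
            cells.update((a, b))
            cells.update(routes.get(net, ()))
    return cells


def shove_route(net, routes, pins, size, depth=2):
    """Return a new routes dict with net connected, or None (rollback)."""
    width, height = size
    start, goal = pins[net]
    path = bfs_route(start, goal, occupied(routes, pins, {net}), width, height)
    if path is not None:
        return {**routes, net: path}
    if depth == 0:
        return None
    for blocker in sorted(n for n in routes if n != net):
        # Work on a copy: rip up the blocker, route the target, then
        # recursively reroute the blocker inside the same transaction.
        trial = {n: p for n, p in routes.items() if n not in (blocker, net)}
        path = bfs_route(start, goal, occupied(trial, pins, {net}), width, height)
        if path is None:
            continue
        trial[net] = path
        result = shove_route(blocker, trial, pins, size, depth - 1)
        if result is not None:
            return result  # commit: every net in the transaction reconnected
    return None
```

The design choice worth noting is that acceptance is all-or-nothing: a target route that strands its blocker is never committed.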

Closed

Pin access is filtered before patch routing

Patch candidate generation now builds pad-access points through the same obstacle model the router will use, then drops access nodes whose center-to-access segment or landing point is blocked. This is still smaller than a full TritonRoute pin-access oracle, but it moves the failure earlier: bad pad exits are rejected before they become misleading bridge candidates.
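A sketch of the filtering step, assuming foreign copper is approximated by circles (x, y, radius) and access points are sampled radially around the pad center. Both are simplifications chosen for the example; the real obstacle model is whatever the router uses.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the closed segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def radial_access_points(center, radius, count=8):
    """Evenly spaced candidate access points on a circle around the pad."""
    cx, cy = center
    step = 2.0 * math.pi / count
    return [(cx + radius * math.cos(k * step), cy + radius * math.sin(k * step))
            for k in range(count)]


def filter_access_points(center, candidates, obstacles, clearance):
    """Keep candidates whose center-to-access segment clears every obstacle.

    The landing point is the end of that segment, so one distance check
    covers both the exit stub and the landing site.
    """
    kept = []
    for point in candidates:
        blocked = any(
            point_segment_distance((ox, oy), center, point) < radius + clearance
            for ox, oy, radius in obstacles
        )
        if not blocked:
            kept.append(point)
    return kept
```

For a pad at the origin with a foreign pad just to its right, the east-facing exit is dropped before it can become a bridge candidate, while the other exits survive.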

Closed

Global guides became route intent

Coarse route guides are no longer only a last-ditch fallback after direct A* fails. Long/problematic branches try guide-first routing, and each guide is projected into a multi-layer sketch corridor that gives detailed A* a soft reward for staying near the intended topology. That closes the Siemens/Altium sketch-routing gap in generic, machine-generated form.
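One way to express the soft reward, as a sketch: the guide is a list of coarse (gx, gy) cells, it is projected onto every candidate layer, and detailed steps inside the corridor get a discounted cost. The cell size, halo, and 20% reward are illustrative values, not the router's settings.

```python
def build_corridor(guide_cells, layers, halo=1):
    """Project coarse guide cells onto each layer, widened by halo cells."""
    corridor = set()
    for gx, gy in guide_cells:
        for dx in range(-halo, halo + 1):
            for dy in range(-halo, halo + 1):
                for layer in layers:
                    corridor.add((layer, gx + dx, gy + dy))
    return corridor


def step_cost(base_cost, node, corridor, cell_size, reward=0.2):
    """Discount a step that lands inside the corridor; never forbid one outside."""
    layer, x, y = node
    cell = (layer, int(x // cell_size), int(y // cell_size))
    if cell in corridor:
        return base_cost * (1.0 - reward)
    return base_cost
```

Because the reward lowers edge costs, an A* heuristic that was admissible for the undiscounted costs needs to be scaled by (1 - reward) to stay admissible.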

Closed

Strict candidate validation moved ahead of commit

Rescue, direct patch, and detailed ECO candidates now run through a pre-commit geometry validator that rejects trace-via, trace-pad, trace-trace, via-track, via-via, and via-pad conflicts. The important architectural change is that cleanup sweeps are no longer the first place those illegal candidates are discovered.

Closed

Global routing started reserving capacity before detailed A*

The router now builds a full set of coarse branch guides before pathfinding, sorts harder branches earlier, reserves their guide corridors into Pathfinder history, and seeds the branch guide cache. This is the first real OpenROAD/FastRoute-style global-routing stage in our flow, although it is still a capacity bias rather than a complete guide database.
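A rough sketch of the reservation pass, assuming each branch carries a difficulty score and a list of coarse guide cells. The dictionary layout, the difficulty values, and the unit history increment are hypothetical; the sketch only shows the ordering and the capacity bias, not a guide database.

```python
from collections import Counter


def reserve_guides(branches, history_increment=1.0):
    """Sort harder branches first and reserve their guide cells as history cost."""
    # The name is a secondary key so equal-difficulty branches stay deterministic.
    ordered = sorted(branches, key=lambda b: (-b["difficulty"], b["name"]))
    history = Counter()
    guide_cache = {}
    for branch in ordered:
        guide_cache[branch["name"]] = tuple(branch["guide"])
        for cell in branch["guide"]:
            history[cell] += history_increment
    return [b["name"] for b in ordered], history, guide_cache


def congestion_cost(cell, history, base=1.0):
    """Cost of entering a coarse cell: base cost plus reserved history."""
    return base + history[cell]
```

Cells that several guides want become more expensive before detailed A* starts, which biases later branches away from contested corridors without forbidding them.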

Measured

Single-run scores vary; placer non-determinism is dominating

One-round benchmarks have produced 0 shorts with 7, 8, and 12 unique opened nets across consecutive runs, with no router-source change between some of them. The PyTorch analytical placer runs on CUDA without an explicit seed, so identical TOML input yields slightly different placements and therefore different routing problems. Until the placer is seeded, no single-run number can be read as an exact measure of router quality. The earlier 0 / 4 figure cited in older notes was really 2 shorts / 4 opens scored under an old promotion rule; it is electrically invalid and not a target.
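A minimal seeding helper, as a sketch. The function name is hypothetical and the placer's actual entry point is not shown here. The torch calls are standard PyTorch APIs, and the import is optional so the helper still works where torch is absent.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG and, if installed, PyTorch (CPU and all CUDA devices).

    Returns True when torch was found and seeded, False otherwise.
    """
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return False
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    # Prefer deterministic kernels; warn instead of raising when an op has none.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False
    return True
```

Seeding alone may not make CUDA runs bit-identical. Some cuBLAS operations also require the CUBLAS_WORKSPACE_CONFIG environment variable to be set before the process initializes CUDA, and ops without a deterministic implementation will only warn under warn_only=True.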

Closed

Endpoint radial access is now load-bearing for correctness

ROUTER_ENDPOINT_ACCESS defaults to true. With it disabled, the current binary emits 199 shorts on this board. The radial access nodes are no longer optional: they are how starved BGA pads pick up legal escape edges. Note that they add only planar (LayerEdge) edges, not via edges, and they do not yet check whether a candidate access point would be safe as a via site.

No-op

Pair coupling bias is decorative on this board

A targeted sweep confirmed that DIFF_PAIR_COUPLING_STRENGTH 0.15 vs 0.35 produces byte-identical router output: same trace count, same via count, same eleven failed nets. The reference-bias reward is a soft cost reduction (peaking at ~35% for parallel segments at the requested gap), and on this congestion landscape the alternatives are still much costlier, so A* picks the same path either way. The infrastructure is not harmful, but the worklog should not claim a closed gap; the real fix is coupled 2-net A*.
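A small numeric illustration of why a soft discount can be a no-op. It assumes the strength acts as the maximum fractional discount on a fully coupled path, and the path costs are invented for the example; the router's actual cost model may differ.

```python
def biased_cost(base_cost, coupled_fraction, strength):
    """Soft reward: discount proportional to how much of the path runs coupled."""
    return base_cost * (1.0 - strength * coupled_fraction)


def pick_path(candidates, strength):
    """candidates: list of (name, base_cost, coupled_fraction). Cheapest wins."""
    return min(
        candidates,
        key=lambda c: (biased_cost(c[1], c[2], strength), c[0]),
    )[0]


def breakeven_ratio(strength):
    """A fully coupled path wins only if it costs less than this multiple
    of the uncoupled path's cost."""
    return 1.0 / (1.0 - strength)


# Hypothetical costs: the coupled path is 1.8x the uncoupled one.
paths = [("uncoupled", 10.0, 0.0), ("coupled", 18.0, 1.0)]
```

At strength 0.35 the breakeven ratio is about 1.54, so a coupled path that costs 1.8 times the uncoupled one loses at both 0.15 and 0.35, and the output does not change.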

Closed

Pin-access oracle v2: row detection + widened gate clears six

The oracle went through two iterations. v1 filtered access-node placements by foreign-pad collisions on every spanned layer and took the floor from 12 unique opened nets to 8. v2 added (a) PCA-based connector-row detection, so signal pads in a row of foreign GND pads are restricted to perpendicular escape; (b) matching for both the __pad_obstacle__ and __pth_obstacle__ obstacle prefixes (the v1 oracle silently missed every PTH connector pin); and (c) a widened gate, so dense PTH endpoints get access-node injection even when they have plenty of natural visibility-graph edges. Net effect: 8 unique nets to 6 for v2, and 12 to 6 across both iterations, with HDD_LED_N, PWR_LED_P, RST_BTN_L (J5 control), I2C_SCL, CSI_D0N, and CSI_D1N all clearing. The widened gate (about 52% more access-node injection events: 1089 to 1660) is the bigger lever; the row-axis filter helps a small number of header endpoints specifically.
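A sketch of PCA-based row detection for 2D pad centers, using the closed-form eigen-decomposition of the 2x2 covariance matrix. The function names and the 0.05 linearity threshold are assumptions for illustration, not values from the oracle.

```python
import math


def row_axis(points, max_ratio=0.05):
    """Return the unit row direction if the pads are nearly collinear, else None."""
    n = len(points)
    if n < 3:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] in closed form.
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    lam_max = (sxx + syy) / 2.0 + spread
    lam_min = (sxx + syy) / 2.0 - spread
    if lam_max <= 0.0 or lam_min / lam_max > max_ratio:
        return None  # coincident pads, or a 2D cluster rather than a row
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


def perpendicular_escapes(axis):
    """The two unit directions perpendicular to the row axis."""
    ax, ay = axis
    return [(-ay, ax), (ay, -ax)]
```

For a horizontal header the axis comes out along x and the permitted escapes are straight up and down; a square cluster of pads returns None and is left unrestricted.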

Tried + reverted

Per-net layer cost knobs cannot fix middle-board congestion

After the oracle wins, two layer-assignment heuristics were tried. (a) Penalise F.Cu for diff-pair nets so HDMI/CSI lanes prefer B.Cu, freeing F.Cu for control nets: this broke BGA fanouts that need F.Cu microvias (8 unique opens to 13). (b) Discount In1.Cu for diff-pair nets so long traversals stay on the inner layer: this barely shifted route layers (B.Cu 13 vs baseline 15, In1.Cu unchanged), and chaotic per-pair side flips made things worse (8 to 13). Diagnostic: the actual emitted geometry already does multi-layer hop-overs freely (16 signal nets have >=3 vias each; F.Cu carries 257 segments, In1.Cu 79, B.Cu 76). Layer-switching capacity is not the bottleneck; genuine middle-board congestion is, and per-net layer cost knobs cannot redistribute that without spatial awareness.

Still Open

Six floor failures, mostly filter islands and diff-pair partners

The six floor failures are: AVDD12_FILT, AVDD33_FILT, and PCKIN_FILT (the filter-island family near the BGA; the topology detector for this corridor-contention pattern has landed, but the Steiner phase has not been built yet); CSI_CKN and HDMID1N (diff-pair partners: when one side wins the channel the other is starved, which requires true coupled 2-net A*); and DDC_SDA. Closing these needs the SOTA gaps still listed in priority order on the survey page: filter-island Steiner with hard reservations, true coupled diff-pair routing, and per-pair bundle planning.

Current Status

Best deterministic single-round score on the compact board is now 0 shorts / 6 unique opened nets (seed=42), down from 0 / 12 at the start of the session. The pin-access oracle, in two iterations, carried the entire delta; every other architectural attempt this session (filter-island priority, long-traversal priority, F.Cu penalty, In1.Cu discount, diff-pair offset twin) either broke even or regressed. The remaining six failures are mostly filter islands and diff-pair partners; they need the still-open SOTA gaps, not parameter tuning.

The next engineering rule is stricter than the last: if a single feature flag flip changes the score, the score was placement-luck, not router quality. Seed the placer, then measure architectural changes against a fixed problem.