34× Slower to 1.9× Faster: The Optimization Journey

January 2026 · Part 3 · Performance

This is Part 3 of "What If Probabilistic Programming Were Different?" Part 1 introduced the BEAM process runtime thesis. Part 2 showed feature parity and benchmark numbers. This part tells the story of how we got there — and what each step reveals about building a PPL on an interpreted runtime.


eXMC's first benchmark against PyMC was sobering. The medium hierarchical model (5 parameters, 2 groups): 3.3 ESS/s vs 113 ESS/s. A 34× gap.

Today, that same model runs at 298 ESS/s — 1.90× faster than PyMC. The journey from 34× slower to 1.9× faster took seven distinct optimization phases, each of which taught something about the nature of mixed-runtime PPL design. None required changing the NUTS algorithm. All of them were about understanding where computational work actually happens and why.

Phase 0: The Starting Point

The initial sampler was naive. It ran NUTS correctly — the posterior was right, the diagnostics were fine — but every operation went through Nx tensor operations, and every tensor lived on whatever backend EXLA's JIT returned.

Medium model: 3.3 ESS/s
PyMC:         113 ESS/s
Gap:          34×

The profiler showed the problem immediately: trivial scalar operations (adding two numbers, checking a sign) were taking 70–150 microseconds each because they dispatched through the EXLA runtime.

Phase 1: The JIT Boundary Discovery (34× → 23×)

The NUTS tree builder does two kinds of work. Gradient computation is the expensive part — matrix multiplies, automatic differentiation. This should be JIT-compiled. Tree logic is the cheap part — "did the trajectory turn around?", "which proposal do we accept?" This is scalar arithmetic and branching.

The problem: these two kinds of work shared the same tensor backend. EXLA's JIT-compiled step_fn returned tensors on EXLA.Backend. These flowed into the tree builder, where every Nx.add dispatched through the EXLA runtime — even for single numbers.

Operation                    EXLA.Backend    BinaryBackend    Ratio
Nx.to_number(Nx.sum(...))    150 μs          5 μs             30×
Nx.add                       69 μs           7 μs             ~10×

The fix was four lines: Nx.backend_copy(tensor, Nx.BinaryBackend) on step_fn outputs at the tree leaf level. On the simple model: 70 → 211 ESS/s (3×).
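
A minimal sketch of where that copy lives, assuming a leaf-building helper like the hypothetical build_leaf/3 below (eXMC's actual names may differ):

defp build_leaf(state, direction, step_fn) do
  # One JIT-compiled leapfrog step; its outputs live on EXLA.Backend.
  {q, p, logp, grad} = step_fn.(state, direction)

  # Copy the results to BinaryBackend once, at the boundary, so the
  # downstream tree logic (scalar adds, sign checks) never dispatches
  # through the EXLA runtime.
  {Nx.backend_copy(q, Nx.BinaryBackend),
   Nx.backend_copy(p, Nx.BinaryBackend),
   Nx.backend_copy(logp, Nx.BinaryBackend),
   Nx.backend_copy(grad, Nx.BinaryBackend)}
end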

The lesson: In a mixed-runtime PPL, the placement of the JIT boundary determines performance more than any algorithmic choice.

Phase 2: Batched Leapfrog (23× → 19×)

Instead of one JIT call per leapfrog step, batch an entire subtree's trajectory into a single XLA while loop. For 64 steps, that is 16,640 μs of individual calls vs 408 μs batched (40.8×).
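
A sketch of the loop's shape, with a stand-in standard-normal target and illustrative names (the real version also records every intermediate state for the tree builder):

defmodule BatchedLeapfrog do
  import Nx.Defn

  # Stand-in for the model's log-density.
  defn log_prob(q), do: -Nx.sum(q * q) / 2

  defn grad_log_prob(q), do: grad(q, &log_prob/1)

  # n leapfrog steps fused into one XLA while loop: one JIT dispatch
  # instead of n.
  defn leapfrog_n(q, p, eps, n) do
    {_i, q, p} =
      while {i = 0, q, p}, i < n do
        p = p + eps / 2 * grad_log_prob(q)
        q = q + eps * p
        p = p + eps / 2 * grad_log_prob(q)
        {i + 1, q, p}
      end

    {q, p}
  end
end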

But integration was treacherous. The first approach — pre-compute all states, build a merge tree bottom-up — looked correct. Per-call verification: 0/50 mismatches. But chained over 1000 iterations, ESS collapsed from 78 to 12. A 6.5× quality regression from a "correct" optimization.

The root cause: a bottom-up merge tree and the top-down recursive build_subtree consume random numbers in a different order. The fix: pre-compute the leapfrog states, but inject them into the exact same recursive structure via a cached step function and an Erlang :atomics counter.
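
In sketch form, with illustrative names:

# Pre-computed leapfrog states, in integration order, from the batched call.
states_by_index = states |> Enum.with_index() |> Map.new(fn {s, i} -> {i, s} end)
counter = :atomics.new(1, signed: false)

cached_step_fn = fn _state, _direction ->
  # Hand out the next pre-computed state. The recursion now consumes
  # states (and random numbers) in exactly its original order.
  i = :atomics.add_get(counter, 1, 1)
  Map.fetch!(states_by_index, i - 1)
end

build_subtree then runs unchanged, with cached_step_fn standing in for the real integrator.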

The lesson: When parallelizing a sequential recursive algorithm, the original recursive structure must be preserved exactly.

Phase 3: The Parameterization Discovery (19× → 2.3×)

This was the biggest single improvement, and it wasn't a speed optimization at all.

eXMC had automatic non-centered parameterization (NCP). For our well-identified benchmark model, NCP was actively harmful — it destroyed sigma_obs ESS while making alpha_raw perfect. Centered parameterization gave 9× better ESS/s.

Parameterization         ESS/s    Step size    sigma_obs ESS
NCP (automatic)          5.5      0.034        4–158
Centered (ncp: false)    49.1     0.46         141–893
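
The difference, written as the two log-density contributions for the group effect (a sketch with normalization constants dropped; these function names are illustrative, not eXMC's model DSL):

import Nx.Defn

# Centered: the sampler moves alpha directly. Good geometry when the
# data identify alpha well, as in this benchmark.
defn centered_logp(alpha, mu, tau) do
  -Nx.sum((alpha - mu) * (alpha - mu)) / (2 * tau * tau)
end

# Non-centered: the sampler moves alpha_raw ~ Normal(0, 1) and the model
# reconstructs alpha = mu + tau * alpha_raw. Essential in funnel-shaped
# posteriors, harmful here.
defn noncentered_logp(alpha_raw) do
  -Nx.sum(alpha_raw * alpha_raw) / 2
end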

The lesson: Before optimizing speed, check that you're solving the right problem. A 9× ESS improvement from parameterization dwarfs months of micro-optimization work.

Phase 4: Adaptation Quality (2.3× → 1.5×)

Three Stan practices, found only in source code, never in papers (the first two are sketched after the list):

Practice 1: Per-window Welford reset. Stan resets the mass matrix estimator at each adaptation window boundary. Impact: +30% medium ESS/s.

Practice 2: Divergent sample exclusion. Stan excludes divergent transitions from mass matrix estimation. Impact: +109% medium, +180% stress.

Practice 3: term_buffer = 50, not 200. The extra 150 iterations wasted on Phase III could have been Phase II mass matrix adaptation — +43% more samples for the final window.
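
A sketch of the first two practices combined: a Welford estimator that is rebuilt at each window boundary and never sees divergent draws (module and function names are illustrative):

defmodule Welford do
  defstruct n: 0, mean: [], m2: []

  # Practice 1: call at every adaptation-window boundary.
  def reset(dim) do
    %__MODULE__{n: 0, mean: List.duplicate(0.0, dim), m2: List.duplicate(0.0, dim)}
  end

  # Practice 2: divergent transitions never touch the estimator.
  def update(acc, _sample, _divergent? = true), do: acc

  def update(%__MODULE__{n: n, mean: mean, m2: m2}, sample, false) do
    n = n + 1
    delta = Enum.zip_with(sample, mean, &-/2)
    mean = Enum.zip_with(mean, delta, fn m, d -> m + d / n end)
    delta2 = Enum.zip_with(sample, mean, &-/2)
    m2 = Enum.zip_with([m2, delta, delta2], fn [s, d, d2] -> s + d * d2 end)
    %__MODULE__{n: n, mean: mean, m2: m2}
  end

  # Diagonal mass-matrix estimate from the current window only.
  def variance(%__MODULE__{n: n, m2: m2}) when n > 1 do
    Enum.map(m2, &(&1 / (n - 1)))
  end
end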

The lesson: PPL performance depends on implementation folklore, not just algorithmic specification. Across the whole project, five undocumented practices collectively account for a 10× performance difference.

Phase 5: The Rust NIF Experiment

Three granularity levels were attempted:

Strategy                      Scope                           Result
Replace entire outer loop     All tree logic in Rust          0.86× (slower)
Replace inner subtree only    Merge logic in Rust             1.5×
Replace full tree build       Everything after JIT in Rust    1.67× simple, 0.47× medium

The outer-loop NIF failed because each iteration crossed both the Elixir↔XLA and Elixir↔Rust boundaries. The double crossing added more overhead than Rust saved.

The lesson: FFI boundary granularity matters more than FFI language choice.

Phase 6: The U-Turn Criterion

The stress model (200× range in inverse mass matrix) stubbornly lagged. The endpoint U-turn criterion was dominated by the single highest-variance component. Betancourt's generalized criterion — tracking cumulative momentum ρ across the trajectory — removed the bias: +33% stress ESS/s, deeper trees, larger step sizes.
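
A sketch of the check, assuming a diagonal inverse mass matrix and illustrative names:

defp u_turn?(rho, p_begin, p_end, inv_mass) do
  # Velocities at the trajectory ends: v = M⁻¹p.
  v_begin = Nx.multiply(inv_mass, p_begin)
  v_end = Nx.multiply(inv_mass, p_end)

  # Terminate when the trajectory stops expanding in either direction,
  # judged against the cumulative momentum rho rather than endpoints alone.
  Nx.to_number(Nx.dot(rho, v_begin)) <= 0 or
    Nx.to_number(Nx.dot(rho, v_end)) <= 0
end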

Sub-trajectory U-turn checks (another undocumented Stan practice — 3 checks per merge instead of 1) gave another +46% on medium.

Phase 7: The Multinomial Bugs (The Breakthrough)

Two bugs hiding in plain sight since the beginning. Both produced valid posteriors. Both passed every test. Together, they inflated the duplicate sample rate from 7.8% (PyMC) to 37.7%.

Bug 1: Capped log-weights. Trajectory point weights were capped at exp(0) = 1. Points with better energy were systematically underweighted.

Bug 2: Wrong outer merge formula. Stan uses biased progressive sampling for outer merges but balanced multinomial for inner merges. eXMC used balanced for both, making the starting point "sticky" — selected 25% of the time at depth 2 instead of PyMC's 3.7%.
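
The two rules after the fix, in sketch form (illustrative helper names; log-weights are no longer capped):

# Inner merge (inside build_subtree): balanced multinomial.
# P(take the new side) = w_new / (w_old + w_new).
defp inner_accept_prob(log_w_old, log_w_new) do
  :math.exp(log_w_new - logsumexp(log_w_old, log_w_new))
end

# Outer merge (top-level doubling): biased progressive sampling.
# P(take the new subtree) = min(1, w_new / w_old), which favors the new
# half and keeps the starting point from becoming "sticky".
defp outer_accept_prob(log_w_old, log_w_new) do
  min(1.0, :math.exp(log_w_new - log_w_old))
end

defp logsumexp(a, b) do
  m = max(a, b)
  m + :math.log(:math.exp(a - m) + :math.exp(b - m))
end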

Duplicate rate: 37.7% → 6.5%

Simple: 233 → 469 ESS/s (+101%)    0.81× PyMC
Medium: 116 → 298 ESS/s (+157%)    1.90× PyMC  ← beats PyMC
Stress: 107 → 215 ESS/s (+101%)    1.16× PyMC  ← beats PyMC

The lesson: MCMC correctness is robust — wrong weights still produce valid posteriors. But efficiency is fragile. A 37.7% duplicate rate means over a third of your compute produces no new information.

The Full Trajectory

Phase   Change                           Medium ESS/s    Gap vs PyMC
0       Baseline                         3.3             34×
1       JIT boundary fix + micro-opts    4.9             23×
2       Batched leapfrog                 4.6             25×
3       Centered parameterization        49.1            2.3×
4       Stan adaptation practices        116             ~1.0×
5       Rust NIF (inner subtree)         ~120            ~0.9×
6       Rho U-turn + sub-trajectory      ~120            ~1.0×
7       Multinomial fixes                298             1.90× faster

90× improvement from Phase 0 to Phase 7. No change to the core NUTS algorithm. Every improvement was about understanding where the work happens and why.

What This Teaches About PPL Design

1. JIT boundary placement is a first-class design concern. This problem doesn't exist if your entire PPL is in one language. It's the central performance challenge of mixed-runtime PPLs.

2. Implementation folklore is load-bearing. Five undocumented practices, none in any paper, collectively account for a 10× performance difference. Publishing algorithms without implementation details is publishing half the story.

3. Correctness and efficiency are orthogonal. MCMC is uniquely forgiving — wrong weights, wrong merge formulas, wrong U-turn criteria all produce valid posteriors. PPL development needs explicit efficiency diagnostics: duplicate rate, per-parameter ESS, index-in-trajectory distributions. A duplicate-rate sketch follows this list.

4. The biggest wins aren't speed optimizations. Parameterization (9×), multinomial correctness (2–3×), and adaptation quality (3×) all improve how much information each sample contains. Per-step speed improvements were smaller, harder, and less generalizable.

5. Three languages is one too many — unless you manage the boundaries. eXMC's Elixir/Rust/XLA stack requires managing two FFI boundaries. The optimal architecture: JIT the gradient (XLA), interpret the tree (Elixir), and optionally accelerate the inner subtree (Rust) — but never let a single operation cross two boundaries.
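
For instance, the duplicate-rate check behind Phase 7 is only a few lines (illustrative name):

defp duplicate_rate(draws) do
  transitions = Enum.zip(draws, tl(draws))
  Enum.count(transitions, fn {a, b} -> a == b end) / length(transitions)
end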


The gap between 34× slower and 1.9× faster wasn't bridged by making Elixir fast. It was bridged by understanding what "fast" means in a mixed-runtime system: put computation on the right side of the boundary, use the right parameterization, match the undocumented practices of mature implementations, and make sure your multinomial sampling actually works.

The BEAM gave us streaming, fault tolerance, and distribution for free. The price was figuring out where the JIT boundary should be. That price has been paid.


The thesis — "Probabilistic Programming on BEAM Process Runtimes" — contains the full technical analysis with reproducible benchmarks, cost models, and 52 documented architectural decisions. The Amber Trace tells the literary version of this story.