The Five Hundred Iterations You Run Twice

March 2026 · Part 5 · Performance · Streaming

This is Part 5 of "What If Probabilistic Programming Were Different?": on warm-starting NUTS for streaming inference, and on what the BEAM makes trivial that other runtimes make hard.


Consider a monitoring system that tracks 120 industrial sensors every twenty minutes. Each sensor runs its own Bayesian model — a state-space specification with 8 parameters, fit via NUTS on a rolling window of 200 observations. The posterior feeds an anomaly detector: when the model's predictive distribution can't explain the latest reading, something has changed.

This happens on 44 concurrent CPU workers, around the clock. Each NUTS run: 500 warmup iterations to adapt the mass matrix and step size, then 200 sampling iterations to draw from the posterior.

The number that should have bothered us earlier: 500.

The Waste

Five hundred warmup iterations — for a model that ran twenty minutes ago, on the same sensor, with nearly identical data. The rolling window shifted by one reading. The posterior moved by a fraction of a percent. And we were spending 500 iterations rediscovering a mass matrix that was, for all practical purposes, the same one we'd just thrown away.

What Warmup Actually Does

NUTS warmup has three jobs. The first fifty iterations find a reasonable step size — the leapfrog integrator's stride length. The next several hundred iterations estimate the mass matrix — a diagonal scaling that accounts for the posterior's different variances along different dimensions. The final fifty iterations fine-tune the step size for the newly estimated mass matrix.

The mass matrix is the expensive part. It requires accumulating sufficient statistics (Welford online variance) across hundreds of gradient evaluations, each of which runs through the XLA JIT compiler. For a d=8 model, each warmup iteration costs about 4 milliseconds. Five hundred warmup iterations: two full seconds, burned before drawing a single useful sample.
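The Welford accumulation at the heart of mass-matrix adaptation is small; the cost comes from the hundreds of gradient evaluations feeding it. A language-agnostic sketch in Python (eXMC's actual implementation is not shown here; function and variable names are illustrative):

```python
# Welford's online algorithm: one pass over the warmup draws yields the
# per-dimension posterior variance used as the diagonal (inverse) mass matrix.
import numpy as np

def welford_diag_variance(draws):
    """Accumulate mean and sum of squared deviations per dimension."""
    count = 0
    mean = np.zeros(draws.shape[1])
    m2 = np.zeros(draws.shape[1])
    for x in draws:                  # one gradient-evaluated draw per iteration
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)     # uses the *updated* mean
    return m2 / (count - 1)          # unbiased per-dimension variance
```

Dimensions with large posterior variance get proportionally longer leapfrog strides; that is the whole job of the diagonal mass matrix.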

The mass matrix from twenty minutes ago was sitting right there, in the process's state. Nobody was using it.

The Observation

The mass matrix captures the posterior geometry: which parameters vary a lot and which are tightly constrained. When the data changes by one reading out of two hundred, the posterior geometry barely shifts. The mass matrix from the previous run is not merely a good starting point — it is, within numerical tolerance, the correct answer.

The only thing warmup needs to do, when given a previous mass matrix, is a short fine-tuning pass: fifty iterations of dual averaging to nudge the step size for whatever minor changes the new data introduced.
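That fine-tuning pass is dual averaging on the log step size, steering it toward a target acceptance rate. A minimal Python sketch under the standard NUTS formulation (constants and names follow the usual presentation, not necessarily eXMC's internals):

```python
import math

def dual_averaging_finetune(initial_step, accept_probs, target=0.8,
                            gamma=0.05, t0=10, kappa=0.75):
    """Nudge log(step_size) toward a target acceptance probability.
    accept_probs: one observed acceptance statistic per fine-tune iteration."""
    mu = math.log(10 * initial_step)   # shrinkage point above the init
    log_step = math.log(initial_step)
    log_step_bar = 0.0
    h_bar = 0.0
    for m, a in enumerate(accept_probs, start=1):
        h_bar += (target - a - h_bar) / (m + t0)      # running accept error
        log_step = mu - math.sqrt(m) / gamma * h_bar  # primal iterate
        eta = m ** -kappa
        log_step_bar = eta * log_step + (1 - eta) * log_step_bar
    return math.exp(log_step_bar)      # averaged iterate is the final step size
```

Fifty iterations of this loop are cheap next to 450 iterations of mass-matrix estimation: each step adds a handful of scalar updates on top of a single NUTS transition.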

The Fix

One keyword argument:

{trace, stats} = Sampler.sample(ir, init,
  num_warmup: 50,
  num_samples: 200,
  warm_start: previous_stats
)

When warm_start is provided — a map containing step_size and inv_mass_diag from the previous run — the sampler skips mass matrix initialization, skips the step-size search, and runs only 50 iterations of fine-tuning instead of the full 500.
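The branch itself is simple. A hedged sketch of the control flow in Python (constants from the text; the dictionary keys and function name are hypothetical, not eXMC's internals):

```python
FULL_WARMUP = 500      # cold path: step-size search + mass matrix + fine-tune
FINETUNE_WARMUP = 50   # warm path: dual-averaging fine-tune only

def plan_warmup(warm_start=None):
    """Choose the adaptation schedule for a sampling round."""
    if warm_start is None:
        # Cold path: estimate everything from scratch.
        return {"inv_mass_diag": None, "step_size": None,
                "iterations": FULL_WARMUP, "adapt_mass": True}
    # Warm path: reuse the previous geometry; only nudge the step size.
    return {"inv_mass_diag": warm_start["inv_mass_diag"],
            "step_size": warm_start["step_size"],
            "iterations": FINETUNE_WARMUP, "adapt_mass": False}
```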

The monitoring process already stores stats from the last update. The change to pass it forward was three lines.

The Numbers

                    Cold start    Warm start
Warmup iterations   500           50
Wall time           1,979 ms      339 ms
Step size           0.828         0.749
Speedup                           5.8x

The step sizes converge to neighboring values — 0.83 versus 0.75 — confirming that the previous mass matrix needed only minor adjustment.

For the full system: 120 sensors, each saving 1.6 seconds per update cycle. Over a 24-hour monitoring day (72 update cycles): 120 × 72 × 1.6 = 13,824 seconds ≈ 3.8 hours of sampling time saved per day. Spread across 44 concurrent workers, that is roughly 5 minutes of wall time freed up per day — headroom for more sensors or faster anomaly response.
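The arithmetic, spelled out (using the rounded 1.6 s per-update saving from the measurements above):

```python
sensors = 120
cycles_per_day = 24 * 60 // 20          # one update every twenty minutes -> 72
saved_per_update_s = 1.6                # cold-start minus warm-start wall time
workers = 44

cpu_seconds_saved = sensors * cycles_per_day * saved_per_update_s
hours_saved = cpu_seconds_saved / 3600
wall_minutes_freed = cpu_seconds_saved / workers / 60  # spread across workers

print(round(cpu_seconds_saved))  # → 13824
```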

What the BEAM Made Easy

The warm-start pattern is trivial on the BEAM because the GenServer state persists between sampling rounds. The previous stats map lives in the monitoring process's heap — no serialization, no Redis, no filesystem checkpoint. It is simply there, in memory, from the last time the process ran Sampler.sample.

In Python, achieving the same thing requires either pickling the mass matrix to disk between invocations, or maintaining a long-lived process with mutable state and explicit lifecycle management. In Stan, there is no mechanism at all — each chain starts from the identity mass matrix, every time.

The BEAM's contribution is not the algorithm (any sampler could accept a warm-start parameter) but the zero-cost persistence: the GenServer's state is the warm-start cache, supervised, garbage-collected, and available without a single line of serialization code.

When It Applies

Warm-start is useful whenever:

- the same model is refit repeatedly on slowly changing data (rolling windows, streaming updates)
- the posterior geometry is stable between runs, so the previous mass matrix still describes it
- the previous run's adaptation stats are cheap to keep around (here, a small map in process state)

It does not help when:

- it is the first sampling round for a sensor, since there is nothing to warm-start from
- the data or the model changes abruptly, so the old mass matrix no longer matches the new posterior geometry

The system handles this automatically: the first sampling round for each sensor runs full warmup. Every subsequent round checks for previous stats and takes the warm path if available.

The Lesson

Not every warmup is warmup. When the posterior moves slowly — and in a streaming system with rolling windows, it almost always does — the previous run's adaptation is the next run's initialization. The 500 iterations we were burning were not exploring unknown territory. They were retracing steps, at four milliseconds each, to arrive at a mass matrix indistinguishable from the one they started with.

The fix was not clever. It was the absence of waste.


eXMC is an open-source probabilistic programming framework for Elixir. The warm-start feature is available via the warm_start: option in Sampler.sample/3. Source: github.com/borodark/eXMC