On the improbable and slightly reckless decision to build a probabilistic programming framework on a virtual machine designed for telephone switches
I. Against the GIL
It is a peculiar feature of the Python programming language that it cannot do two things at once. Not metaphorically. Literally: the Global Interpreter Lock ensures that no matter how many cores your machine possesses, no matter how embarrassingly parallel your workload, the interpreter will execute one thread at a time, politely queuing the rest.
This would be merely an annoyance if Python had not become the lingua franca of scientific computing. PyMC — the finest probabilistic programming framework extant — must run its Markov chains in this environment. Four chains means four operating system processes. Four copies of the model in memory. Four separate JIT compilations. Four interpreters, each sealed behind its own GIL, communicating through serialized pickle streams.
The BEAM virtual machine takes the opposite view. Concurrency is not a feature to be bolted on. It is the substrate. Lightweight processes — millions of them — each own a private heap, communicate through immutable messages, and are scheduled preemptively across all available cores. Four MCMC chains are four processes sharing one compiled model, dispatched with a single call to Task.async_stream.
The question was whether the mathematics could be made to follow the runtime. Whether a virtual machine designed for telephone switches could be taught Hamiltonian Monte Carlo.
II. Memoranda
Every non-trivial project is a sequence of bets, and the honest thing to do is write them down.
eXMC's DECISIONS.md grew to fifty-two entries. Each records a decision, a rationale, an assumption that must hold for the decision to remain valid, and the implications if it does not. Writing decisions down has the inconvenient property of making them falsifiable, which is precisely why one should do it.
Decision one: use Nx, Elixir's tensor library, as the numerical backbone. Decision two: represent models as a small intermediate representation before compiling them to differentiable closures. Decision four: distributions declare their own constraint transforms. The user is spared the medieval torture of manually applying Jacobian corrections.
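What a constraint transform buys the user can be shown in a few lines. The sketch below (module name assumed, not eXMC's actual API) puts a positive parameter sigma into unconstrained space via y = log(sigma); the change of variables adds the log-Jacobian log|d sigma / d y| = y, and a numeric integration confirms the transformed density still integrates to one.

```elixir
# Illustrative sketch of a :log constraint transform (names assumed).
defmodule LogTransform do
  # Exponential(1) log-density in constrained space (sigma > 0)
  def logp_sigma(sigma) when sigma > 0, do: -sigma

  # Same density in unconstrained space y = log(sigma); the Jacobian
  # correction log|d sigma / d y| = y is applied automatically.
  def logp_y(y), do: logp_sigma(:math.exp(y)) + y
end

# Sanity check: the transformed density integrates to one over y.
step = 0.001

total =
  Stream.iterate(-15.0, &(&1 + step))
  |> Stream.take_while(&(&1 <= 10.0))
  |> Enum.reduce(0.0, fn y, acc ->
    acc + :math.exp(LogTransform.logp_y(y)) * step
  end)
```

Forgetting the `+ y` term makes the integral come out wrong, which is exactly the error the transform machinery exists to prevent.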
The most instructive entry is not a decision that held, but one that didn't. Decision twelve declared that observation log-probability terms could be computed eagerly at compile time. Observed values are constants, the reasoning went, so their contribution is fixed. Elegant. Efficient. Wrong.
Decision twenty-two introduced hierarchical parameter references — string values like "mu" that point to other random variables. The observation term was no longer fixed. Decision twelve's note reads: "Partially superseded by D22." This is the most valuable sentence in the entire document. Not because it describes a failure, but because it records the failure honestly.
Fifty-two entries. Not a manifesto, but a memorandum — things that must be remembered, including the things we got wrong.
III. One Thousand Times Too Slow
The NUTS sampler requires random numbers. The natural approach in Nx is to use Nx.Random, which provides a pure functional PRNG in the style of JAX: generate a key, split it, draw from it. Deterministic, composable, beautiful.
On BinaryBackend, each call to Nx.Random.split took one second.
The operation triggered Nx's defn tracing machinery — the system that analyzes computation graphs for potential JIT compilation — which discovered there was no JIT compiler available and proceeded to interpret the operation at the speed of molasses flowing uphill in January. A sampler with six hundred iterations and five random draws per step would require approximately three thousand seconds. Nearly an hour. For a single parameter.
The solution: abandon Nx.Random. Use Erlang's built-in :rand module instead. Microseconds per call, as God intended.
rng = :rand.seed_s(:exsss, seed)    # seed the exsss generator; pure state, no process dictionary
{value, rng} = :rand.normal_s(rng)  # one standard-normal draw plus the next state
The lesson is not that Nx.Random is badly designed. It is well designed, for EXLA. Running a JIT-oriented library on an interpreter is not a graceful degradation. It is a thousand-fold performance cliff.
IV. Two Feedback Loops
The sampler's first successful run produced samples from a standard Normal distribution. Mean: approximately zero. Variance: approximately 1.5. Wrong by fifty percent, consistently. Not stochastic noise — systematic bias.
The culprit was found in the NUTS tree builder: when extending a subtree, the code used the gradient at the proposal position rather than the gradient at the endpoint. The Hamiltonian dynamics became subtly wrong — not wrong enough to crash, but wrong enough to inflate the posterior variance by fifty percent.
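The bug is easiest to see in a single leapfrog step. A minimal 1-D sketch (names illustrative, not the actual tree-builder code): the final half-kick must use the gradient at the new position, and using the gradient at the starting position is precisely the mistake described above.

```elixir
defmodule Leapfrog do
  # One leapfrog step for 1-D Hamiltonian dynamics.
  # grad is a function returning the gradient of the log-density.
  def step(q, p, eps, grad) do
    p_half = p + eps / 2 * grad.(q)           # half-kick at current position
    q_new = q + eps * p_half                  # full position update
    p_new = p_half + eps / 2 * grad.(q_new)   # half-kick at the NEW position
    {q_new, p_new}
  end
end

# For a standard Normal target, logp(q) = -q^2/2, so grad.(q) = -q.
# The Hamiltonian q^2/2 + p^2/2 should be conserved to O(eps^2).
grad = fn q -> -q end

{q, p} =
  Enum.reduce(1..1000, {1.0, 0.0}, fn _, {q, p} ->
    Leapfrog.step(q, p, 0.01, grad)
  end)

h = q * q / 2 + p * p / 2
```

Substituting `grad.(q)` for `grad.(q_new)` in the last half-kick breaks this conservation check, which is how a one-line gradient mix-up becomes a fifty-percent variance inflation.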
The second feedback-loop bug was more elegant in its stupidity. The dual averaging algorithm adapted the step size. It observed acceptance rates. It computed smoothed estimates. It produced excellent step sizes. Nobody was using them. The updated epsilon was never fed back into the next step. The adaptation machinery was running in perfect isolation, producing correct answers that vanished into the void.
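The adaptation scheme in question is dual averaging in the style of Hoffman and Gelman's NUTS paper. A minimal sketch with the standard constants (the module is illustrative, not eXMC's code); the entire point is the first element of the return tuple, which the caller must thread back into the next leapfrog step:

```elixir
defmodule DualAveraging do
  @target 0.8    # desired acceptance rate
  @gamma 0.05
  @t0 10.0
  @kappa 0.75

  def init(eps0) do
    %{mu: :math.log(10 * eps0), log_eps: :math.log(eps0),
      log_eps_bar: 0.0, h_bar: 0.0, t: 0}
  end

  # Returns {eps_for_next_step, new_state}. The bug described above:
  # computing this and then never using eps_for_next_step.
  def update(state, accept_prob) do
    t = state.t + 1
    eta = 1 / (t + @t0)
    h_bar = (1 - eta) * state.h_bar + eta * (@target - accept_prob)
    log_eps = state.mu - :math.sqrt(t) / @gamma * h_bar
    w = :math.pow(t, -@kappa)
    log_eps_bar = w * log_eps + (1 - w) * state.log_eps_bar
    {:math.exp(log_eps),
     %{state | log_eps: log_eps, log_eps_bar: log_eps_bar, h_bar: h_bar, t: t}}
  end
end

# Acceptance consistently above target should drive the step size up.
state = DualAveraging.init(1.0)

{eps, _state} =
  Enum.reduce(1..50, {1.0, state}, fn _, {_eps, s} ->
    DualAveraging.update(s, 0.99)
  end)
```

Symmetrically, feeding acceptance rates below the target drives the step size down. None of that matters if the returned epsilon vanishes into the void.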
In both cases, every component was doing its job. The failure was in the handoff. One is reminded of institutions that produce excellent policy papers which are then filed, unread, in the appropriate cabinet.
V. The Textbook Lied
There are two formulas in the standard numerical computing curriculum that do not work, and eXMC discovered both the hard way.
The first: log1p(exp(x)), the textbook softplus. For large x, exp(x) overflows to infinity. The stable replacement: max(x, 0) + log1p(exp(-|x|)). Mathematically identical. Numerically indestructible.
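Both forms can be sketched in plain Erlang floats, which raise ArithmeticError on overflow rather than returning infinity (module name hypothetical; Erlang's :math has no log1p, so plain log(1 + e) stands in, which is adequate here because e never exceeds 1):

```elixir
defmodule Softplus do
  # Textbook form: exp(x) overflows for large x, and the BEAM raises
  # ArithmeticError rather than returning infinity.
  def naive(x), do: :math.log(1 + :math.exp(x))

  # Stable form: exp/1 only ever sees a non-positive argument.
  def stable(x), do: max(x, 0) + :math.log(1 + :math.exp(-abs(x)))
end
```

At x = 1000 the naive form raises; the stable form returns exactly 1000.0, because exp(-1000) underflows harmlessly to zero.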
The second: lgamma, the logarithm of the gamma function. The Lanczos approximation was implemented faithfully. Forward evaluation: flawless. Gradient via automatic differentiation: catastrophic. Differentiating c / (x + i) produces terms that trigger Elixir's Complex.divide in the tails. NaN propagates through the gradient, through the leapfrog step, through the tree, into the sampler.
Two textbook formulas. Both correct in the platonic realm of exact arithmetic. Both wrong on the hardware where the computation actually runs. The stable replacements are not more sophisticated. They are simply honest about the machine they run on.
VI. The Productive Contradiction
Decision fourteen: "The NUTS sampler is implemented with plain Elixir functions and Nx tensor ops, not defn." Decision twenty-four: "Compiler.value_and_grad auto-detects EXLA and wraps the logp closure in EXLA.jit."
A direct contradiction. Both correct.
Decision fourteen is correct about the sampler loop: tree building, U-turn detection, multinomial proposal selection. Branchy, stateful, scalar-heavy code. Decision twenty-four is correct about the gradient: the tensor-heavy, branch-free function that XLA was built to optimize.
Three number systems in one sampler. Erlang floats for adaptation. Nx tensors for the loop. EXLA-compiled closures for the gradient. The document contains its own refutation, and the refutation makes the original decision stronger.
VII. The Geometry of Funnels
Hierarchical Bayesian models have a well-known pathology called funnel geometry. When a parent controls the variance of a child, and the parent is small, the child is pinched into a narrow corridor. The sampler cannot serve both regimes with a single step size.
The standard remedy is non-centered parameterization. In PyMC, the user must apply this by hand. eXMC automates it: the NonCenteredParameterization rewrite pass scans the IR for the exact configuration that produces funnels and rewrites the geometry. The user's model and trace are unchanged. Only the sampler's exploration is transformed.
This is what a rewrite pipeline is for: to apply mathematical knowledge that the user should not be required to possess.
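The transformation itself is one line of algebra: instead of sampling theta from Normal(mu, sigma) directly, sample z from Normal(0, 1) and set theta = mu + sigma * z. The two parameterizations define the same distribution for theta; only the sampler's coordinates change. A quick Monte Carlo check using plain :rand, nothing eXMC-specific:

```elixir
# Non-centered draws of theta = mu + sigma * z, z ~ Normal(0, 1),
# should reproduce the moments of Normal(mu, sigma).
mu = 3.0
sigma = 2.0
n = 100_000
rng = :rand.seed_s(:exsss, 42)

{sum, sum_sq, _rng} =
  Enum.reduce(1..n, {0.0, 0.0, rng}, fn _, {s, sq, r} ->
    {z, r} = :rand.normal_s(r)
    theta = mu + sigma * z        # the non-centered reconstruction
    {s + theta, sq + theta * theta, r}
  end)

mean = sum / n
var = sum_sq / n - mean * mean
```

The sample mean and variance land on mu and sigma squared, confirming that the rewrite changes the geometry the sampler sees without changing the model it answers for.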
VIII. The Most Dangerous Bug
The compiler function resolve_params resolved string parameter references by looking up values in the current value map. The map held unconstrained values. If sigma had a :log transform, the map contained log(sigma), not sigma.
For simple models near sigma = 1.0, where log(1) = 0, the error was small. The sampler converged to a posterior that was slightly wrong, in a way that required careful analysis to detect. The numbers looked reasonable. The chains mixed. The integration test passed — with the wrong log-probability, producing a wrong posterior that happened to fall within generous tolerances.
For hierarchical models with five free parameters, the error was apocalyptic. Step-size search collapsed to 1e-141. Every sample diverged.
The gap between unconstrained space and constrained space is an abstraction boundary, and abstraction boundaries are precisely where errors hide. The answers that are almost right are far more dangerous than the answers that are obviously wrong.
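A hypothetical sketch of the two resolution strategies (resolve_params is the real function name; everything else here is assumed for illustration). The value map stores unconstrained values, so a :log-transformed sigma must be mapped back through exp/1 before anything uses it as a scale:

```elixir
defmodule Resolve do
  # BUGGY: hands back the raw map entry. For a :log-transformed sigma
  # this is log(sigma), not sigma.
  def buggy(values, name, _transforms), do: Map.fetch!(values, name)

  # FIXED: map back to constrained space before the value is used.
  def fixed(values, name, transforms) do
    v = Map.fetch!(values, name)
    case Map.get(transforms, name) do
      :log -> :math.exp(v)
      nil -> v
    end
  end
end

values = %{"sigma" => :math.log(2.5)}
transforms = %{"sigma" => :log}
# Near sigma = 1.0 the buggy value is log(1.0) = 0.0 — close enough to
# produce the plausible-but-wrong posteriors described above.
```

The two functions agree only for untransformed parameters, which is exactly why simple models nearly passed and hierarchical models detonated.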
IX. A Function Called to_number
There is a function in Nx called to_number. Its name is a promise: give it a tensor, receive a number. The promise is, under certain conditions, a lie.
When the tensor contains NaN, it returns the atom :nan. When it contains infinity, the atom :infinity. These are not numbers. Erlang arithmetic on atoms does not return NaN. It crashes the process. If the tree builder calls to_number on a log-probability that happens to be negative infinity — which occurs routinely at constraint boundaries — the process dies without ceremony.
The guards are now permanent fixtures, small sentries posted at every gate where a tensor value crosses into Erlang arithmetic. They are there because a function called to_number sometimes returns atoms.
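The sentry can be a tiny function (name assumed, not eXMC's actual code) that makes the non-finite cases explicit instead of letting an atom reach an arithmetic operator:

```elixir
defmodule SafeNumber do
  # Nx.to_number/1 yields the atoms :nan, :infinity and :neg_infinity for
  # non-finite tensors; Erlang arithmetic on an atom raises badarith.
  def fetch(x) when is_number(x), do: {:ok, x}
  def fetch(x) when x in [:nan, :infinity, :neg_infinity], do: {:error, x}
end
```

At each call site, a `case SafeNumber.fetch(Nx.to_number(t))` replaces the bare conversion, and negative infinity at a constraint boundary becomes a handled branch instead of a dead process.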
X. Four Chains, Zero Locks
Sampler.sample_chains(ir, 4, init_values: init) compiles the model once and dispatches four chains in parallel. Each chain is a BEAM process. Each chain gets its own PRNG state. Each chain produces its own trace.
There is no mutex. No semaphore. No atomic compare-and-swap. No lock-free queue. No threading.Lock(). No multiprocessing.Pool(). No pickle serialization.
On a four-core machine, four chains complete in approximately the time of one. This is not a heroic engineering achievement. It is the natural consequence of running concurrent work on a runtime designed, from the first day of its existence, for exactly this purpose.
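The dispatch pattern is a runnable one-liner plus a stand-in sampler. Here run_chain is a toy substitute for the real NUTS loop; each chain receives its own seed and therefore carries its own independent :rand state:

```elixir
# run_chain: stand-in for the real sampler; draws n standard normals.
run_chain = fn seed, n ->
  rng = :rand.seed_s(:exsss, seed)
  {draws, _rng} = Enum.map_reduce(1..n, rng, fn _, r -> :rand.normal_s(r) end)
  draws
end

# Four chains, four BEAM processes, one shared (immutable) model closure.
traces =
  1..4
  |> Task.async_stream(fn seed -> run_chain.(seed, 1000) end,
       max_concurrency: 4, ordered: true)
  |> Enum.map(fn {:ok, trace} -> trace end)
```

No locks appear anywhere because nothing mutable is shared: the closure is immutable, and each chain's PRNG state lives in its own process.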
XI. Cui Honorem, Honorem
ExmcViz renders through Scenic, Elixir's scene graph library, designed for embedded information displays. The scene graph is a live data structure composited at sixty frames per second through OpenGL via NanoVG. There is no image encoding. There is no file I/O.
The live streaming feature connects the sampler directly to the visualization through GenServer message passing. A StreamCoordinator buffers incoming samples and flushes them to the LiveDashboard scene. The trace plots grow from left to right, amber lines crawling across a black field.
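The coordinator pattern is ordinary GenServer fare. A minimal sketch (names assumed, not ExmcViz's actual implementation): samples are cast in asynchronously, buffered, and handed out in order when the scene asks for a flush.

```elixir
defmodule Coordinator do
  use GenServer

  # Public API: asynchronous push from the sampler, synchronous flush
  # from the rendering side.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)
  def push(pid, sample), do: GenServer.cast(pid, {:sample, sample})
  def flush(pid), do: GenServer.call(pid, :flush)

  @impl true
  def init(_opts), do: {:ok, []}

  @impl true
  def handle_cast({:sample, s}, buffer), do: {:noreply, [s | buffer]}

  @impl true
  def handle_call(:flush, _from, buffer), do: {:reply, Enum.reverse(buffer), []}
end
```

The sampler never blocks on the renderer: casts return immediately, and message ordering between two processes is guaranteed by the BEAM, so samples arrive at the scene in the order they were drawn.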
The color palette — amber on true black — is not a stylistic affectation. On OLED panels, true black means the pixel is off: zero power, zero photon emission. The warm spectrum preserves scotopic vision during late-night sampling sessions.
Three processes, three mailboxes, zero shared state. This is not a special architecture. It is the default architecture of any concurrent Elixir application.
XII. Two Hundred and Thirty-Three Theses
eXMC has 199 tests. ExmcViz has 34 more. 233 total. Fifty-two architectural decisions, recorded and falsifiable, including the ones that were wrong. Sixteen distributions. Four inference methods. Two textbook formulas replaced with stable alternatives. One function that lies about its return type, guarded at every call site. Three number systems in one sampler. Two feedback loops that computed correct answers and delivered them to nobody. One bug that produced plausible wrong answers for simple models and catastrophic wrong answers for complex ones. One rewrite pass that eliminates funnel geometry without the user's knowledge or consent.
The project works. It works on a runtime that was never intended for this purpose, in a language that has no history in numerical computing, using a tensor library that did not exist five years ago. The posterior distributions it computes are correct. The parallel chains it dispatches are fast. The visualizations it renders are immediate and live. The BEAM has been taught to do Bayesian statistics, and it does them rather well.
Whether anyone asked for this is, I maintain, beside the point. The question is whether the thing is good, and the evidence — two hundred and thirty-three tests of it — suggests that it is.
eXMC: 199 tests, 52 decisions, 16 distributions, 4 layers, one runtime.
ExmcViz: 34 tests, 8 components, 10 chain colors, zero black pixels wasted.
The traces glow amber on black, converging on the truth.
The BEAM does not care that nobody asked.