The nx_vulkan test suite finished in eleven seconds.
152 tests, 0 failures. Five seconds later, the eXMC
test suite started against the same backend and began emitting
hierarchical models — Radon, Eight Schools, Centred and
Non-Centred parameterisations — each one drawing posterior
samples on an RTX 3060 Ti through a Vulkan compute pipeline that
two months ago did not exist. The Radon model finished in a
hundred and four seconds. The eXMC suite kept going. Nothing
asked for CUDA. Nothing asked permission.
In another window, the merged Bastille adapter sat at
daea21a on zed's main branch, 540 lines of Elixir that
turn the FreeBSD jail manager’s soft CLI contract into a
typed API a deploy tool can rely on. In a third window, the
research/fast-kernels-applicability branch held the
note explaining why Emily’s named-kernel pattern, despite
being architecturally elegant, is the wrong choice for our
workload class — and how we know.
None of those three things, by itself, is the milestone. The milestone is what they add up to.
## The mountain
For the last decade, GPU machine learning on the BEAM has meant
exactly one thing: EXLA. Which means
XLA. Which means, on a GPU, CUDA. The
chain of dependencies is not optional. Anyone who has tried to
run an Nx workload on a host without a green NVIDIA sticker
knows how short the conversation is.
That chain has consequences. It means BEAM ML, in practice, runs on Linux. It means macOS users got Apple’s MPS backend eventually, but only because Apple wrote it. It means FreeBSD users got nothing — not because Vulkan or OpenCL drivers are missing on FreeBSD, but because EXLA presupposes the entire NVIDIA proprietary stack, and that stack has never been a first-class FreeBSD citizen and is unlikely to become one. CUDA is a moat the size of a continent.
The default response to a moat is to swim. Pick Linux. Bring a Docker image. Pretend the rest of UNIX-the-tradition does not exist. This is what almost everyone does, and it is sensible because the alternative requires writing a compute backend.
We wrote the compute backend.
## What “walkable” means in concrete terms
A walkable path means a particular sentence is now true. The
sentence is: I can write a probabilistic model in Elixir on
my laptop, ship it as a release tarball, drop it into a FreeBSD
jail managed by zed, and have it sample on
the host’s GPU using only Vulkan drivers and no CUDA.
That sentence has nine clauses. In February, three of them did
not work. Today they all do.
| Clause | Status in February | Status today |
|---|---|---|
| Probabilistic model in Elixir | eXMC, working | eXMC, working |
| Sample with NUTS | Working on EXLA, slow on BinaryBackend | Working on EXLA, working on Nx.Vulkan, working on EMLX |
| Ship as a release tarball | Burrito, working | Burrito, working |
| Land in a FreeBSD jail | Manual via shell scripts | Declarative via zed converge |
| Jail manager talks to bastille | Did not exist | 540-line adapter; 5/0 live integration tests |
| Use the host’s GPU | EXLA + CUDA only | Nx.Vulkan.Backend; pluggable JIT |
| No CUDA dependency | Unreachable | Reachable; Vulkan-only path proven |
| Same Elixir code on Linux/macOS/FreeBSD | Linux only for GPU | Cross-platform, JIT picked at config time |
| Production-grade quantitative measurements | Trial running, no Vulkan participation | 2.65x wall-time speedup on Vulkan at d=22; ESS/s parity at small d |
Each row is the punchline of a different month. Most of them have their own blog post. What none of those posts said is what the table says collectively: the work assembled itself. We did not set out to build a vertically integrated FreeBSD-BEAM-GPU stack. We set out to build several independent pieces, each because it was the next obvious thing. They turn out to compose.
## The detour we paid for
The path was not straight. Six weeks ago we read Emily —
the elegant, 141-line compiler that ships specialised fused
kernels through Nx.Defn.Expr.optional. Their
pattern is, frankly, lovely. Each named kernel is fifteen lines.
The compiler is so small you could put it in a tweet. We
refactored eXMC’s leapfrog to use it, expecting the kind
of speedup the architecture promises.
The leapfrog became nine times slower. Each
Expr.optional indirection cost roughly seven
hundred microseconds — a function_exported? lookup, dynamic
dispatch, Rustler resource decode, shape validation. Six
elementwise ops in a NUTS body times seven hundred
microseconds is forty-two hundred microseconds. The
microbenchmark reported four thousand three hundred. Emily
was not wrong. Emily was the wrong tool for vectors of size
eight, which is what an MCMC sampler at typical Bayesian
dimensions actually deals with.
The IR walker we had been about to delete — the
Nx.Vulkan.Compiler that detects right-folded
chains and emits one fused dispatch per recognised pattern
— turned out to be the right architecture for our
workload. We kept it. We ship the named-kernel module as an
opt-in API for callers who know their tensor sizes. The
research note explaining why is in the repository, where the
next person who reads Emily and gets excited can find it
before they refactor anything.
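For callers who do know their sizes, the opt-in is direct. A minimal sketch, assuming Nx.Vulkan.Fast.normal_logpdf takes the value, location, and scale tensors in that order (the module and function name come from the research note; the arity is a guess):

```elixir
# Hedged sketch: Nx.Vulkan.Fast.normal_logpdf is the opt-in named
# kernel; the (x, mu, sigma) argument order is an assumption.
x = Nx.iota({4096}, type: :f32)
mu = Nx.broadcast(0.0, {4096})
sigma = Nx.broadcast(1.0, {4096})

# Callers who know their tensors are large call the named kernel
# directly; everyone else gets the IR walker's fused dispatch.
logp = Nx.Vulkan.Fast.normal_logpdf(x, mu, sigma)
```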
The lesson is short and worth saying once: before adopting an architectural pattern from another project, write the smallest microbenchmark that compares the pattern’s per-call cost to the existing path on representative inputs. The break-even depends on workload, not philosophy.
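As a sketch of what that smallest microbenchmark can look like: a generic per-call timer in plain Elixir plus Nx, with the dispatch under test passed in as a function. The harness below is illustrative, not eXMC's actual benchmark.

```elixir
# Generic per-call cost probe. Pass the dispatch under test as a
# function; compare the pattern's path to the existing path on a
# representative input.
defmodule PerCallBench do
  def us_per_call(fun, input, iters \\ 1_000) do
    # Warm up once so one-time JIT/NIF setup is not billed to the loop.
    fun.(input)

    {total_us, _result} =
      :timer.tc(fn ->
        Enum.each(1..iters, fn _ -> fun.(input) end)
      end)

    total_us / iters
  end
end

# A representative input for this workload class: a size-8 f32 vector.
x = Nx.iota({8}, type: :f32)
IO.puts("per-call µs: #{PerCallBench.us_per_call(&Nx.exp/1, x)}")
```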
## What zed gets out of this
zed is the declarative deploy tool. A BEAM-only
replacement for the kind of imperative deploy infrastructure
that produces a different bash script per environment. It
targets FreeBSD and illumos because those are the platforms
that have ZFS and jails/zones in the base system. It uses ZFS
user properties as a state store, replacing etcd and consul. It
treats convergence as a four-phase pipeline (diff, plan, apply,
verify) with snapshot-backed rollback that runs in constant
time regardless of dataset size.
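In shape, the loop is small enough to sketch. The module below is illustrative: the function names and the zed:state property name are invented, and only the zfs subcommands are real.

```elixir
# Illustrative sketch of the four-phase converge loop. Module and
# function names are invented; "zed:state" is an assumed property name.
defmodule ConvergeSketch do
  def run(dataset, desired) do
    actual = read_state(dataset)
    plan = diff(desired, actual)

    # Snapshot before applying: `zfs rollback` restores the snapshot
    # in constant time regardless of dataset size.
    {_, 0} = System.cmd("zfs", ["snapshot", "#{dataset}@pre-converge"])

    case apply_plan(plan) do
      :ok ->
        verify(dataset, desired)

      {:error, _} = err ->
        System.cmd("zfs", ["rollback", "#{dataset}@pre-converge"])
        err
    end
  end

  # State lives in a ZFS user property, not etcd or consul.
  defp read_state(dataset) do
    {out, 0} =
      System.cmd("zfs", ["get", "-H", "-o", "value", "zed:state", dataset])

    String.trim(out)
  end

  defp diff(desired, actual), do: %{set: desired, was: actual}
  defp apply_plan(_plan), do: :ok
  defp verify(_dataset, _desired), do: :ok
end
```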
Zed’s value proposition, before this week, had a hole in it. We could declaratively describe a FreeBSD jail. We could provision it. We could deploy a BEAM release into it. But if the release wanted GPU acceleration, the answer was: it doesn’t get any. Half the value of running ML workloads on FreeBSD evaporates if you have to give the GPU back.
The hole is filled now. zed converge against a
FreeBSD host with a Vulkan-capable GPU can deploy an eXMC
service that uses that GPU through Nx.Vulkan. The
configuration is one line: config :exmc, :compiler,
:vulkan. The same release tarball that runs CPU-only on
a host without a GPU runs GPU-accelerated on a host with one,
because the JIT is picked at startup based on what the runtime
finds. Cross-platform compute via Vulkan is what zed always
needed, and now it has it.
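The startup selection is the part worth sketching. The only line the post pins down is the :compiler key; the availability probe below, Nx.Vulkan.available?/0, is a hypothetical name for whatever check the runtime actually performs.

```elixir
# config/runtime.exs -- hedged sketch. The :compiler key is the
# documented part; Nx.Vulkan.available?/0 is a hypothetical probe.
import Config

compiler =
  if Code.ensure_loaded?(Nx.Vulkan) and Nx.Vulkan.available?() do
    :vulkan
  else
    :binary
  end

config :exmc, :compiler, compiler
```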
## What is not yet done
The path is walkable, not paved. There is a list.
- FreeBSD hardware bring-up. Everything described above runs on the Linux dev box, because that is where nx_vulkan's NIF currently builds cleanly. The shaders, compiled by spirit on macOS, are platform-independent SPIR-V. The wiring should work on FreeBSD with the right GPU drivers; nobody has run it there yet. That is the next milestone, and it is the one that lets the sentence in the "walkable" section above stop being a promise and start being a measurement.
- The reduce_scalar mystery. Earlier benchmarks reported eight argument errors against Nx.Vulkan.Native.reduce_scalar/3. The current main passes 152/0. Either the recent fused-chain work incidentally fixed them, or they live in a downstream code path we have not exercised yet. There is a four-mode triage plan; the next test run against eXMC's full suite under the Vulkan compiler will say which.
- Vulkan f64 full-reduce. The single-axis reduction has both an f32 and an f64 NIF. The full-axis reduction has only the f32 path; f64 falls back to host materialisation. For models that need double precision throughout (rare in MCMC; common in finance), this is the gap.
- The SMC-Ex notebook. The Emily-pattern research note predicts that Nx.Vulkan.Fast.normal_logpdf wins for particle filters with high-dimensional observation models. The notebook that empirically locates the crossover — pure Elixir vs. Vulkan IR walker vs. Fast kernel, swept across observation dimension — is planned but not built. It is the empirical companion to the prediction, and the place where "workload-conditional" gets a measured answer instead of an asserted one.
- Hex publish. nx_vulkan is currently a path dep. Until it's on Hex with a real version, "use Vulkan instead of CUDA" is something that requires checking out three repositories and aligning their main branches. That is a ceremony only its authors should have to perform. A minimal deps sketch follows this list.
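Once that publish happens, the difference for a consumer is one line in mix.exs. A sketch, with a hypothetical version number:

```elixir
# In mix.exs: today's ceremony vs. the goal. The "~> 0.1" version
# is hypothetical; nx_vulkan has no Hex release yet.
defp deps do
  [
    # Today: a path dep into a sibling checkout.
    {:nx_vulkan, path: "../nx_vulkan"}
    # After the Hex publish, the goal:
    # {:nx_vulkan, "~> 0.1"}
  ]
end
```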
## Coda
The mountain of CUDA sophistication is still there. NVIDIA still ships the deepest, most heavily optimised compute stack of any hardware vendor. cuBLAS still beats every alternative on dense linear algebra by a margin that is sometimes embarrassing. XLA’s graph-level optimisations are real and we are not going to match them at our scale.
That is fine. The point was never to climb the mountain. The point was to demonstrate that we do not have to. There is a walkable path, through unglamorous Vulkan compute pipelines and unflashy SPIR-V shaders, and it leads to the same place: a working GPU-accelerated probabilistic-programming workload running on a host the BEAM was always supposed to be able to deploy to.
The shady shaders did the work. The path is open.
Repositories: nx_vulkan
(main at d2873bb),
zed
(main includes the Bastille adapter at daea21a),
eXMC at the path-pinned dev workspace. The eXMC test suite has
been running against nx_vulkan main throughout the
writing of this note; nothing has failed yet, and the running
processes that hold port 4000 belong to a live trading trial
that has been up since April 19. The mountain is still there.
We just stopped looking at it.