A Walkable Path Under the Mountain

May 2026 · nx_vulkan · zed · FreeBSD · milestone

The nx_vulkan test suite finished in eleven seconds: 152 tests, 0 failures. Five seconds later, the eXMC test suite started against the same backend and began working through hierarchical models (Radon, Eight Schools, Centred and Non-Centred parameterisations), each one drawing posterior samples on an RTX 3060 Ti through a Vulkan compute pipeline that did not exist two months ago. The Radon model finished in a hundred and four seconds. The eXMC suite kept going. Nothing asked for CUDA. Nothing asked permission.

In another window, the merged Bastille adapter sat at commit daea21a on zed's main branch: 540 lines of Elixir that turn the FreeBSD jail manager's soft CLI contract into a typed API a deploy tool can rely on. In a third window, the research/fast-kernels-applicability branch held the note explaining why Emily's named-kernel pattern, despite being architecturally elegant, is the wrong choice for our workload class, and how we know.
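To make "typed API over a soft CLI contract" concrete, here is a minimal sketch of the shape such an adapter takes. The module name, return types, and the choice to shell out through System.cmd are illustrative assumptions, not the code merged at daea21a.

```elixir
# Illustrative sketch only, not the actual zed adapter.
# One bastille subcommand, with exit codes turned into typed results.
defmodule BastilleSketch do
  @type result :: :ok | {:error, String.t()}

  @spec start(String.t()) :: result
  def start(jail_name) do
    case System.cmd("bastille", ["start", jail_name], stderr_to_stdout: true) do
      {_out, 0} -> :ok
      {out, code} -> {:error, "bastille start exited #{code}: #{String.trim(out)}"}
    end
  end
end
```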

None of those three things, by itself, is the milestone. The milestone is what they add up to.

The mountain

For as long as GPU machine learning on the BEAM has existed, it has meant exactly one thing: EXLA. Which means XLA. Which means, on a GPU, CUDA. The chain of dependencies is not optional. Anyone who has tried to run an Nx workload on a host without a green NVIDIA sticker knows how short the conversation is.

That chain has consequences. It means BEAM ML, in practice, runs on Linux. It means macOS users got Apple’s MPS backend eventually, but only because Apple wrote it. It means FreeBSD users got nothing — not because Vulkan or OpenCL drivers are missing on FreeBSD, but because EXLA presupposes the entire NVIDIA proprietary stack, and that stack has never been a first-class FreeBSD citizen and is unlikely to become one. CUDA is a moat the size of a continent.

The default response to a moat is to swim. Pick Linux. Bring a Docker image. Pretend the rest of UNIX-the-tradition does not exist. This is what almost everyone does, and it is sensible because the alternative requires writing a compute backend.

We wrote the compute backend.

What “walkable” means in concrete terms

A walkable path means a particular sentence is now true. The sentence is: I can write a probabilistic model in Elixir on my laptop, ship it as a release tarball, drop it into a FreeBSD jail managed by zed, and have it sample on the host’s GPU using only Vulkan drivers and no CUDA. That sentence has nine clauses. In February, three of them did not work. Today they all do.

| Clause | Status in February | Status today |
| --- | --- | --- |
| Probabilistic model in Elixir | eXMC, working | eXMC, working |
| Sample with NUTS | Working on EXLA, slow on BinaryBackend | Working on EXLA, working on Nx.Vulkan, working on EMLX |
| Ship as a release tarball | Burrito, working | Burrito, working |
| Land in a FreeBSD jail | Manual via shell scripts | Declarative via zed converge |
| Jail manager talks to bastille | Did not exist | 540-line adapter; 5/0 live integration tests |
| Use the host's GPU | EXLA + CUDA only | Nx.Vulkan.Backend; pluggable JIT |
| No CUDA dependency | Unreachable | Reachable; Vulkan-only path proven |
| Same Elixir code on Linux/macOS/FreeBSD | Linux only for GPU | Cross-platform, JIT picked at config time |
| Production-grade quantitative measurements | Trial running, no Vulkan participation | 2.65x wall-time speedup on Vulkan at d=22; ESS/s parity at small d |

Each row is the punchline of a different month. Most of them have their own blog post. What none of those posts said is what the table says collectively: the work assembled itself. We did not set out to build a vertically integrated FreeBSD-BEAM-GPU stack. We set out to build several independent pieces, each because it was the next obvious thing. They turn out to compose.

The detour we paid for

The path was not straight. Six weeks ago we read Emily, the elegant, 141-line compiler that ships specialised fused kernels through Nx.Defn.Expr.optional. Its pattern is, frankly, lovely. Each named kernel is fifteen lines. The compiler is so small you could put it in a tweet. We refactored eXMC's leapfrog to use it, expecting the kind of speedup the architecture promises.

The leapfrog became nine times slower. Each Expr.optional indirection cost roughly seven hundred microseconds: a function_exported? lookup, dynamic dispatch, Rustler resource decode, shape validation. Six elementwise ops in a NUTS body times seven hundred microseconds is forty-two hundred microseconds; the microbenchmark reported four thousand three hundred. Emily was not wrong. Emily was the wrong tool for vectors of size eight, which is what an MCMC sampler at typical Bayesian dimensions actually deals with.

The IR walker we had been about to delete — the Nx.Vulkan.Compiler that detects right-folded chains and emits one fused dispatch per recognised pattern — turned out to be the right architecture for our workload. We kept it. We ship the named-kernel module as an opt-in API for callers who know their tensor sizes. The research note explaining why is in the repository, where the next person who reads Emily and gets excited can find it before they refactor anything.
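As a rough illustration of what "detect right-folded chains, emit one fused dispatch" means, here is a toy walker over Nx.Defn.Expr nodes. It is a simplification built on our own assumptions, not the actual Nx.Vulkan.Compiler, and the fused-dispatch descriptor it returns is invented for the example.

```elixir
# Toy sketch, not the real compiler: collect a right-folded chain of
# elementwise ops into a single fused-dispatch plan instead of one
# dispatch per op.
defmodule FusionSketch do
  @elementwise [:add, :subtract, :multiply, :divide]

  # Keep walking while the expression nests in its right argument.
  defp collect(%Nx.Tensor{data: %Nx.Defn.Expr{op: op, args: [left, right]}}, acc)
       when op in @elementwise do
    collect(right, [{op, left} | acc])
  end

  defp collect(leaf, acc), do: {leaf, Enum.reverse(acc)}

  # Chains of two or more ops become one fused dispatch; shorter ones
  # fall back to per-op dispatch.
  def plan(expr) do
    case collect(expr, []) do
      {leaf, ops} when length(ops) >= 2 -> {:fused_dispatch, ops, leaf}
      _ -> :per_op_dispatch
    end
  end
end
```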

The lesson is short and worth saying once: before adopting an architectural pattern from another project, write the smallest microbenchmark that compares the pattern’s per-call cost to the existing path on representative inputs. The break-even depends on workload, not philosophy.
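For this particular detour, the smallest useful microbenchmark can be as little as the following sketch. MyKernels.fused_leapfrog_step/3 is a placeholder for whichever named kernel is under evaluation, and the representative input is a size-8 f32 vector because that is the regime the detour was about.

```elixir
# Compare per-call cost of the indirected kernel against the plain Nx path
# on inputs the sampler actually sees.
x = Nx.iota({8}, type: :f32)
p = Nx.iota({8}, type: :f32)
eps = 0.01

baseline  = fn -> Nx.add(x, Nx.multiply(eps, p)) end
candidate = fn -> MyKernels.fused_leapfrog_step(x, p, eps) end   # placeholder

for {label, fun} <- [baseline: baseline, candidate: candidate] do
  {micros, _} = :timer.tc(fn -> Enum.each(1..1_000, fn _ -> fun.() end) end)
  IO.puts("#{label}: #{micros / 1_000} µs per call")   # 1_000 calls per run
end
```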

What zed gets out of this

zed is the declarative deploy tool: a BEAM-only replacement for the kind of imperative deploy infrastructure that produces a different bash script per environment. It targets FreeBSD and illumos because those are the platforms that ship ZFS and jails/zones in the base system. It uses ZFS user properties as a state store, replacing etcd and Consul. It treats convergence as a four-phase pipeline (diff, plan, apply, verify) with snapshot-backed rollback that runs in constant time regardless of dataset size.
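The ZFS-user-properties-as-state-store idea is easy to show in miniature. The following is a sketch under assumptions (the zed: property namespace, the module name, shelling out to the zfs CLI), not zed's actual state module; ZFS user property names must contain a colon, and an unset property reads back as "-".

```elixir
# Sketch only: desired/observed state stored as ZFS user properties on the
# jail's dataset, read and written through the zfs CLI.
defmodule ZfsStateSketch do
  @ns "zed"

  def put(dataset, key, value) do
    {_out, 0} = System.cmd("zfs", ["set", "#{@ns}:#{key}=#{value}", dataset])
    :ok
  end

  def get(dataset, key) do
    case System.cmd("zfs", ["get", "-H", "-o", "value", "#{@ns}:#{key}", dataset]) do
      {out, 0} ->
        case String.trim(out) do
          # zfs reports unset user properties as "-"
          "-" -> :unset
          value -> {:ok, value}
        end

      {out, _code} ->
        {:error, String.trim(out)}
    end
  end
end
```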

Before this week, zed's value proposition had a hole in it. We could declaratively describe a FreeBSD jail. We could provision it. We could deploy a BEAM release into it. But if the release wanted GPU acceleration, the answer was: it doesn't get any. Half the value of running ML workloads on FreeBSD evaporates if you have to give the GPU back.

The hole is filled now. zed converge against a FreeBSD host with a Vulkan-capable GPU can deploy an eXMC service that uses that GPU through Nx.Vulkan. The configuration is one line: config :exmc, :compiler, :vulkan. The same release tarball that runs CPU-only on a host without a GPU runs GPU-accelerated on a host with one, because the JIT is picked at startup based on what the runtime finds. Cross-platform compute via Vulkan is what zed always needed, and now it has it.
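What "the JIT is picked at startup based on what the runtime finds" can look like in config, sketched under assumptions: Nx.Vulkan.available?/0 is a hypothetical capability probe, and the :exla and :binary fallback values are illustrative; the only value the text above confirms is :vulkan.

```elixir
# config/runtime.exs -- sketch of startup-time compiler selection.
import Config

compiler =
  cond do
    # Hypothetical probe; substitute whatever check the backend exposes.
    Code.ensure_loaded?(Nx.Vulkan) and
        function_exported?(Nx.Vulkan, :available?, 0) and Nx.Vulkan.available?() ->
      :vulkan

    Code.ensure_loaded?(EXLA) ->
      :exla

    true ->
      :binary
  end

config :exmc, :compiler, compiler
```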

What is not yet done

The path is walkable, not paved. There is a list.

Coda

The mountain of CUDA sophistication is still there. NVIDIA still ships the deepest, most heavily optimised compute stack of any hardware vendor. cuBLAS still beats every alternative on dense linear algebra by a margin that is sometimes embarrassing. XLA’s graph-level optimisations are real and we are not going to match them at our scale.

That is fine. The point was never to climb the mountain. The point was to demonstrate that we do not have to. There is a walkable path, through unglamorous Vulkan compute pipelines and unflashy SPIR-V shaders, and it leads to the same place: a working GPU-accelerated probabilistic-programming workload running on a host the BEAM was always supposed to be able to deploy to.

The shady shaders did the work. The path is open.


Repositories: nx_vulkan (main at d2873bb), zed (main includes the Bastille adapter at daea21a), eXMC at the path-pinned dev workspace. The eXMC test suite has been running against nx_vulkan main throughout the writing of this note; nothing has failed yet, and the running processes that hold port 4000 belong to a live trading trial that has been up since April 19. The mountain is still there. We just stopped looking at it.