The nx_vulkan test suite finished in eleven seconds.
152 tests, 0 failures. Five seconds later, the eXMC
test suite started against the same backend and began emitting
hierarchical models — Radon, Eight Schools, Centred and
Non-Centred parameterisations — each one drawing posterior
samples on an RTX 3060 Ti through a Vulkan compute pipeline that
two months ago did not exist. The Radon model finished in a
hundred and four seconds. The eXMC suite kept going. Nothing
asked for CUDA. Nothing asked permission.
In another window, the merged Bastille adapter sat at
daea21a on zed's main branch, 540 lines of Elixir that
turn the FreeBSD jail manager’s soft CLI contract into a
typed API a deploy tool can rely on. In a third window, the
research/fast-kernels-applicability branch held the
note explaining why Emily’s named-kernel pattern, despite
being architecturally elegant, is the wrong choice for our
workload class — and how we know.
None of those three things, by itself, is the milestone. The milestone is what they add up to.
## The mountain
For the last decade, GPU machine learning on the BEAM has meant
exactly one thing: EXLA. Which means
XLA. Which means, on a GPU, CUDA. The
chain of dependencies is not optional. Anyone who has tried to
run an Nx workload on a host without a green NVIDIA sticker
knows how short the conversation is.
That chain has consequences. It means BEAM ML, in practice, runs on Linux. It means macOS users got Apple’s MPS backend eventually, but only because Apple wrote it. It means FreeBSD users got nothing — not because Vulkan or OpenCL drivers are missing on FreeBSD, but because EXLA presupposes the entire NVIDIA proprietary stack, and that stack has never been a first-class FreeBSD citizen and is unlikely to become one. CUDA is a moat the size of a continent.
The default response to a moat is to swim. Pick Linux. Bring a Docker image. Pretend the rest of UNIX-the-tradition does not exist. This is what almost everyone does, and it is sensible because the alternative requires writing a compute backend.
We wrote the compute backend.
## What “walkable” means in concrete terms
A walkable path means a particular sentence is now true. The
sentence is: I can write a probabilistic model in Elixir on
my laptop, ship it as a release tarball, drop it into a FreeBSD
jail managed by zed, and have it sample on
the host’s GPU using only Vulkan drivers and no CUDA.
That sentence has nine clauses. In February, three of them did
not work. Today they all do.
| Clause | Status in February | Status today |
|---|---|---|
| Probabilistic model in Elixir | eXMC, working | eXMC, working |
| Sample with NUTS | Working on EXLA, slow on BinaryBackend | Working on EXLA, working on Nx.Vulkan, working on EMLX |
| Ship as a release tarball | Burrito, working | Burrito, working |
| Land in a FreeBSD jail | Manual via shell scripts | Declarative via zed converge |
| Jail manager talks to bastille | Did not exist | 540-line adapter; 5/0 live integration tests |
| Use the host’s GPU | EXLA + CUDA only | Nx.Vulkan.Backend; pluggable JIT |
| No CUDA dependency | Unreachable | Reachable; Vulkan-only path proven |
| Same Elixir code on Linux/macOS/FreeBSD | Linux only for GPU | Cross-platform, JIT picked at config time |
| Production-grade quantitative measurements | Trial running, no Vulkan participation | 2.65x wall-time speedup on Vulkan at d=22; ESS/s parity at small d |
Each row is the punchline of a different month. Most of them have their own blog post. What none of those posts said is what the table says collectively: the work assembled itself. We did not set out to build a vertically integrated FreeBSD-BEAM-GPU stack. We set out to build several independent pieces, each because it was the next obvious thing. They turn out to compose.
## The detour we paid for
The path was not straight. Six weeks ago we read Emily —
the elegant, 141-line compiler that ships specialised fused
kernels through Nx.Defn.Expr.optional. Their
pattern is, frankly, lovely. Each named kernel is fifteen lines.
The compiler is so small you could put it in a tweet. We
refactored eXMC’s leapfrog to use it, expecting the kind
of speedup the architecture promises.
The leapfrog became nine times slower. Each
Expr.optional indirection cost roughly seven
hundred microseconds — a function_exported? lookup, dynamic
dispatch, Rustler resource decode, shape validation. Six
elementwise ops in a NUTS body times seven hundred
microseconds is forty-two hundred microseconds. The
microbenchmark reported four thousand three hundred. Emily
was not wrong. Emily was the wrong tool for vectors of size
eight, which is what an MCMC sampler at typical Bayesian
dimensions actually deals with.
The IR walker we had been about to delete — the
Nx.Vulkan.Compiler that detects right-folded
chains and emits one fused dispatch per recognised pattern
— turned out to be the right architecture for our
workload. We kept it. We ship the named-kernel module as an
opt-in API for callers who know their tensor sizes. The
research note explaining why is in the repository, where the
next person who reads Emily and gets excited can find it
before they refactor anything.
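For callers who do know their sizes, the opt-in is direct. A minimal sketch, assuming Nx.Vulkan.Fast.normal_logpdf takes the value, location, and scale tensors in that order (the module and function name come from the research note; the arity is a guess):

```elixir
# Hedged sketch: Nx.Vulkan.Fast.normal_logpdf is the opt-in named
# kernel; the (x, mu, sigma) argument order is an assumption.
x = Nx.iota({4096}, type: :f32)
mu = Nx.broadcast(0.0, {4096})
sigma = Nx.broadcast(1.0, {4096})

# Callers who know their tensors are large call the named kernel
# directly; everyone else gets the IR walker's fused dispatch.
logp = Nx.Vulkan.Fast.normal_logpdf(x, mu, sigma)
```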
The lesson is short and worth saying once: before adopting an architectural pattern from another project, write the smallest microbenchmark that compares the pattern’s per-call cost to the existing path on representative inputs. The break-even depends on workload, not philosophy.
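As a sketch of what that smallest microbenchmark can look like: a generic per-call timer in plain Elixir plus Nx, with the dispatch under test passed in as a function. The harness below is illustrative, not eXMC's actual benchmark.

```elixir
# Generic per-call cost probe. Pass the dispatch under test as a
# function; compare the pattern's path to the existing path on a
# representative input.
defmodule PerCallBench do
  def us_per_call(fun, input, iters \\ 1_000) do
    # Warm up once so one-time JIT/NIF setup is not billed to the loop.
    fun.(input)

    {total_us, _result} =
      :timer.tc(fn ->
        Enum.each(1..iters, fn _ -> fun.(input) end)
      end)

    total_us / iters
  end
end

# A representative input for this workload class: a size-8 f32 vector.
x = Nx.iota({8}, type: :f32)
IO.puts("per-call µs: #{PerCallBench.us_per_call(&Nx.exp/1, x)}")
```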
## What zed gets out of this
zed is the declarative deploy tool. A BEAM-only
replacement for the kind of imperative deploy infrastructure
that produces a different bash script per environment. It
targets FreeBSD and illumos because those are the platforms
that have ZFS and jails/zones in the base system. It uses ZFS
user properties as a state store, replacing etcd and consul. It
treats convergence as a four-phase pipeline (diff, plan, apply,
verify) with snapshot-backed rollback that runs in constant
time regardless of dataset size.
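In shape, the loop is small enough to sketch. The module below is illustrative: the function names and the zed:state property name are invented, and only the zfs subcommands are real.

```elixir
# Illustrative sketch of the four-phase converge loop. Module and
# function names are invented; "zed:state" is an assumed property name.
defmodule ConvergeSketch do
  def run(dataset, desired) do
    actual = read_state(dataset)
    plan = diff(desired, actual)

    # Snapshot before applying: `zfs rollback` restores the snapshot
    # in constant time regardless of dataset size.
    {_, 0} = System.cmd("zfs", ["snapshot", "#{dataset}@pre-converge"])

    case apply_plan(plan) do
      :ok ->
        verify(dataset, desired)

      {:error, _} = err ->
        System.cmd("zfs", ["rollback", "#{dataset}@pre-converge"])
        err
    end
  end

  # State lives in a ZFS user property, not etcd or consul.
  defp read_state(dataset) do
    {out, 0} =
      System.cmd("zfs", ["get", "-H", "-o", "value", "zed:state", dataset])

    String.trim(out)
  end

  defp diff(desired, actual), do: %{set: desired, was: actual}
  defp apply_plan(_plan), do: :ok
  defp verify(_dataset, _desired), do: :ok
end
```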
Zed’s value proposition, before this week, had a hole in it. We could declaratively describe a FreeBSD jail. We could provision it. We could deploy a BEAM release into it. But if the release wanted GPU acceleration, the answer was: it doesn’t get any. Half the value of running ML workloads on FreeBSD evaporates if you have to give the GPU back.
The hole is filled now. zed converge against a
FreeBSD host with a Vulkan-capable GPU can deploy an eXMC
service that uses that GPU through Nx.Vulkan. The
configuration is one line: config :exmc, :compiler,
:vulkan. The same release tarball that runs CPU-only on
a host without a GPU runs GPU-accelerated on a host with one,
because the JIT is picked at startup based on what the runtime
finds. Cross-platform compute via Vulkan is what zed always
needed, and now it has it.
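The startup selection is the part worth sketching. The only line the post pins down is the :compiler key; the availability probe below, Nx.Vulkan.available?/0, is a hypothetical name for whatever check the runtime actually performs.

```elixir
# config/runtime.exs -- hedged sketch. The :compiler key is the
# documented part; Nx.Vulkan.available?/0 is a hypothetical probe.
import Config

compiler =
  if Code.ensure_loaded?(Nx.Vulkan) and Nx.Vulkan.available?() do
    :vulkan
  else
    :binary
  end

config :exmc, :compiler, compiler
```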
## What is not yet done
The path is walkable, not paved. There is a list.
- FreeBSD hardware bring-up. Everything described above runs on the Linux dev box, because that is where nx_vulkan's NIF currently builds cleanly. The shaders, compiled by spirit on macOS, are platform-independent SPIR-V. The wiring should work on FreeBSD with the right GPU drivers; nobody has run it there yet. That is the next milestone, and it is the one that lets the sentence in the "walkable" section above stop being a promise and start being a measurement.
- The reduce_scalar mystery. Earlier benchmarks reported eight argument errors against Nx.Vulkan.Native.reduce_scalar/3. The current main passes 152/0. Either the recent fused-chain work incidentally fixed them, or they live in a downstream code path we have not exercised yet. There is a four-mode triage plan; the next test run against eXMC's full suite under the Vulkan compiler will say which.
- Vulkan f64 full-reduce. The single-axis reduction has both an f32 and an f64 NIF. The full-axis reduction has only the f32 path; f64 falls back to host materialisation. For models that need double precision throughout (rare in MCMC; common in finance), this is the gap.
- The SMC-Ex notebook. The Emily-pattern research note predicts that Nx.Vulkan.Fast.normal_logpdf wins for particle filters with high-dimensional observation models. The notebook that empirically locates the crossover — pure Elixir vs. Vulkan IR walker vs. Fast kernel, swept across observation dimension — is planned but not built. It is the empirical companion to the prediction, and the place where "workload-conditional" gets a measured answer instead of an asserted one.
- Hex publish. nx_vulkan is currently a path dep. Until it's on Hex with a real version, "use Vulkan instead of CUDA" is something that requires checking out three repositories and aligning their main branches. That is a ceremony only its authors should have to perform. A minimal deps sketch follows this list.
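Once that publish happens, the difference for a consumer is one line in mix.exs. A sketch, with a hypothetical version number:

```elixir
# In mix.exs: today's ceremony vs. the goal. The "~> 0.1" version
# is hypothetical; nx_vulkan has no Hex release yet.
defp deps do
  [
    # Today: a path dep into a sibling checkout.
    {:nx_vulkan, path: "../nx_vulkan"}
    # After the Hex publish, the goal:
    # {:nx_vulkan, "~> 0.1"}
  ]
end
```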
## Coda
The mountain of CUDA sophistication is still there. NVIDIA still ships the deepest, most heavily optimised compute stack of any hardware vendor. cuBLAS still beats every alternative on dense linear algebra by a margin that is sometimes embarrassing. XLA’s graph-level optimisations are real and we are not going to match them at our scale.
That is fine. The point was never to climb the mountain. The point was to demonstrate that we do not have to. There is a walkable path, through unglamorous Vulkan compute pipelines and unflashy SPIR-V shaders, and it leads to the same place: a working GPU-accelerated probabilistic-programming workload running on a host the BEAM was always supposed to be able to deploy to.
The shady shaders did the work. The path is open.
Repositories: nx_vulkan
(main at d2873bb),
zed
(main includes the Bastille adapter at daea21a),
eXMC at the path-pinned dev workspace. The eXMC test suite has
been running against nx_vulkan main throughout the
writing of this note; nothing has failed yet, and the running
processes that hold port 4000 belong to a live trading trial
that has been up since April 19. The mountain is still there.
We just stopped looking at it.