The first machine to run a GPU compute kernel for the Spirit spin simulator on FreeBSD was a 2013 Mac Pro — the trash-can model — with a GeForce GT 750M bolted on. 384 CUDA cores. Kepler. Released the year of the first iPhone with Touch ID. The kind of card you find in a dead laptop in a drawer. It produced the entirely respectable result of 0.30 milliseconds for an elementwise add over a million single-precision floats, beating the same machine’s CPU by 1.66x and proving that yes, you can in fact do GPU compute on FreeBSD.
That was Tuesday. By Friday, the same backend code, compiled on a Linux desktop, was running on a GeForce RTX 3060 Ti — 4,864 CUDA cores, Ampere, eight gigabytes of GDDR6 — and producing the same elementwise add over four million floats in 0.22 milliseconds. Thirteen times faster than the CPU on the same machine, and more than six times faster per element than the GT 750M was on its own machine. Same backend. Same shader. Same five hundred and forty lines of C++.
The lesson is not that the RTX 3060 Ti is a faster GPU than a 2013 GT 750M. It is that the path between Spirit and the GPU went through Vulkan, and Vulkan is the same on FreeBSD as it is on Linux as it is on Windows, and so the work that proved the FreeBSD path is the work that runs unchanged on a real workstation card.
Why Vulkan, not CUDA
Spirit, like most physics simulators with a respectable history, ships a CUDA backend. The CUDA backend is excellent. It is also useless on FreeBSD — NVIDIA does not ship a CUDA driver stack for FreeBSD, never has, and gives no indication of plans to. The same is true for illumos, OpenBSD, NetBSD, DragonFlyBSD, and any other operating system that doesn’t pay NVIDIA enough in market share to be worth the engineering investment.
Vulkan is different. Vulkan ships as part of the graphics driver: FreeBSD’s NVIDIA driver includes the Vulkan ICD, and the Mesa drivers (open source, permissively licensed) ship Vulkan for AMD and Intel cards on FreeBSD too. The Vulkan loader and SDK are open source as well. The whole stack is portable in a way that CUDA fundamentally is not, because CUDA is the proprietary product and Vulkan is the open standard.
The ergonomics are worse. The boilerplate is worse. The error messages are worse. The compute primitives that CUDA gives you in two lines of __global__ require fifty lines of descriptor set allocation, pipeline layout binding, push constant configuration, and command buffer recording in Vulkan. This is not an objection. The ergonomics are worse on FreeBSD too, because FreeBSD is not the operating system commodity GPU compute was designed for. We are paying ergonomics for portability, and we are getting it.
The Mac Pro proof
The first run was on free-macpro-gpu.local, a 2013 Mac Pro running FreeBSD 15.0, with a GeForce GT 750M in a Thunderbolt external chassis. vulkaninfo --summary reported the card present. The Vulkan ICD loaded. The compute queue family existed. Three tests ran:
- Vulkan initialization — pick a physical device, create a logical device, allocate a command pool, set up a descriptor pool. Pass.
- Tensor round-trip — allocate three host-visible buffers, upload [1, 2, 3, 4] to one of them, read it back, compare. Pass.
- Elementwise add — compile a parametric GLSL shader, upload [1..8] and [10, 20, ..., 80], dispatch with the spec-constant for ADD, download the result and compare to [11, 22, ..., 88]. Pass.
The benchmark numbers from that machine were not impressive — the GT 750M is a 2013 mobile GPU — but they were correct, repeatable, and they proved every single layer of the Vulkan stack was in place on FreeBSD.
FreeBSD GT 750M (Kepler, 384 cores):
N=1M: 0.30 ms GPU, 0.50 ms CPU, 1.66x speedup
N=4M: 1.41 ms GPU, 2.48 ms CPU, 1.76x speedup
At a million elements the GT 750M beats the 2013 Xeon by less than a factor of two. At four million it’s about the same. This is in the comfortable range where everyone always argues about whether GPU compute is “worth it”: on a card this old, against a CPU that’s about the same generation, on a workload that is specifically chosen to be small, the answer is “technically yes but barely.”
That was not the point of the run. The point of the run was that it ran at all, on FreeBSD, against a real GPU, and that the benchmark binary was the same shape we’d build on Linux. Which is what we did next.
The Linux follow-up
The handoff document specified an RTX 3070 8GB. The Linux desktop turned out to have an RTX 3060 Ti 8GB — same Ampere arch, slightly lower core count, no practical difference for the test. Driver 580. Vulkan 1.3.275. The same git checkout feature/vulkan-backend, the same c++ command from the README, the same elementwise_binary.spv shader binary out of the repo. No code changes.
Linux RTX 3060 Ti (Ampere, 4864 cores):
N=64K: 0.04 ms GPU, 0.06 ms CPU, 1.4x speedup
N=256K: 0.05 ms GPU, 0.20 ms CPU, 3.7x speedup
N=1M: 0.07 ms GPU, 0.62 ms CPU, 8.8x speedup ← sweet spot
N=4M: 0.22 ms GPU, 2.97 ms CPU, 13.3x speedup
The 4M-element case was where the story got interesting. The first three runs of the benchmark produced 1.93 ms, 2.01 ms, and 0.21 ms for the same workload. An order of magnitude of variance between consecutive runs of the same code on the same hardware. The kind of result that, at 11pm on a Tuesday, you stare at and suspect a memory leak.
The variance had two faces
The first face was the GPU’s power state machine. The 3060 Ti idles at P8 — 210 MHz core, 18 watts — and ramps to P0 — 1755 MHz core, 240 watts — over what the driver reports as “a few hundred microseconds.” Which is fine until you realize that a 4M elementwise add at full clock takes about 220 microseconds, and the bench was averaging 50 of them, and the first ten were the GPU climbing to its operating frequency. The published benchmark average was the GPU’s ramp time, not its compute time.
The fix for that face was straightforward: scale the warmup with the workload size, and burn enough iterations at large N to land the timed loop in the GPU’s steady state.
int warmup_iters = N >= 1048576 ? 30 : (N >= 65536 ? 10 : 3);
That cut variance for the 1M case dramatically. It only partially helped at 4M. Which led to the second face.
The second face was that this GPU drives a monitor. The display runs at 60 Hz, which is to say the display compositor preempts the compute queue every 16.67 milliseconds, and each preemption stalls the dispatch in flight. For dispatches that take 0.22 milliseconds each, in a benchmark that averages 50 of them across about 11 milliseconds of wall time, the law of large numbers does not apply. A run that happens to land between display refreshes sees 0.22 ms; a run that catches one or two refreshes sees 1+ ms; neither is a bug.
The honest mitigation is to run on a headless GPU. The cosmetic mitigation is to use Vulkan timestamp queries instead of CPU clocks — which gives you the GPU’s view of the compute time, ignoring host-side preemption noise. We didn’t do either, because the variance is itself part of the story: this is what your users will see if they run the same benchmark on their desktop card. The best-observed numbers tell you what the GPU is capable of; the worst-observed numbers tell you what production on a contended GPU looks like; both are useful.
Two numbers
The blog headline is captured in two numbers. They are old friends now: 0.30 milliseconds, and 0.07 milliseconds. The first is the FreeBSD GT 750M’s steady state at one million elements. The second is the Linux RTX 3060 Ti’s steady state at the same size. They are the same code, the same shader, the same five hundred and forty lines of C++.
That is the proof that the cross-platform path is real and not contingent. The Vulkan backend is portable. The shader is portable. The build is portable. The performance is what the hardware can do, and on a card released in 2020 driving a desktop, it can do the elementwise add over a million floats in seventy microseconds.
What this opens up
Spirit’s use case — spin simulations — is the proof of concept. The kernel is parametric: the same shader handles add, subtract, multiply, divide, distinguished by a spec constant. The same plumbing handles every element-wise operation a simulator wants. There is no fundamental reason this is specific to Spirit, or to physics, or to FreeBSD.
Three things this makes possible immediately:
Nx.Vulkan
Elixir’s Nx tensor library has two GPU backends: EXLA (XLA, requires a CUDA stack or a TPU) and EMLX (MLX, Apple Silicon only). On FreeBSD with an NVIDIA card, you have neither. The Spirit Vulkan backend, broken out of Spirit and rewrapped as a Rustler NIF, is the third backend. defn functions get a device: :vulkan option. Your Elixir code runs on the GPU on FreeBSD without compromising on Erlang+OTP, without booting into a Linux jail, without depending on CUDA's BSD-hostile toolchain.
The numbers we have are dispatch-only. They tell you the GPU performance. The thing that has to be built well, and which Spirit’s current bench does not yet do well, is persistent device-resident buffers: tensors that live on the GPU across operations, so the upload-compute-download dance happens once at the boundary instead of every operation. The numbers say this matters: at one million elements, dispatch is 0.07 ms and the full round trip is 50 ms. Persistent buffers eliminate the 49.93 ms.
GPU compute on FreeBSD as a deploy target
Once Nx works on FreeBSD via Vulkan, the BEAM application stack becomes a viable target for ML inference workloads on FreeBSD hardware. Which means jails. Which means ZFS snapshots of model weights. Which means — and this is the joke that started this whole thread of work — you can deploy a BEAM application that does GPU inference, in a FreeBSD jail, on a host that doesn’t have CUDA installed and never will. The single-host demo we’ve been building for the zed deploy tool gains a backend that does GPU inference. The trader runs its MCMC posteriors on the GPU. The recommendation engine runs its embeddings on the GPU. The ZFS snapshot of the model is a filesystem feature; the GPU compute is a Vulkan feature; the filesystem and the GPU are running on FreeBSD and not Linux, and nothing about that should be remarkable, but somehow it is.
Other GPUs, other operating systems
We tested NVIDIA. AMD’s Vulkan stack works the same way through Mesa. Intel’s integrated GPUs work through ANV. Apple Silicon has MoltenVK, which translates Vulkan to Metal. illumos has a preliminary NVIDIA driver with Vulkan support. The same shader, the same backend, runs on all of them. Not perfectly — some specialization is needed for f64 support, some for compute queue families — but the architecture is portable in a way that CUDA never was. The exit door for the platform-CUDA-lock problem is the door labelled “industry standard,” and the door turns out to be unlocked.
What’s next
The persistent-buffer optimization is the next iteration on Spirit’s side. Without it the GPU+xfer numbers are 99.9% transfer overhead and 0.1% compute — the dispatch optimization matters only if the calling code is structured to keep tensors on-device. Spirit’s simulation loop is naturally batched, so this is straightforward. Nx’s callgraph is more dynamic and will need a more careful contract.
After that: more shaders. The current one handles four binary operations. We’ll need reductions (sum, mean, min, max), unary ops (exp, log, sigmoid), matrix multiplication (the big one), and convolution (the much bigger one). Each is a separate .comp file, each maps to a single VkPipeline, each is a few dozen lines of GLSL. None of them require platform work; they all run wherever the bench above runs.
After that: the Rustler NIF wrapper, the Nx backend interface, the integration tests, the Hex package. The Vulkan backend is the foundation. Everything else is mechanical assembly.
Coda
The cross-platform claim has a way of being true in principle and not in practice. You build a thing on Linux. You write “tested on FreeBSD” in the README. The first user who runs it on FreeBSD files an issue. You discover that the build script assumed GNU make, that the symbol versioning differs, that kqueue wasn’t mocked, that the test fixtures use /proc. Six months pass. The README quietly stops mentioning FreeBSD.
The Vulkan compute backend has, so far, resisted that fate. Partly because Vulkan was designed by people who had been bitten by platform lock-in before, and partly because the FreeBSD GT 750M run came first. It is much harder to build a Linux-shaped artifact when the first machine the artifact has to run on is not Linux. The result, almost by accident, is a backend that compiles unchanged on the next box that needs it.
There is a temptation, for an industry that has spent ten years treating CUDA as a synonym for GPU compute, to read this as a political claim about NVIDIA. It is not. CUDA is excellent. The problem is that CUDA is excellent and proprietary, and the operating systems CUDA does not ship for include the ones some of us prefer to deploy on. The exit door is Vulkan. The exit door has been there the whole time.