The first machine to run a GPU compute kernel for the Spirit spin simulator on FreeBSD was a 2013 Mac Pro — the trash-can model — with a GeForce GT 750M bolted on. 384 CUDA cores. Kepler. Released the year of the first iPhone with Touch ID. The kind of card you find in a dead laptop in a drawer. It produced the entirely respectable result of 0.30 milliseconds for an elementwise add over a million single-precision floats, beating the same machine’s CPU by 1.66x and proving that yes, you can in fact do GPU compute on FreeBSD.
That was Tuesday. By Friday, the same backend code, compiled on a Linux desktop, was running on a GeForce RTX 3060 Ti — 4,864 CUDA cores, Ampere, eight gigabytes of GDDR6 — and producing the same elementwise add over four million floats in 0.22 milliseconds. Thirteen times faster than the CPU on the same machine, and more than six times faster per element than the GT 750M was on its own machine. Same backend. Same shader. Same five hundred and forty lines of C++.
The lesson is not that the RTX 3060 Ti is a faster GPU than a 2013 GT 750M. It is that the path between Spirit and the GPU went through Vulkan, and Vulkan is the same on FreeBSD as it is on Linux as it is on Windows, and so the work that proved the FreeBSD path is the work that runs unchanged on a real workstation card.
Why Vulkan, not CUDA
Spirit, like most physics simulators with a respectable history, ships a CUDA backend. The CUDA backend is excellent. It is also useless on FreeBSD — NVIDIA does not ship a CUDA driver stack for FreeBSD, never has, and gives no indication of plans to. The same is true for illumos, OpenBSD, NetBSD, DragonFlyBSD, and any other operating system that doesn’t pay NVIDIA enough in market share to be worth the engineering investment.
Vulkan is different. Vulkan ships as part of the graphics driver: FreeBSD’s NVIDIA driver includes the Vulkan ICD, and the Mesa drivers (open source, permissively licensed) ship Vulkan for AMD and Intel cards on FreeBSD too. The Vulkan loader and SDK are open source as well. The whole stack is portable in a way that CUDA fundamentally is not, because CUDA is the proprietary product and Vulkan is the open standard.
The ergonomics are worse. The boilerplate is worse. The error messages are worse. The compute primitives that CUDA gives you in two lines of __global__ require fifty lines of descriptor set allocation, pipeline layout binding, push constant configuration, and command buffer recording in Vulkan. This is not an objection. The ergonomics are worse on FreeBSD too, because FreeBSD is not the operating system commodity GPU compute was designed for. We are paying ergonomics for portability, and we are getting it.
The Mac Pro proof
The first run was on free-macpro-gpu.local, a 2013 Mac Pro running FreeBSD 15.0, with a GeForce GT 750M in a Thunderbolt external chassis. vulkaninfo --summary reported the card present. The Vulkan ICD loaded. The compute queue family existed. Three tests ran:
- Vulkan initialization — pick a physical device, create a logical device, allocate a command pool, set up a descriptor pool. Pass.
- Tensor round-trip — allocate three host-visible buffers, upload [1, 2, 3, 4] to one of them, read it back, compare. Pass.
- Elementwise add — compile a parametric GLSL shader, upload [1..8] and [10, 20, ..., 80], dispatch with the spec-constant for ADD, download the result and compare to [11, 22, ..., 88]. Pass.
The benchmark numbers from that machine were not impressive — the GT 750M is a 2013 mobile GPU — but they were correct, repeatable, and they proved every single layer of the Vulkan stack was in place on FreeBSD.
FreeBSD GT 750M (Kepler, 384 cores):
N=1M: 0.30 ms GPU, 0.50 ms CPU, 1.66x speedup
N=4M: 1.41 ms GPU, 2.48 ms CPU, 1.76x speedup
At a million elements the GT 750M beats the 2013 Xeon by less than a factor of two. At four million it’s about the same. This is in the comfortable range where everyone always argues about whether GPU compute is “worth it”: on a card this old, against a CPU that’s about the same generation, on a workload that is specifically chosen to be small, the answer is “technically yes but barely.”
That was not the point of the run. The point of the run was that it ran at all, on FreeBSD, against a real GPU, and that the benchmark binary was the same shape we’d build on Linux. Which is what we did next.
The Linux follow-up
The handoff document specified an RTX 3070 8GB. The Linux desktop turned out to have an RTX 3060 Ti 8GB — same Ampere arch, slightly lower core count, no practical difference for the test. Driver 580. Vulkan 1.3.275. The same git checkout feature/vulkan-backend, the same c++ command from the README, the same elementwise_binary.spv shader binary out of the repo. No code changes.
Linux RTX 3060 Ti (Ampere, 4864 cores):
N=64K: 0.04 ms GPU, 0.06 ms CPU, 1.4x speedup
N=256K: 0.05 ms GPU, 0.20 ms CPU, 3.7x speedup
N=1M: 0.07 ms GPU, 0.62 ms CPU, 8.8x speedup ← sweet spot
N=4M: 0.22 ms GPU, 2.97 ms CPU, 13.3x speedup
The 4M-element case was where the story got interesting. The first three runs of the benchmark produced 1.93 ms, 2.01 ms, and 0.21 ms for the same workload. An order of magnitude of variance between consecutive runs of the same code on the same hardware. The kind of result that, at 11pm on a Tuesday, you stare at and suspect a memory leak.
The variance had two faces
The first face was the GPU’s power state machine. The 3060 Ti idles at P8 — 210 MHz core, 18 watts — and ramps to P0 — 1755 MHz core, 240 watts — over what the driver reports as “a few hundred microseconds.” Which is fine until you realize that a 4M elementwise add at full clock takes about 220 microseconds, and the bench was averaging 50 of them, and the first ten were the GPU climbing to its operating frequency. The published benchmark average was the GPU’s ramp time, not its compute time.
The fix for that face was straightforward: scale the warmup with the workload size, and burn enough iterations at large N to land the timed loop in the GPU’s steady state.
int warmup_iters = N >= 1048576 ? 30 : (N >= 65536 ? 10 : 3);
That cut variance for the 1M case dramatically. It only partially helped at 4M. Which led to the second face.
The second face was that this GPU drives a monitor. The display runs at 60 Hz, which is to say the display compositor preempts the compute queue every 16.67 milliseconds, and each preemption stalls the dispatch in flight. For dispatches that take 0.22 milliseconds each, in a benchmark that averages 50 of them across about 11 milliseconds of wall time, the law of large numbers does not apply. A run that happens to land between display refreshes sees 0.22 ms; a run that catches one or two refreshes sees 1+ ms; neither is a bug.
The honest mitigation is to run on a headless GPU. The cosmetic mitigation is to use Vulkan timestamp queries instead of CPU clocks — which gives you the GPU’s view of the compute time, ignoring host-side preemption noise. We didn’t do either, because the variance is itself part of the story: this is what your users will see if they run the same benchmark on their desktop card. The best-observed numbers tell you what the GPU is capable of; the worst-observed numbers tell you what production on a contended GPU looks like; both are useful.
Two numbers
The blog headline is captured in two numbers. They are old friends now: 0.30 milliseconds, and 0.07 milliseconds. The first is the FreeBSD GT 750M’s steady state at one million elements. The second is the Linux RTX 3060 Ti’s steady state at the same size. They are the same code, the same shader, the same five hundred and forty lines of C++.
That is the proof that the cross-platform path is real and not contingent. The Vulkan backend is portable. The shader is portable. The build is portable. The performance is what the hardware can do, and on a card released in 2020 driving a desktop, it can do the elementwise add over a million floats in seventy microseconds.
What this opens up
Spirit’s use case — spin simulations — is the proof of concept. The kernel is parametric: the same shader handles add, subtract, multiply, divide, distinguished by a spec constant. The same plumbing handles every element-wise operation a simulator wants. There is no fundamental reason this is specific to Spirit, or to physics, or to FreeBSD.
Three things this makes possible immediately:
Nx.Vulkan
Elixir’s Nx tensor library has two GPU backends: EXLA (XLA, requires a CUDA stack or a TPU) and EMLX (MLX, Apple Silicon only). On FreeBSD with an NVIDIA card, you have neither. The Spirit Vulkan backend, broken out of Spirit and rewrapped as a Rustler NIF, is the third backend. defn functions get a device: :vulkan option. Your Elixir code runs on the GPU on FreeBSD without compromising on Erlang+OTP, without booting into a Linux jail, without depending on CUDA's BSD-hostile toolchain.
The numbers we have are dispatch-only. They tell you the GPU performance. The thing that has to be built well, and which Spirit’s current bench does not yet do well, is persistent device-resident buffers: tensors that live on the GPU across operations, so the upload-compute-download dance happens once at the boundary instead of every operation. The numbers say this matters: at one million elements, dispatch is 0.07 ms and the full round trip is 50 ms. Persistent buffers eliminate the 49.93 ms.
GPU compute on FreeBSD as a deploy target
Once Nx works on FreeBSD via Vulkan, the BEAM application stack becomes a viable target for ML inference workloads on FreeBSD hardware. Which means jails. Which means ZFS snapshots of model weights. Which means — and this is the joke that started this whole thread of work — you can deploy a BEAM application that does GPU inference, in a FreeBSD jail, on a host that doesn’t have CUDA installed and never will. The single-host demo we’ve been building for the zed deploy tool gains a backend that does GPU inference. The trader runs its MCMC posteriors on the GPU. The recommendation engine runs its embeddings on the GPU. The ZFS snapshot of the model is a filesystem feature; the GPU compute is a Vulkan feature; the filesystem and the GPU are running on FreeBSD and not Linux, and nothing about that should be remarkable, but somehow it is.
Other GPUs, other operating systems
We tested NVIDIA. AMD’s Vulkan stack works the same way through Mesa. Intel’s integrated GPUs work through ANV. Apple Silicon has MoltenVK, which translates Vulkan to Metal. illumos has a preliminary NVIDIA driver with Vulkan support. The same shader, the same backend, runs on all of them. Not perfectly — some specialization is needed for f64 support, some for compute queue families — but the architecture is portable in a way that CUDA never was. The exit door for the platform-CUDA-lock problem is the door labelled “industry standard,” and the door turns out to be unlocked.
What’s next
The persistent-buffer optimization is the next iteration on Spirit’s side. Without it the GPU+xfer numbers are 99.9% transfer overhead and 0.1% compute — the dispatch optimization matters only if the calling code is structured to keep tensors on-device. Spirit’s simulation loop is naturally batched, so this is straightforward. Nx’s callgraph is more dynamic and will need a more careful contract.
After that: more shaders. The current one handles four binary operations. We’ll need reductions (sum, mean, min, max), unary ops (exp, log, sigmoid), matrix multiplication (the big one), and convolution (the much bigger one). Each is a separate .comp file, each maps to a single VkPipeline, each is a few dozen lines of GLSL. None of them require platform work; they all run wherever the bench above runs.
After that: the Rustler NIF wrapper, the Nx backend interface, the integration tests, the Hex package. The Vulkan backend is the foundation. Everything else is mechanical assembly.
Coda
The cross-platform claim has a way of being true in principle and not in practice. You build a thing on Linux. You write “tested on FreeBSD” in the README. The first user who runs it on FreeBSD files an issue. You discover that the build script assumed GNU make, that the symbol versioning differs, that kqueue wasn’t mocked, that the test fixtures use /proc. Six months pass. The README quietly stops mentioning FreeBSD.
The Vulkan compute backend has, so far, resisted that fate. Partly because Vulkan was designed by people who had been bitten by platform lock-in before, and partly because the FreeBSD GT 750M run came first. It is much harder to build a Linux-shaped artifact when the first machine the artifact has to run on is not Linux. The result, almost by accident, is a backend that compiles unchanged on the next box that needs it.
There is a temptation, for an industry that has spent ten years treating CUDA as a synonym for GPU compute, to read this as a political claim about NVIDIA. It is not. CUDA is excellent. The problem is that CUDA is excellent and proprietary, and the operating systems CUDA does not ship for include the ones some of us prefer to deploy on. The exit door is Vulkan. The exit door has been there the whole time.