The Backend That Didn’t Need to Know

The trader started cleanly at 02:25:15. By 02:25:30, all five paper-trading accounts had synced equity from Alpaca. By 02:26:00, eight of fifty instruments had completed their first NUTS warmup. By 02:28:55, the BEAM was gone. Not crashed in a clean, diagnose-by-stacktrace way. Gone — cpu_sup writing its parting line into the log and the supervisor tree dissolving upward in the standard Erlang choreography of cascading failure.

The error was visible if you knew where to look. It was scattered across the log thirteen times, inside Inspect.Error wrappers because something had tried to format a tensor for a debug message:

** (ArgumentError) argument error
    (nx_vulkan 0.0.1) Nx.Vulkan.Native.byte_size(#Reference<0.4146459861.1246363652.205724>)
    (nx_vulkan 0.0.1) lib/nx_vulkan/backend.ex:76: Nx.Vulkan.Backend.to_binary/2
    (nx 0.10.0) lib/nx.ex:2144: Nx.to_flat_list/2
    (exmc 0.1.0) lib/exmc/trading/regime_model.ex:225: Exmc.Trading.RegimeModel.extract_param/2

A NIF named byte_size received an opaque pointer and returned :badarg. Translated from Rustler’s conservative idiom, that means: the buffer behind this pointer is no longer valid. The buffer had been allocated by spirit, our vendored C++ Vulkan compute backend. Somewhere between the allocation and the byte_size call, the buffer’s backing memory had been freed, returned to the allocator’s pool, possibly even reallocated to another tensor. The pointer, preserved with a Rust resource wrapper and shipped via Rustler to Elixir, had outlived its referent. The NIF was right to complain.

This was supposed to be the cutover. Phase R4 of Mission II. mac-247, a 2013 Mac Pro running FreeBSD 15.0 with an NVIDIA GeForce GT 650M, would take over the live regime-model trading trial. The synthesised chain shader (eight free random variables, a softmax-mixture custom likelihood over two hundred observations, the whole production model rendered into elementwise_binary.spv and friends) would dispatch on real Vulkan compute, byte-identical to the dev-host Linux backend, in sixty milliseconds per K=32 leapfrog dispatch. Three minutes after the trader booted, the BEAM was gone for the third time in twenty minutes, and the question stopped being “how do we ship” and started being “what are we shipping.”

What the bug actually was

The trader did not crash because the C++ backend was wrong. The backend ran the leapfrog steps correctly. The synthesised SPV file at ~/.exmc/gpu_node/spv/synth_e1b5a5fe….spv — one point seven megabytes of compiled GLSL representing two months of work on emitter passes — loaded, dispatched, and produced the same chain tensors as the validation reference. Mission II R3 had measured the cost: sixty milliseconds per dispatch on the GT 650M, eight point three times under the five-hundred-millisecond budget per NUTS sample. The bug was not in the path that mattered. The bug was in a path nobody had paid attention to.

Nx.Vulkan.Backend.to_binary/2 exists to satisfy the single most ordinary thing one does with a tensor: read its bytes back. It is called when you print a tensor, when you call Nx.to_flat_list/1, when you cast between types. In the trader, it was called when the dashboard polled an instrument for its current posterior summary — RegimeModel.signal_params/1 calls extract_param/2 calls Nx.to_flat_list, which calls Nx.Vulkan.Backend.to_binary, which calls Nx.Vulkan.Native.byte_size, which asks the C++ shim how many bytes lived behind this VkBuf*. The C++ shim, having no way to know whether the pointer was still valid, called the underlying Vulkan API. The driver returned an error. The NIF translated it to :badarg. The Inspect machinery, trying to render the tensor in a crash dump, called the NIF again, and again, in increasingly panicked formatting attempts, until the supervisor restart intensity exhausted its budget and the BEAM exited.

This is the textbook FFI bug. Two memory models meet at the boundary, neither can verify the other, and the canonical advice — document the ownership protocol and trust the caller — works fine until the caller is forty thousand lines of NUTS sampler with deferred tensor materialisation, hot-reload workers, and a dashboard polling every five seconds. The buffer that mattered was not the one we thought about; it was the one we forgot to keep alive. The C++ shim’s contract had said caller frees, and the caller had, eventually, freed it, without anyone noticing that another caller still held a copy of the pointer.

There is a name for this. It is called a use-after-free, and it is the second-most-cited reason that operating system kernels have been written in C for fifty years and that the people writing them have been miserable for forty-nine of those years.

The spike

Vulkano is a Rust wrapper over the Vulkan compute API. It is the direct competitor to the C++ approach that produced our spirit backend. We had ruled it out, somewhat presumptuously, on the grounds that migrating away from a working C++ codebase is expensive. After the third BEAM crash that night, the expense in question had a name: every dashboard poll cycle, every posterior summary, every trader instrument that wanted to know its own state.

A spike is the right answer to a hypothesis. The hypothesis was: a vulkano dispatcher, loading the exact same SPV file from the exact same on-disk cache, can produce the exact same output tensors as the C++ path, in the same perf envelope, without the ownership leak. If true, the migration is mechanical — the shader is already compiled, the calling convention is already defined, the SPV cache is already on disk. If false, we know not to spend two weeks on it.

The spike was a standalone Rust binary (nx_vulkan/spike/vulkano_synth/) that took six arguments from argv (the SPV path, q_init, p_init, the obs+inv_mass packed buffer, the twenty-byte push constants, and the K-step count), allocated seven SSBOs through vulkano’s StandardMemoryAllocator, dispatched, and wrote four binary blobs as output. Two hundred fifty lines. The work was figuring out vulkano’s specialize method (it takes ahash::HashMap, not std::collections::HashMap, and the build fails until you add the dep) and push_constants (it takes a typed struct, not raw bytes, so the push block’s field order has to be expressed as a Rust #[repr(C)] struct).
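The push_constants point can be sketched in plain Rust with no vulkano dependency. The five field names below are invented for illustration; the text only establishes that the block is twenty bytes and that its field order has to be pinned with #[repr(C)]. The byte order assumes little-endian hosts.

```rust
use std::mem::size_of;

/// Hypothetical push-constant block: five 4-byte fields, 20 bytes in total.
/// The real names and order must mirror the shader's push_constant block.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PushConstants {
    pub n_params: u32,  // assumed: number of free parameters
    pub n_obs: u32,     // assumed: number of observations
    pub k_steps: u32,   // assumed: leapfrog steps per dispatch
    pub step_size: f32, // assumed: leapfrog step size
    pub reserved: u32,  // assumed: placeholder for the fifth field
}

// Compile-time check: the Rust layout is exactly the 20 bytes the shader reads.
const _: () = assert!(size_of::<PushConstants>() == 20);

impl PushConstants {
    /// Serialise the fields in declaration order, little-endian.
    pub fn to_bytes(&self) -> [u8; 20] {
        let mut out = [0u8; 20];
        out[0..4].copy_from_slice(&self.n_params.to_le_bytes());
        out[4..8].copy_from_slice(&self.n_obs.to_le_bytes());
        out[8..12].copy_from_slice(&self.k_steps.to_le_bytes());
        out[12..16].copy_from_slice(&self.step_size.to_le_bytes());
        out[16..20].copy_from_slice(&self.reserved.to_le_bytes());
        out
    }
}
```

Without #[repr(C)] the compiler is free to reorder fields, so the size check could still pass while the shader reads the wrong values. The attribute is what makes declaration order a contract.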

The binary ran on super-io, our Linux dev host with an RTX 3060 Ti, dispatched in 17.8 milliseconds. We then ran it against the C++ path on the same machine with the same inputs and the same SPV — cmp -s on all four output files returned zero. Byte-identical. Every q_chain, every p_chain, every gradient component, every log-probability, the same.
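For readers who want the same check inside a test harness rather than a shell, here is a rough Rust equivalent of running cmp -s over a set of output files. The directory layout and file names are assumptions made for the example, not the spike's actual output paths.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// True when both files hold exactly the same bytes
/// (the condition under which `cmp -s` exits with status 0).
pub fn files_identical(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(fs::read(a)? == fs::read(b)?)
}

/// Compare same-named output files in two directories and
/// return the names whose contents differ.
pub fn mismatched_outputs(
    dir_a: &Path,
    dir_b: &Path,
    names: &[&str],
) -> io::Result<Vec<String>> {
    let mut differing = Vec::new();
    for name in names {
        if !files_identical(&dir_a.join(name), &dir_b.join(name))? {
            differing.push(name.to_string());
        }
    }
    Ok(differing)
}
```

An empty result is the byte-identical verdict. Reading whole files into memory is fine for blobs of this size; a streaming comparison would be the better choice for anything large.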

We pushed the binary to mac-247, where the production GPU lives. We compiled it.

warning: `vulkano_synth` (bin "vulkano_synth_dispatch") generated 1 warning
    Finished `release` profile [optimized] target(s) in 3m 18s

Three minutes eighteen seconds. On FreeBSD 15.0. Zero special configuration. cargo build --release. We had not included this on any cross-platform checklist because we had not believed it would work. Vulkano had been written by people who treated FreeBSD as a deployment target rather than a curiosity, and the Rust ecosystem had not punished them for it. The dispatcher ran on the GT 650M at 66 milliseconds per K=32 dispatch — within ten percent of the C++ path’s 60 milliseconds, within noise. The four output blobs matched byte-for-byte. The hypothesis held. The migration was mechanical.

From spike to NIF, from NIF to backend

What followed was a sequence of commits that, in retrospect, read as the natural unfolding of a decision already made.

First, the spike became a sibling Rustler crate (nx_vulkan/native/nx_vulkan_vulkano/) exposing Nx.Vulkan.NativeV.leapfrog_chain_synth/6. This took binaries on input and produced binaries on output, allocating buffers internally so that no VkBuf* ever escaped the Rust call. Then it grew the buffer-lifecycle primitives — buf_upload, buf_alloc, buf_download, buf_byte_size, buf_upload_into — each backed by a Rustler resource holding a vulkano Subbuffer<[u8]>. The Subbuffer cannot outlive its parent Buffer. Vulkano enforces this at the Rust type level. The compiler refuses to build code that tries.
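The ownership shape described here can be sketched without vulkano or Rustler. DeviceBuffer and TensorHandle are stand-ins invented for this illustration (the real crate keeps a vulkano Subbuffer<[u8]> inside a Rustler resource), and the function names borrow from the NIFs listed above. The point is only that a shared handle, unlike a raw pointer, keeps its allocation alive for as long as anyone holds it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Counts how many stand-in device buffers have been released.
pub static FREED: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a GPU allocation.
pub struct DeviceBuffer {
    bytes: Vec<u8>,
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // Runs exactly once, when the last handle goes away.
        FREED.fetch_add(1, Ordering::SeqCst);
    }
}

/// What crosses the boundary: a reference-counted handle, never a raw pointer.
#[derive(Clone)]
pub struct TensorHandle {
    buf: Arc<DeviceBuffer>,
}

/// Copy host bytes into a fresh buffer and hand back a handle to it.
pub fn buf_upload(data: &[u8]) -> TensorHandle {
    TensorHandle { buf: Arc::new(DeviceBuffer { bytes: data.to_vec() }) }
}

/// Always valid: the handle itself guarantees the buffer still exists.
pub fn buf_byte_size(handle: &TensorHandle) -> usize {
    handle.buf.bytes.len()
}

/// Read the buffer contents back to the host.
pub fn buf_download(handle: &TensorHandle) -> Vec<u8> {
    handle.buf.bytes.clone()
}
```

If the sampler drops its handle while a dashboard poll still holds a clone, buf_byte_size on the clone keeps working and the buffer is released only after the clone is dropped too. That is the scenario that produced :badarg on the C++ path.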

Then it became a backend. Nx.Vulkan.VulkanoBackend implements the Nx.Backend behaviour with a tensor struct that holds a ResourceArc<VulkanoTensor> — the same Rustler resource type, now carrying its own shape and dtype. The five required storage callbacks (from_binary, to_binary, backend_copy, backend_transfer, backend_deallocate) route through the lifecycle NIFs. Then the compute callbacks, one stage at a time: elementwise binary (add, subtract, multiply, divide, pow, max, min) through a single apply_binary NIF that specialises elementwise_binary.spv at op-code 0 through 6; elementwise unary (exp, log, sqrt, abs, negate, sigmoid, tanh, floor, ceil, sign) through apply_unary; reductions (sum, reduce_max, reduce_min, all-axes and leading-and-trailing-axis) through reduce_axis; shape and movement (reshape and squeeze as zero-copy ref-rewraps, 2D transpose dispatching transpose.spv); 2D matmul dispatching matmul.spv for rank-2 by rank-2 with the standard contracting axes.
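The op-code dispatch for the binary stage might look roughly like the following. The mapping of operations to codes 0 through 6 is an assumption that simply follows the order in which the ops are listed above, and apply_binary_ref is a CPU reference for checking results, not the NIF itself.

```rust
/// The seven elementwise binary ops, numbered in the order the text lists them.
/// The numbering is assumed; the shader's specialisation constant is the authority.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum BinaryOp {
    Add = 0,
    Subtract = 1,
    Multiply = 2,
    Divide = 3,
    Pow = 4,
    Max = 5,
    Min = 6,
}

impl BinaryOp {
    /// All ops, indexed by op-code.
    pub const ALL: [BinaryOp; 7] = [
        BinaryOp::Add,
        BinaryOp::Subtract,
        BinaryOp::Multiply,
        BinaryOp::Divide,
        BinaryOp::Pow,
        BinaryOp::Max,
        BinaryOp::Min,
    ];

    /// Value to pass as the specialisation constant.
    pub fn op_code(self) -> u32 {
        self as u32
    }

    /// Inverse mapping; None for codes outside 0..=6.
    pub fn from_code(code: u32) -> Option<BinaryOp> {
        Self::ALL.get(code as usize).copied()
    }
}

/// CPU reference: what one dispatch should produce, element by element.
/// Returns None when the operand lengths differ.
pub fn apply_binary_ref(op: BinaryOp, a: &[f32], b: &[f32]) -> Option<Vec<f32>> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(&x, &y)| match op {
                BinaryOp::Add => x + y,
                BinaryOp::Subtract => x - y,
                BinaryOp::Multiply => x * y,
                BinaryOp::Divide => x / y,
                BinaryOp::Pow => x.powf(y),
                BinaryOp::Max => x.max(y),
                BinaryOp::Min => x.min(y),
            })
            .collect(),
    )
}
```

One shader plus one integer per op keeps the compiled artifact count small, and a CPU reference of this kind is a natural fit for a smoke test against Nx.BinaryBackend.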

Twenty-four ops across five commits. Each commit ended with the same ritual: smoke test against Nx.BinaryBackend on super-io, push to mac-247, rebuild, smoke test again on the GT 650M to confirm the FreeBSD path agreed.

The descriptor pool that ate the production GPU

There is a moment in every serious port when the second machine disagrees with the first machine and a hypothesis must be abandoned. The benchmark was a triple workload sized for ten minutes of wall time on the GT 650M: a square-matmul scaling sweep from 16-by-16 to 2048-by-2048, a mixed pipeline (matmul-add-sigmoid-subtract-multiply-sum at 128-by-128), and the regime model’s K=32 chain dispatch repeated five hundred times. super-io completed all three in 57.8 seconds. mac-247 made it through the matmul sweep, started the pipeline workload, and crashed at roughly the five-thousandth iteration:

** (MatchError) no match of right hand side value:
   {:error, :dispatch_failed, "descriptor set: a non-validation error occurred"}
   (nx_vulkan 0.0.1) Nx.Vulkan.VulkanoBackend.sigmoid/2

“Non-validation error” in vulkano’s vocabulary means the underlying Vulkan call returned an error that was not a documented usage violation. Something the driver did not like, something it would not explain, something that on FreeBSD had a lower tolerance than on Linux. The pipeline workload allocates fresh descriptor sets per dispatch — five thousand of them across the bench — and at some point the GT 650M’s NVIDIA driver concluded it had allocated enough.

The first instinct was to bump the descriptor pool size. Vulkano exposes a set_count field on StandardDescriptorSetAllocatorCreateInfo, defaulting to thirty-two. We raised it to a thousand. Re-ran. mac-247 still failed, at the same place, with the same error. The Linux box — the one that had previously sailed through the bench — took six times longer on small-matmul, dragged down by some pathology the larger pool had introduced. We had achieved nothing except slowing down the working host.

The actual diagnosis required reading vulkano’s source. StandardDescriptorSetAllocator (line 448 of vulkano-0.34.2/src/descriptor_set/allocator.rs) contains a doc comment that, in retrospect, was telling us exactly what was wrong:

Each time a thread allocates using some descriptor set layout, and either no pools were initialized yet or all pools are full, a new pool is allocated for that thread and descriptor set layout combination.

Per layout. Per thread. The clue we missed was that vulkano identifies layouts by Rust object identity, not by structural equivalence. Our compute NIFs built the layout fresh each call — ShaderModule::new with the same SPV bytes, PipelineDescriptorSetLayoutCreateInfo::from_stages on the same bindings, PipelineLayout::new with the same arguments — and the vulkano allocator, finding no match in its layout-to-pool map, allocated a fresh DescriptorPool for each one. The thirty-two-slot recycling never engaged because no two calls used the same layout. After five thousand pipeline dispatches on FreeBSD, the driver had been asked for five thousand pools.
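A toy model makes the failure easy to see. Nothing below is vulkano code: Layout and ToyAllocator are invented for the illustration, identity is modelled with a counter, and the thirty-two-slot recycling inside each pool is ignored. What the sketch does capture is the keying rule described above, one pool per layout identity regardless of structural equality.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Toy layout: `bindings` is the structure, `id` stands in for object identity.
pub struct Layout {
    id: u64,
    pub bindings: Vec<u32>,
}

impl Layout {
    /// Every call yields a new identity, even for identical bindings.
    pub fn new(bindings: Vec<u32>) -> Arc<Layout> {
        let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
        Arc::new(Layout { id, bindings })
    }
}

/// Toy allocator: keeps one pool per layout identity.
#[derive(Default)]
pub struct ToyAllocator {
    sets_per_pool: HashMap<u64, usize>,
}

impl ToyAllocator {
    pub fn allocate(&mut self, layout: &Arc<Layout>) {
        *self.sets_per_pool.entry(layout.id).or_insert(0) += 1;
    }

    pub fn pool_count(&self) -> usize {
        self.sets_per_pool.len()
    }
}

/// The failing pattern: rebuild a structurally identical layout per dispatch.
pub fn pools_with_fresh_layouts(dispatches: usize) -> usize {
    let mut allocator = ToyAllocator::default();
    for _ in 0..dispatches {
        let layout = Layout::new(vec![0, 1, 2]);
        allocator.allocate(&layout);
    }
    allocator.pool_count()
}

/// The intended pattern: build the layout once and reuse it.
pub fn pools_with_shared_layout(dispatches: usize) -> usize {
    let mut allocator = ToyAllocator::default();
    let layout = Layout::new(vec![0, 1, 2]);
    for _ in 0..dispatches {
        allocator.allocate(&layout);
    }
    allocator.pool_count()
}
```

In this model five thousand dispatches with fresh layouts yield five thousand pools, while the shared layout yields one. Raising the per-pool set count changes neither number, which is consistent with the larger pool having made no difference on mac-247.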

The fix was a hundred and twelve lines of Rust. A static OnceLock<Mutex<HashMap<(String, i32), CachedPipeline>>>, keyed by SPV path and specialisation constant, holding the Arc<PipelineLayout> and Arc<ComputePipeline>. Every NIF’s first order of business became get_or_create_pipeline. On the first call for a given (shader, op-code) pair, build the layout and pipeline, insert them under the key. On every subsequent call, return the cached Arc. The vulkano allocator now sees a single, stable layout identity per shader, recycles its thirty-two-slot pool indefinitely, and the underlying DescriptorPool count stays bounded.
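The shape of that cache, reduced to the standard library, is sketched below. CachedPipeline here is a placeholder struct rather than the real pair of vulkano Arcs, and it is wrapped in an Arc only so the sketch can hand out shared references; treat the details as assumptions about one reasonable way to write it.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Placeholder for the cached objects. The real entry holds
/// an Arc<PipelineLayout> and an Arc<ComputePipeline>.
pub struct CachedPipeline {
    pub spv_path: String,
    pub spec_constant: i32,
}

type PipelineCache = Mutex<HashMap<(String, i32), Arc<CachedPipeline>>>;

/// Process-wide cache, initialised lazily on first use.
static PIPELINES: OnceLock<PipelineCache> = OnceLock::new();

fn cache() -> &'static PipelineCache {
    PIPELINES.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return the pipeline for (shader, op-code), building it on first request.
pub fn get_or_create_pipeline(spv_path: &str, spec_constant: i32) -> Arc<CachedPipeline> {
    let mut map = cache().lock().expect("pipeline cache mutex poisoned");
    map.entry((spv_path.to_string(), spec_constant))
        .or_insert_with(|| {
            // In the real NIF this is where the shader module, layout,
            // and compute pipeline would be constructed.
            Arc::new(CachedPipeline {
                spv_path: spv_path.to_string(),
                spec_constant,
            })
        })
        .clone()
}

/// Number of distinct pipelines built so far.
pub fn cached_pipeline_count() -> usize {
    cache().lock().expect("pipeline cache mutex poisoned").len()
}
```

Two calls with the same key return the same Arc, so whatever identity the allocator keys on stays stable across dispatches. The mutex is held during construction, which serialises first-time builds; that is a simplification that seems acceptable when there are only a few dozen (shader, op-code) pairs.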

The re-race after the fix — with the larger pool reverted to the default thirty-two — finished. mac-247 completed all three workloads in 94.5 seconds. super-io in 57.8. No descriptor errors on either host. The numbers told a more interesting story than the verdict.

Workload	super-io (RTX 3060 Ti)	mac-247 (GT 650M)	ratio
matmul 16×16	1.14 ms	0.75 ms	mac-247 faster
matmul 64×64	1.25 ms	0.77 ms	mac-247 faster
matmul 256×256	2.39 ms	2.62 ms	par
matmul 1024×1024	8.68 ms	105.2 ms	12×
matmul 2048×2048	35.3 ms	782 ms	22×
pipeline 128×128	4.19 ms	4.78 ms	par
regime K=32	15.2 ms	58.7 ms	3.9×

The small matmul was faster on the GT 650M. A twenty-six-fold weaker GPU, on a thirteen-year-old machine, beating an RTX 3060 Ti by a third on the same workload. The explanation is that the workload is not a GPU workload. The workload is a dispatch-overhead workload — the time spent inside the Vulkan driver building command buffers, submitting, waiting on the fence — and FreeBSD’s NVIDIA driver path is meaningfully shorter than Linux’s when the Linux box is also running a desktop, a browser, and Discord. The crossover is somewhere between 256 and 1024: that’s where the GPU compute starts to dominate dispatch overhead, and that’s where the RTX 3060 Ti starts behaving like a 30× faster GPU.
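The ratios in the table can be recomputed directly from the two timing columns. The figures below are copied from the table; the threshold used to call a workload compute-bound is an arbitrary choice for the illustration.

```rust
/// (workload, super-io ms, mac-247 ms), copied from the race table.
pub const RACE: [(&str, f64, f64); 7] = [
    ("matmul 16×16", 1.14, 0.75),
    ("matmul 64×64", 1.25, 0.77),
    ("matmul 256×256", 2.39, 2.62),
    ("matmul 1024×1024", 8.68, 105.2),
    ("matmul 2048×2048", 35.3, 782.0),
    ("pipeline 128×128", 4.19, 4.78),
    ("regime K=32", 15.2, 58.7),
];

/// How many times slower mac-247 is on a row (below 1.0 means it is faster).
pub fn slowdown(row: (&str, f64, f64)) -> f64 {
    row.2 / row.1
}

/// First row, in table order, where mac-247 is more than `threshold` times slower.
pub fn first_compute_bound(threshold: f64) -> Option<&'static str> {
    RACE.iter()
        .find(|row| row.2 / row.1 > threshold)
        .map(|row| row.0)
}
```

With a threshold of 2.0 the first row flagged is the 1024×1024 matmul, which matches the crossover described above: below it the ratio hovers around one, above it the ratio tracks the hardware gap.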

We had built the migration to get correctness. We got a benchmark that revealed something else: the production target was already as fast as the dev host on the workloads that mattered, until the workloads stopped being small.

The autograd that wasn’t there

The plan had been to write backward callbacks. Stage 8 of the internal roadmap document allocated a session to it: every forward op gets a paired backward, the way EXLA does, the way PyTorch does. Implement the gradient of matmul. Implement the gradient of sigmoid. Carry the adjoint through every primitive. It is a lot of code, and the authors of every framework that has done it have written miserably about the experience.

Before writing any of it, I wrote a test. The test computed the gradient of sum((x - target)²) with respect to x, on Nx.Vulkan.VulkanoBackend, via Nx.Defn.grad. The vulkano backend had no backward callbacks. None. Nx.Defn.grad was supposed to fail loudly — function backward_X not implemented or some such — and confirm what to build.

It returned [-1.0, 0.0, 1.0, 2.0]. The exact gradient of the expression. To eight decimal places.

There is a moment in The Soul of a New Machine where Tom West realises that the machine has been working for hours because no one had thought to check it. We were now in that moment, in reverse. The machine had been working because we had built it correctly, and we were going to spend a session writing code that had already, transparently, been written for us.

Nx.Defn.grad is a graph transformation. It runs at compile time on the Nx.Defn.Expr — the symbolic AST that Nx builds when a defn function is invoked. For every forward operation in the AST, grad inserts the corresponding backward operation, expressed in terms of more forward operations. The gradient of multiply(a, b) with respect to a is b: that’s a multiply node and a passthrough, not a new primitive. The gradient of sigmoid(x) is sigmoid(x) * (1 - sigmoid(x)): that’s a sigmoid, a subtraction, and a multiplication. The gradient of a matmul involves a transpose and another matmul. Every gradient, recursively, all the way down to constants and the original inputs, is expressible in the same primitives the forward pass used.

The backend is asked, by Nx.Defn.Evaluator walking the gradient-augmented AST, only ever to execute forward operations. It is never told whether the operation it is executing is part of the original computation or the gradient machinery that hangs off it. To the backend, both look the same. The backend does not need to know.
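A toy version shows why the backend never needs to know. This is scalar symbolic differentiation over a six-node expression type invented for the example; Nx's real transformation works on tensor expressions and is considerably more involved. What carries over is the structure: grad returns another expression made of the same node types, and eval has no gradient-specific code at all.

```rust
use std::rc::Rc;

/// A tiny expression graph in one variable.
#[derive(Debug)]
pub enum Expr {
    Input,
    Const(f64),
    Add(Rc<Expr>, Rc<Expr>),
    Sub(Rc<Expr>, Rc<Expr>),
    Mul(Rc<Expr>, Rc<Expr>),
    Sigmoid(Rc<Expr>),
}

/// Derivative with respect to Input, expressed only in forward node types.
pub fn grad(e: &Rc<Expr>) -> Rc<Expr> {
    use Expr::*;
    match &**e {
        Input => Rc::new(Const(1.0)),
        Const(_) => Rc::new(Const(0.0)),
        Add(a, b) => Rc::new(Add(grad(a), grad(b))),
        Sub(a, b) => Rc::new(Sub(grad(a), grad(b))),
        // Product rule: a'b + ab'
        Mul(a, b) => Rc::new(Add(
            Rc::new(Mul(grad(a), b.clone())),
            Rc::new(Mul(a.clone(), grad(b))),
        )),
        // Chain rule: sigmoid(a) * (1 - sigmoid(a)) * a'
        Sigmoid(a) => {
            let s = Rc::new(Sigmoid(a.clone()));
            let one_minus_s = Rc::new(Sub(Rc::new(Const(1.0)), s.clone()));
            Rc::new(Mul(Rc::new(Mul(s, one_minus_s)), grad(a)))
        }
    }
}

/// The "backend": evaluates forward nodes and knows nothing about gradients.
pub fn eval(e: &Expr, x: f64) -> f64 {
    match e {
        Expr::Input => x,
        Expr::Const(c) => *c,
        Expr::Add(a, b) => eval(a, x) + eval(b, x),
        Expr::Sub(a, b) => eval(a, x) - eval(b, x),
        Expr::Mul(a, b) => eval(a, x) * eval(b, x),
        Expr::Sigmoid(a) => 1.0 / (1.0 + (-eval(a, x)).exp()),
    }
}
```

Building (x - 3) * (x - 3), calling grad, and evaluating the result at x = 5 goes through exactly the same eval as the forward pass. Adding a new differentiable op means adding one arm to grad; eval only changes if the op itself is new.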

We validated by running a complete Axon training step — Dense(8 → 16, sigmoid) → Dense(2), MSE loss against a target, the loss differentiated by Nx.Defn.value_and_grad with respect to the model parameters — on Nx.Vulkan.VulkanoBackend. The forward loss came back as 1.9074409, byte-identical to the Nx.BinaryBackend reference. The sum of the dense_0 kernel gradient was 2.7619750, against a reference of 2.7619752. Relative difference: eight point six times ten to the minus eight. f32 precision, end to end, on the regime trial’s long-since-shipped GPU.
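The relative difference quoted above is what a gap of a single representable f32 step looks like at this magnitude. A short check, assuming the two printed decimals round to the f32 values the backends actually returned:

```rust
/// Relative difference |a - b| / |b|, computed in f64 to avoid extra rounding.
pub fn relative_diff(a: f32, b: f32) -> f64 {
    ((a as f64) - (b as f64)).abs() / (b as f64).abs()
}

/// Number of representable f32 steps between two positive finite values.
pub fn ulp_distance(a: f32, b: f32) -> u32 {
    a.to_bits().abs_diff(b.to_bits())
}

/// The gradient sums reported for the vulkano backend and the reference.
pub const VULKANO_SUM: f32 = 2.7619750;
pub const REFERENCE_SUM: f32 = 2.7619752;
```

For values between 2 and 4 adjacent f32 numbers are 2^-22 apart, about 2.4e-7, and dividing that by 2.76 gives roughly 8.6e-8. The two sums land on neighbouring floats, so the backends disagree by the smallest amount f32 can express here.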

The session that had been budgeted for autograd implementation ended with a commit that touched twenty-six lines of code, none of which mentioned gradients.

After the bench

The first race was a yes/no question: does the migration cost performance? The data said no, within ten percent, on the workloads we measured. The proper question was the one that emerged after the pipeline cache: where does the vulkano backend sit relative to the other backends an Nx workload could actually choose?

The second bench answered it. Three backends present on every host that mattered: Nx.BinaryBackend (the single-threaded Elixir-implemented reference, slow but always correct), Nx.Vulkan.VulkanoBackend (the pure-Rust backend we had just shipped), and Nx.Vulkan.Backend (the C++ spirit backend that the migration was replacing). EXLA, the production-grade CUDA + CPU backend most Nx users default to, was nominally on the list, but it had two problems: it does not exist on FreeBSD (no precompiled XLA archive for amd64-freebsd15.0; building from source takes hours and a Bazel toolchain), and on super-io some combination of Mix.install and dependency resolution kept it from loading in our bench script. We will measure against it next time. For now, the comparison is BinaryBackend versus the two GPU paths.

Square matmul, side length from sixteen to a thousand twenty-four, in milliseconds per dispatch, median of forty to two hundred iterations:

matmul	BinaryBackend	VulkanoBackend	spirit (C++)
16×16 (super-io)	2.76	1.18	crashed
16×16 (mac-247)	2.51	1.06	1.16
64×64 (super-io)	130.76	7.07	—
64×64 (mac-247)	158.45	7.92	7.56
256×256 (super-io)	20,097	149.19	—
256×256 (mac-247)	13,891	136.10	141.73
1024×1024 (super-io)	n/a	2,323	—
1024×1024 (mac-247)	n/a	2,843	2,845

The matmul-256 row reveals what kind of speedup we’re actually delivering. BinaryBackend takes nearly fourteen seconds for a single 256-by-256 matrix multiply on the FreeBSD box, twenty seconds on the Linux one. Either Vulkan backend finishes the same operation in roughly 140 milliseconds. Taking the VulkanoBackend column, the factor is a hundred and two times faster on FreeBSD (13,891 / 136.10) and a hundred and thirty-five times faster on Linux (20,097 / 149.19). This is not because the GPU is fast. The GPU is from 2013. This is because BinaryBackend is a single-threaded Elixir loop doing scalar floating-point multiplications one at a time, and any backend that pushes the work down to a GPU dispatch, even a GPU dispatch on a GT 650M, wins by orders of magnitude.

The closer comparison is vulkano versus spirit. On every cell of the table where both numbers exist, they agree within ten percent, and within five percent from 64-by-64 upward. At 256-by-256 vulkano wins by between five and six milliseconds. At 64-by-64 spirit wins by just under four hundred microseconds. At 1024-by-1024 they finish within two milliseconds of each other on identical hardware. There is no compute regime, on the workloads we measured, where the C++ backend buys back the engineering cost of maintaining a C++ backend.

spirit on super-io crashed mid-bench with a memory-supervisor high-watermark warning, the same kind of resource-management fragility that motivated the migration in the first place. It worked on the production target. It did not survive a Linux box running a desktop session and a browser. We extracted no numbers from it. The blank cells in the table are real.

Elementwise and reduction workloads compressed the spread further. On a 16k-element add, vulkano took 27.6 ms on the GT 650M against spirit’s 38.9 ms, a thirty percent improvement that we suspect is pipeline-cache hits the C++ path never benefited from. On sigmoid, 13.9 vs 14.5 milliseconds, a wash. On a 1024-by-1024 sum, 1.36 vs 1.54 seconds, twelve percent in vulkano’s favour. None of these spreads change the strategic picture. The two paths are operationally interchangeable, and one of them does not crash when a posterior summary thread polls the wrong tensor.

Then we ran the workload nobody had measured properly yet: the eXMC regime model’s log-posterior, the actual thing the trader computes every five seconds when the dashboard polls. Two hundred observations, eight free random variables, the softmax-mixture custom likelihood lowered through Defn and evaluated end-to-end. f32 inputs.

workload (mac-247)	BinaryBackend	VulkanoBackend	spirit (C++)
exmc regime log_p	3.56 ms	10.76 ms	20.21 ms

The vulkano path is not just architecturally cleaner. On the workload that motivated the entire migration, it is roughly twice as fast as the C++ path. The reason is not a faster shader — the SPV files are identical — but something subtler: vulkano’s pipeline cache is making the small-dispatch hot path measurably leaner than the C++ shim’s per-call setup. BinaryBackend still wins on this particular workload because it’s ten milliseconds of mostly-small-tensor work where the GPU dispatch overhead doesn’t amortise. That gap closes for any realistic-sized problem; in the regime trial the actual call that matters is the K=32 leapfrog dispatch, where the GPU paths beat BinaryBackend by orders of magnitude regardless of implementation.

We then asked the question the original race had failed to answer: does either GPU path survive five thousand iterations without crashing? Same workload, a 128-by-128 matmul-sigmoid-divide loop, repeated five thousand times on the GT 650M:

backend	5000 iter wall	ms/iter	outcome
VulkanoBackend	26.9 s	5.39	completed
spirit	31.7 s	6.35	completed

Both completed this time. spirit’s failure pattern from the earlier session was triggered by the multi-stage benchmark that ran matmul scaling, switched to pipeline workload at different shape, and asked spirit’s descriptor allocator to handle the workload transition. A pure same-shape loop doesn’t exercise that path. The crash is still in there; it just needs the right trigger. The vulkano path doesn’t have the trigger because its layouts are stable by construction.

What this is and what this isn’t

What this is: a Vulkan compute backend in pure Rust (nx_vulkan_vulkano, twelve hundred lines including the chain shader dispatch from the spike) that handles real Axon training and real eXMC NUTS sampling on hardware where EXLA does not exist and the previous C++ path could not stay alive for three minutes. Cross-host validated on RTX 3060 Ti and GT 650M Mac Edition (FreeBSD 15.0). f32 hot paths through elementwise_binary.spv, elementwise_unary.spv, reduce_axis.spv, matmul.spv. f64 hot paths through the _f64.spv variants where the device exposes shaderFloat64 (it does on every NVIDIA chip we care about). Host fallback through Nx.BinaryBackend for the long tail of shape configurations and op patterns we have not yet implemented natively. The whole thing weighs less than three thousand lines of Rust and three hundred lines of Elixir.

What this isn’t: complete. Scholar, the classical-ML companion to Axon, now smoke-tests cleanly (linear regression by normal equation, coefficients matching BinaryBackend to 2e-6), but the algorithm only works because Scholar’s internal use of Nx.Block.LinAlg.SVD is routed through a host-fallback block/4 callback that transfers every tensor to BinaryBackend, evaluates the block there, and transfers results back. Correctness, not speed. Native linalg shaders (SVD, QR, Cholesky, solve) are pending. The matmul shader is still f32 only; the regime model’s f64 dot products fall back to host. No persistent buffer pool yet (per-call allocation through vulkano’s StandardMemoryAllocator works but costs us a millisecond per dispatch that caching could recover). The R4 cutover, the live trader on mac-247, has not been re-attempted with the new backend; the original crash was in the broader Vulkan backend’s buffer management, which the spike sidestepped rather than fixed. The migration is unfinished.

What it has done is settle two questions. The first is whether the C++ shim was load-bearing — whether the spirit backend, and the FFI ownership protocol it embedded, was the correct path forward despite its fragility. It was not. The ownership leak that crashed the trader was not a bug, in the sense that there was no line of code in violation of any documented contract. It was a category of mistake that the C++ type system could not detect, that the Rust type system detects automatically, and that the next port of any sufficiently-complex Vulkan workload to a long-running language runtime will encounter again, in identical form, until the FFI boundary is moved or the language is replaced.

The second is whether the framework’s architectural decisions matter for backend authors. They do. Nx.Defn.grad’s decision to express gradients as graph transformations over forward primitives — rather than as a separate vocabulary of backward operators the backend must implement — saved us a session of work and an unbounded amount of future maintenance. We did not need to implement backward_matmul because the graph already contains forward matmuls in places where backward matmuls should appear. The backend that knew the least about gradients was the one that supported them most completely.

The coda

At 05:21 on the morning of the race re-run, after the pipeline cache had landed, mac-247 was running the same workload that had killed the C++ backend in three minutes. The matmul scaling sweep, the mixed pipeline, the regime K=32 dispatches. Five thousand iterations of each, then ten thousand, then fifteen thousand, with the descriptor pool count staying flat at thirty-two and the dispatch latency steady at sixty milliseconds. Eventually I stopped the bench because there was no failure mode left to observe.

The GT 650M, having been judged more than a decade ago as unfit for serious GPU compute and having spent twelve years in a Mac Pro nobody wanted, was performing roughly the same work on the same shader in the same amount of time, through a different runtime, without crashing. The crash had not been about the GPU. The crash had not been about the compute. The crash had been about who owned the buffer and when. We had moved that question across a single FFI boundary into a different language and discovered that the question no longer existed.

The backend that worked best was the one that knew the least — about ownership, about gradients, about which tensor was about to be inspected by a dashboard polling thread on a five-second timer. It allocated buffers. It dispatched shaders. It returned bytes. The rest of the system, watching it from above, could not have cared less how, and that turned out to be the point.

The vulkano spike binary is at spike/vulkano_synth/. The NIF crate is at native/nx_vulkan_vulkano/. The backend is lib/nx_vulkan/vulkano_backend.ex. The race bench, the pipeline cache, the f64 shader paths, the Axon training step that validated the autograd reveal: commits 4d0014e (spike) through b357d0f (autograd via Axon training), on the main branch. The bench host that revealed the race’s crossover between dispatch-bound and compute-bound regimes was a 2013 Mac Pro running FreeBSD 15.0 with a GeForce GT 650M Mac Edition. It turns out to still be useful for something.