I kept pushing the 5090 build and cleared the 1,030 tok/s mark. With the current tuning, the benchmark lands at ~1033 tok/s (0.97 ms/tok) for single-token decode on Qwen3-0.6B, same bfloat16 weights, same sequence lengths. The changes are small but targeted: they shave a few microseconds per step by cutting launch overhead, tightening the LM head path, and trimming loop control in the RMSNorm hot paths.

The biggest win was collapsing the two-phase LM head into a single kernel. Phase 1 already finds each block’s (max_logit, idx) pair, and phase 2 just reduces them. Fusing them removes a kernel launch and a full global readback of the partial buffer. The fused kernel uses a single global counter: every block writes its max, block 0 spins until all blocks have arrived, then performs the final reduction locally. That ends up saving ~2–3 us per step, which is surprisingly meaningful once you are in the ~1 ms regime.
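A minimal sketch of what that fused kernel looks like, with all names and the shared-memory reduction shape as assumptions (the real kernel differs; this only illustrates the single-counter arrival scheme):

```cuda
#define NUM_BLOCKS 128  // assumed grid size

__device__ float g_partial_max[NUM_BLOCKS];
__device__ int   g_partial_idx[NUM_BLOCKS];
__device__ unsigned int g_arrived = 0;

__global__ void lm_head_argmax_fused(const float* logits, int vocab, int* out_idx) {
    __shared__ float smax[512];
    __shared__ int   sidx[512];

    // Old phase 1, inlined: each thread scans a strided slice of the vocab.
    float m = -INFINITY; int mi = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < vocab;
         i += gridDim.x * blockDim.x)
        if (logits[i] > m) { m = logits[i]; mi = i; }
    smax[threadIdx.x] = m; sidx[threadIdx.x] = mi;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && smax[threadIdx.x + s] > smax[threadIdx.x]) {
            smax[threadIdx.x] = smax[threadIdx.x + s];
            sidx[threadIdx.x] = sidx[threadIdx.x + s];
        }
        __syncthreads();
    }

    // Old phase 2, fused: publish the block result and bump the arrival counter.
    if (threadIdx.x == 0) {
        g_partial_max[blockIdx.x] = smax[0];
        g_partial_idx[blockIdx.x] = sidx[0];
        __threadfence();                 // make partials visible before arrival
        atomicAdd(&g_arrived, 1u);
    }
    // Block 0 spins until every block has arrived, then reduces locally.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (atomicAdd(&g_arrived, 0u) < gridDim.x) { /* spin */ }
        float best = -INFINITY; int best_i = 0;
        for (int b = 0; b < gridDim.x; ++b)
            if (g_partial_max[b] > best) { best = g_partial_max[b]; best_i = g_partial_idx[b]; }
        *out_idx = best_i;
        g_arrived = 0;                   // reset for the next decode step
    }
}
```

The `__threadfence()` before the `atomicAdd` is what makes the scheme safe: block 0 must not observe an arrival before the corresponding partial is globally visible.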

I also reduced the attention worker pool from 16 blocks to 8 (LDG_ATTN_BLOCKS=8). Each block now normalizes and rotates two Q heads before starting attention. This shifts a few warps from “do attention” to “do prefetch,” and the flag syncs become slightly cheaper because there are fewer attention blocks updating the counters. The attention math is still bandwidth-light at short sequence positions, so this trade stays favorable.
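Structurally, the split looks roughly like the fragment below. Every name here is an assumption except `LDG_ATTN_BLOCKS`; the two-heads-per-block mapping follows from Qwen3-0.6B's 16 Q heads (8 blocks × 2 heads covers them):

```cuda
#define LDG_ATTN_BLOCKS 8  // down from 16

// Inside the persistent kernel's attention phase (helper names hypothetical):
if (blockIdx.x < LDG_ATTN_BLOCKS) {
    int q_head = blockIdx.x * 2;       // each attention block owns two Q heads
    rmsnorm_rope_q(q_head);            // normalize + rotate head 0
    rmsnorm_rope_q(q_head + 1);        // normalize + rotate head 1
    attention(q_head);
    attention(q_head + 1);
} else {
    prefetch_next_layer_weights();     // spare blocks stream weights instead
}
```

Fewer attention blocks also means fewer participants in the arrival-counter flags, which is where the cheaper syncs come from.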

Finally, I doubled the per-thread stride in the three RMSNorm loops (input, post-attn, and final). It’s a tiny micro-optimization, but those loops execute for every layer and every step, so shaving loop overhead helps. I also added a direct decode kernel that accepts position and token_id as arguments, which removes the pinned HtoD memcpy staging when running the single-step decode path.
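The stride change is easiest to see in code. A sketch of one of those loops, with assumed shapes (hidden dim a multiple of 2 × blockDim.x, 512 threads per block) and only the per-thread sum shown:

```cuda
#include <cuda_bf16.h>

// Doubled per-thread stride: each loop trip now handles two elements,
// halving the compare/branch overhead of the loop itself.
__device__ float rmsnorm_sumsq_partial(const __nv_bfloat16* x, int n) {
    float ss = 0.f;
    for (int i = threadIdx.x; i < n; i += 2 * blockDim.x) {
        float a = __bfloat162float(x[i]);
        float b = __bfloat162float(x[i + blockDim.x]);  // assumes n % (2*blockDim.x) == 0
        ss += a * a + b * b;
    }
    return ss;  // block reduction and rsqrtf(ss / n + eps) scaling elided
}
```

The same idea applies to all three loops (input, post-attn, final); the savings are small per call but multiply across 28 layers every step.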

I spent this weekend writing a single CUDA megakernel of about 1,200 lines that runs an entire Qwen3-0.6B forward pass in one persistent GPU launch. It decodes, under very specific conditions, at 1,000 tokens/second on an RTX 5090 in bfloat16 (no quantization), limited mostly by memory bandwidth now.

This kernel descends from Elliot Arledge’s MegaQwen, which achieved 530 tok/s on an RTX 3090 but only ~490 tok/s on a 5090. I spent a full day tuning launch parameters and making other minor optimizations to push it to ~717 tok/s. This post will go through the full kernel architecture and every trick used in the stack, for educational purposes. You can find the source code here.

Part 1

Single-token decode in a 0.6B model is entirely memory-bound. Every step reads roughly 1.19 GB of weight data (~880 MB across 28 layers + 311 MB for the LM head) and does rather trivial arithmetic on it. The RTX 5090’s GDDR7 can deliver 1,674 GB/s of read bandwidth (93% of the 1,792 GB/s theoretical peak). At that rate, the absolute minimum step time is:

1,192 MB / 1,674 GB/s = 712 us = 1,404 tok/s

Our kernel achieves roughly 1,000 us per step, which means 712 us go to reading weights and 288 us to everything else: synchronization, instruction overhead, and the autoregressive token readback. The entire optimization story is about shrinking that 288 us.

Part 2

Everything runs inside ldg_decode_kernel_persistent: 128 thread blocks, 512 threads each, launched as a regular (non-cooperative) kernel. I did test higher block counts, but after many sweeps, 128 is indeed the sweet spot for 0.6B shapes. The blocks stay resident for the entire forward pass and synchronize using custom atomic barriers. After the megakernel finishes, two small kernels compute the LM head argmax.
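Because the launch is non-cooperative, there is no grid.sync(); a custom barrier has to be built from atomics. A minimal sketch of such a barrier (names assumed, not the kernel's actual implementation), using a generation counter so blocks cannot race across back-to-back barriers:

```cuda
__device__ unsigned int g_count = 0;  // blocks arrived at the current barrier
__device__ unsigned int g_gen   = 0;  // barrier generation number

__device__ void grid_barrier() {
    __syncthreads();                   // all threads of this block arrive first
    if (threadIdx.x == 0) {
        unsigned int gen = g_gen;
        // The last block to arrive resets the counter and releases everyone
        // by bumping the generation; the rest spin on the generation.
        if (atomicAdd(&g_count, 1u) == gridDim.x - 1) {
            g_count = 0;
            __threadfence();           // publish the reset before the release
            atomicAdd(&g_gen, 1u);
        } else {
            while (atomicAdd(&g_gen, 0u) == gen) { /* spin */ }
        }
    }
    __syncthreads();                   // release the rest of the block
}
```

Spinning on a generation rather than on the count itself is what keeps consecutive barriers from interfering: a fast block re-entering the next barrier cannot confuse stragglers still leaving the previous one.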

Each of the 28 layers proceeds through six phases: