I kept pushing the 5090 build and cleared the 1,030 tok/s mark. With the current tuning, the benchmark lands at ~1033 tok/s (0.97 ms/tok) for single-token decode on Qwen3-0.6B, same bfloat16 weights, same sequence lengths. The changes are small but targeted: they shave a few microseconds per step by cutting launch overhead, tightening the LM head path, and trimming loop control in the RMSNorm hot paths.

The biggest win was collapsing the two-phase LM head into a single kernel. Phase 1 already finds each block’s (max_logit, idx) pair, and phase 2 just reduces them. Fusing them removes a kernel launch and a full global readback of the partial buffer. The fused kernel uses a single global counter: every block writes its max, block 0 spins until all blocks have arrived, then performs the final reduction locally. That ends up saving ~2–3 us per step, which is surprisingly meaningful once you are in the ~1 ms regime.
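A minimal sketch of what that fused kernel looks like, with all names and the shared-memory reduction shape as assumptions (the real kernel differs; this only illustrates the single-counter arrival scheme):

```cuda
#define NUM_BLOCKS 128  // assumed grid size

__device__ float g_partial_max[NUM_BLOCKS];
__device__ int   g_partial_idx[NUM_BLOCKS];
__device__ unsigned int g_arrived = 0;

__global__ void lm_head_argmax_fused(const float* logits, int vocab, int* out_idx) {
    __shared__ float smax[512];
    __shared__ int   sidx[512];

    // Old phase 1, inlined: each thread scans a strided slice of the vocab.
    float m = -INFINITY; int mi = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < vocab;
         i += gridDim.x * blockDim.x)
        if (logits[i] > m) { m = logits[i]; mi = i; }
    smax[threadIdx.x] = m; sidx[threadIdx.x] = mi;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && smax[threadIdx.x + s] > smax[threadIdx.x]) {
            smax[threadIdx.x] = smax[threadIdx.x + s];
            sidx[threadIdx.x] = sidx[threadIdx.x + s];
        }
        __syncthreads();
    }

    // Old phase 2, fused: publish the block result and bump the arrival counter.
    if (threadIdx.x == 0) {
        g_partial_max[blockIdx.x] = smax[0];
        g_partial_idx[blockIdx.x] = sidx[0];
        __threadfence();                 // make partials visible before arrival
        atomicAdd(&g_arrived, 1u);
    }
    // Block 0 spins until every block has arrived, then reduces locally.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (atomicAdd(&g_arrived, 0u) < gridDim.x) { /* spin */ }
        float best = -INFINITY; int best_i = 0;
        for (int b = 0; b < gridDim.x; ++b)
            if (g_partial_max[b] > best) { best = g_partial_max[b]; best_i = g_partial_idx[b]; }
        *out_idx = best_i;
        g_arrived = 0;                   // reset for the next decode step
    }
}
```

The `__threadfence()` before the `atomicAdd` is what makes the scheme safe: block 0 must not observe an arrival before the corresponding partial is globally visible.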

I also reduced the attention worker pool from 16 blocks to 8 (LDG_ATTN_BLOCKS=8). Each block now normalizes and rotates two Q heads before starting attention. This shifts a few warps from “do attention” to “do prefetch,” and the flag syncs become slightly cheaper because there are fewer attention blocks updating the counters. The attention math is still bandwidth-light at short sequence positions, so this trade stays favorable.
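Structurally, the split looks roughly like the fragment below. Every name here is an assumption except `LDG_ATTN_BLOCKS`; the two-heads-per-block mapping follows from Qwen3-0.6B's 16 Q heads (8 blocks × 2 heads covers them):

```cuda
#define LDG_ATTN_BLOCKS 8  // down from 16

// Inside the persistent kernel's attention phase (helper names hypothetical):
if (blockIdx.x < LDG_ATTN_BLOCKS) {
    int q_head = blockIdx.x * 2;       // each attention block owns two Q heads
    rmsnorm_rope_q(q_head);            // normalize + rotate head 0
    rmsnorm_rope_q(q_head + 1);        // normalize + rotate head 1
    attention(q_head);
    attention(q_head + 1);
} else {
    prefetch_next_layer_weights();     // spare blocks stream weights instead
}
```

Fewer attention blocks also means fewer participants in the arrival-counter flags, which is where the cheaper syncs come from.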

Finally, I doubled the per-thread stride in the three RMSNorm loops (input, post-attn, and final). It’s a tiny micro-optimization, but those loops execute for every layer and every step, so shaving loop overhead helps. I also added a direct decode kernel that accepts position and token_id as arguments, which removes the pinned HtoD memcpy staging when running the single-step decode path.
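The stride change is easiest to see in code. A sketch of one of those loops, with assumed shapes (hidden dim a multiple of 2 × blockDim.x, 512 threads per block) and only the per-thread sum shown:

```cuda
#include <cuda_bf16.h>

// Doubled per-thread stride: each loop trip now handles two elements,
// halving the compare/branch overhead of the loop itself.
__device__ float rmsnorm_sumsq_partial(const __nv_bfloat16* x, int n) {
    float ss = 0.f;
    for (int i = threadIdx.x; i < n; i += 2 * blockDim.x) {
        float a = __bfloat162float(x[i]);
        float b = __bfloat162float(x[i + blockDim.x]);  // assumes n % (2*blockDim.x) == 0
        ss += a * a + b * b;
    }
    return ss;  // block reduction and rsqrtf(ss / n + eps) scaling elided
}
```

The same idea applies to all three loops (input, post-attn, final); the savings are small per call but multiply across 28 layers every step.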

I spent this weekend writing a single CUDA megakernel of about 1,200 lines that runs an entire Qwen3-0.6B forward pass in one persistent GPU launch. It decodes, under very specific conditions, at 1,000 tokens/second on an RTX 5090 in bfloat16 (no quantization), limited mostly by memory bandwidth now.

This kernel descends from Elliot Arledge’s MegaQwen, which achieved 530 tok/s on an RTX 3090 but only ~490 tok/s on a 5090. I spent a full day tuning launch parameters and making other minor optimizations to push it to ~717 tok/s. This post will go through the full kernel architecture and every trick used in the stack, for educational purposes. You can find the source code here.

Part 1

Single-token decode in a 0.6B model is entirely memory-bound. Every step reads roughly 1.19 GB of weight data (~880 MB across 28 layers + 311 MB for the LM head) and does rather trivial arithmetic on it. The RTX 5090’s GDDR7 can deliver 1,674 GB/s of read bandwidth (93% of the 1,792 GB/s theoretical peak). At that rate, the absolute minimum step time is:

1,192 MB / 1,674 GB/s = 712 us = 1,404 tok/s

Our kernel achieves roughly 1,000 us per step, which means 712 us go to reading weights and 288 us to everything else: synchronization, instruction overhead, and the autoregressive token readback. The entire optimization story is about shrinking that 288 us.

Part 2

Everything runs inside ldg_decode_kernel_persistent: 128 thread blocks, 512 threads each, launched as a regular (non-cooperative) kernel. I did test higher block counts, but after many sweeps, 128 is indeed the sweet spot for 0.6B shapes. The blocks stay resident for the entire forward pass and synchronize using custom atomic barriers. After the megakernel finishes, two small kernels compute the LM head argmax.
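Because the launch is non-cooperative, there is no grid.sync(); a custom barrier has to be built from atomics. A minimal sketch of such a barrier (names assumed, not the kernel's actual implementation), using a generation counter so blocks cannot race across back-to-back barriers:

```cuda
__device__ unsigned int g_count = 0;  // blocks arrived at the current barrier
__device__ unsigned int g_gen   = 0;  // barrier generation number

__device__ void grid_barrier() {
    __syncthreads();                   // all threads of this block arrive first
    if (threadIdx.x == 0) {
        unsigned int gen = g_gen;
        // The last block to arrive resets the counter and releases everyone
        // by bumping the generation; the rest spin on the generation.
        if (atomicAdd(&g_count, 1u) == gridDim.x - 1) {
            g_count = 0;
            __threadfence();           // publish the reset before the release
            atomicAdd(&g_gen, 1u);
        } else {
            while (atomicAdd(&g_gen, 0u) == gen) { /* spin */ }
        }
    }
    __syncthreads();                   // release the rest of the block
}
```

Spinning on a generation rather than on the count itself is what keeps consecutive barriers from interfering: a fast block re-entering the next barrier cannot confuse stragglers still leaving the previous one.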

Each of the 28 layers proceeds through six phases: