Tokasaurus: An LLM inference engine for high-throughput workloads

2025-06-05 21:27 · scalingintelligence.stanford.edu

TL;DR

We’re releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. With small models, Tokasaurus benefits from very low CPU overhead and dynamic Hydragen grouping to exploit shared prefixes. For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink and a fast implementation of pipeline parallelism for GPUs without. On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by up to 3x+.

Intro

As LLMs get smarter, faster, and cheaper, the community keeps finding new ways to use them. Our own recent work has explored using models to scan every file in a codebase, sample 10,000 attempts for math and code problems, and collaborate with other models to minimize cloud costs. Inference is now also an important part of the training process, where we use models to generate synthetic data or as part of RL pipelines that generate and train on model completions.

Crucially, these new inference workloads look quite different from the original LLM use case of serving a chatbot. Here, we care primarily about the total time and cost required to complete a large batch of sequences, and we care much less (if at all) about the individual latency of a single generation. In other words, we want high throughput!

Open-source inference engines (i.e. dedicated systems for running efficient LLM inference) like FlexGen, vLLM, and SGLang have been enormously valuable to the community. Inspired by and learning from these projects, we built a new engine, Tokasaurus, designed from the ground up to handle throughput-focused workloads. We’ve optimized Tokasaurus for efficiently serving large and small models alike, allowing it to outperform existing engines on throughput benchmarks. In the rest of this blog, we’ll walk through some of these optimizations and show off a few settings where Tokasaurus really shines.

Optimizing Small Models

To benchmark Tokasaurus with small models, we’ll use two workloads:

  • Completing chatbot prompts from the ShareGPT dataset (this is a common benchmark for testing inference engines).
  • Reproducing an experiment from Large Language Monkeys, where we take 128 problems from the GSM8K math dataset and sample 1024 answers to every problem. The distinguishing feature of this workload is that there’s a lot of prefix sharing across sequences.
[Figure: Tokasaurus throughput vs. vLLM and SGLang on the ShareGPT and Large Language Monkeys benchmarks]

Tokasaurus outperforms vLLM and SGLang on both of these benchmarks, in particular achieving over 2x the throughput of other engines on the Large Language Monkeys workload. Two main features contribute to these wins with small models:

An Asynchronous and Adaptive Manager

LLM engines perform many different tasks on the CPU, like handling web requests, tokenizing inputs/detokenizing outputs, managing KV cache allocation, and preparing inputs for the model. If these CPU-side tasks cause the GPU-side model to stall, throughput can take a big hit. To avoid these stalls, inference engines commonly make many CPU-side tasks asynchronous: while the GPU runs a forward pass for batch N, the CPU side of the engine post-processes the results from batch N-1 and prepares the inputs for batch N+1.

Tokasaurus goes one step further, making the CPU-side of the engine (what we call the manager) both asynchronous and adaptive. The manager’s goal is to maintain a deep queue of inputs for the model to run forward passes on. The manager monitors the size of this queue and can detect if the model is close to exhausting it (and therefore stalling the GPU). In these cases, the manager will automatically start skipping optional steps (like checking for stop strings and onboarding new sequences) until the model’s input queue is sufficiently deep again. This combination of asynchrony and adaptivity lets Tokasaurus serve small models with much less CPU overhead.
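As a rough illustration of this adaptivity, here is a minimal sketch of a manager step that only runs optional bookkeeping when the model’s input queue is deep enough. The names, threshold, and stubbed-out steps are hypothetical; this is not the actual Tokasaurus manager code.

```python
import queue

# Assumed queue-depth threshold below which the GPU is considered at risk of stalling.
TARGET_QUEUE_DEPTH = 4

def build_next_batch():
    """Stub: assemble token IDs, positions, and KV-cache pages for one forward pass."""
    return {"tokens": [], "positions": []}

def check_stop_strings():
    """Stub: scan recently generated text for user-specified stop strings."""

def onboard_new_sequences():
    """Stub: admit newly arrived requests into the set of running sequences."""

def manager_step(model_inputs: queue.Queue) -> None:
    # Required work: always keep the model's input queue fed.
    model_inputs.put(build_next_batch())

    # Optional work: only run it when the queue is deep enough that the GPU
    # won't stall while the CPU is busy with bookkeeping.
    if model_inputs.qsize() >= TARGET_QUEUE_DEPTH:
        check_stop_strings()
        onboard_new_sequences()

if __name__ == "__main__":
    q = queue.Queue()
    for _ in range(8):
        manager_step(q)
    print(f"queued batches: {q.qsize()}")
```

The key point is the priority ordering: feeding the GPU always happens, while deferrable bookkeeping is skipped whenever the queue runs low.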

Dynamic Prefix Identification and Exploitation

Prefix sharing comes up all the time in LLM inference — not just when repeatedly sampling like in the Large Language Monkeys benchmark, but also when asking many questions about a long document or reusing a system prompt across many chatbot conversations.

Shared prefixes allow attention to be computed more efficiently. We first explored this idea last year with Hydragen (a.k.a. cascade attention and bifurcated attention), but at the time we didn’t address the problem of detecting shared prefixes in an engine where sequences are constantly starting and finishing. Tokasaurus solves this detection problem with a greedy depth-first search, run before every model forward pass, that iteratively finds the longest shared prefixes possible. Hydragen is most impactful for small models, which spend a relatively larger fraction of their total FLOPs on attention.
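To make the grouping idea concrete, here is a toy sketch of greedy, depth-first prefix grouping. The function names and exact traversal are illustrative assumptions rather than the actual Tokasaurus algorithm; the point is that groups sharing progressively longer prefixes can each be handled with a single Hydragen-style shared-prefix attention computation.

```python
from collections import defaultdict

def longest_common_prefix(seqs):
    """Length of the longest prefix shared by every sequence in `seqs`."""
    shortest = min(len(s) for s in seqs)
    for i in range(shortest):
        tok = seqs[0][i]
        if any(s[i] != tok for s in seqs):
            return i
    return shortest

def find_prefix_groups(seqs, min_group_size=2):
    """Greedy depth-first grouping of token sequences by shared prefix.

    Returns (prefix_length, group) pairs; attention over each shared prefix
    can then be computed once per group instead of once per sequence.
    """
    groups = []

    def visit(group, known_shared):
        lcp = longest_common_prefix(group)
        if len(group) >= min_group_size and lcp > known_shared:
            groups.append((lcp, group))
            known_shared = lcp
        # Split on the first token past the shared prefix and recurse depth-first.
        children = defaultdict(list)
        for s in group:
            if len(s) > known_shared:
                children[s[known_shared]].append(s)
        for child in children.values():
            if len(child) >= min_group_size:
                visit(child, known_shared)

    if seqs:
        visit(list(seqs), 0)
    return groups

if __name__ == "__main__":
    # Two sequences share a longer prefix ([1, 2, 3]) than all three do ([1, 2]).
    prompts = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 9]]
    for length, group in find_prefix_groups(prompts):
        print(f"{len(group)} sequences share a prefix of length {length}")
```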

Optimizing Bigger Models

Tokasaurus can also efficiently serve bigger models across multiple GPUs! Here, the most important optimizations are our implementations of pipeline parallelism (PP) and tensor parallelism (TP), which allow us to maximize throughput on GPUs with or without NVLink.

Pipeline Parallelism for the GPU Poor

One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. PP needs a large batch in order to run efficiently, since batches from the manager are subdivided into microbatches that are spread out across pipeline stages. When optimizing for throughput, we’re generally already using the largest batch size that fits in GPU memory, so PP is often a natural fit for throughput-focused workloads. When benchmarking against vLLM’s and SGLang’s pipeline parallel implementations using Llama-3.1-70B on eight L40S GPUs, Tokasaurus improves throughput by over 3x:

[Figure: Llama-3.1-70B pipeline-parallel throughput on eight L40S GPUs, Tokasaurus vs. vLLM and SGLang]
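As a back-of-the-envelope illustration of why PP wants a large batch, the sketch below estimates per-stage utilization for a simple fill-and-drain schedule (the stage count and microbatch size are assumptions, and this is not the Tokasaurus scheduler): with M microbatches and S stages, one batch occupies the pipeline for roughly M + S - 1 stage-steps while each stage does useful work for only M of them, so utilization M / (M + S - 1) approaches 1 only when M is much larger than S.

```python
# Toy utilization model for a fill-and-drain inference pipeline (illustrative only).
NUM_STAGES = 8            # e.g., one pipeline stage per GPU on an 8x L40S node
MICROBATCH_TOKENS = 2048  # assumed microbatch size

def num_microbatches(batch_tokens: int) -> int:
    """Split one manager batch into fixed-size microbatches."""
    return max(1, batch_tokens // MICROBATCH_TOKENS)

def pipeline_utilization(batch_tokens: int) -> float:
    """Fraction of stage-steps doing useful work, ignoring communication costs."""
    m = num_microbatches(batch_tokens)
    return m / (m + NUM_STAGES - 1)

if __name__ == "__main__":
    for tokens in (2048, 16384, 131072):
        print(f"{tokens:>6} tokens -> ~{pipeline_utilization(tokens):.0%} pipeline utilization")
```

Larger batches mean more microbatches in flight, which keeps all eight stages busy and is why PP pairs so naturally with throughput-focused workloads.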

Async Tensor Parallelism for the GPU Rich

If you do have GPUs with NVLink (e.g. B200s and certain models of H100s and A100s), Tokasaurus has something for you too! Models in Tokasaurus can be compiled end-to-end with torch.compile, allowing us to take advantage of Async Tensor Parallelism (Async-TP). This is a relatively new PyTorch feature that overlaps inter-GPU communication with other computation, partially hiding the communication cost. In our benchmarks, we found that Async-TP adds significant CPU overhead to the model forward pass and only starts improving throughput at very large batch sizes (e.g. 6k+ tokens). Tokasaurus therefore maintains torch-compiled versions of our models both with and without Async-TP enabled, allowing us to automatically switch on Async-TP whenever the batch size is big enough:

[Figure: tensor-parallel throughput benchmark with and without Async-TP]
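A condensed sketch of that switching logic is shown below. The threshold and class structure are hypothetical, and the actual Async-TP setup (which goes through PyTorch’s distributed and compiler machinery) is elided; the sketch only shows the batch-size-based dispatch between two compiled variants.

```python
import torch

# Assumed cutoff; the benchmarks above suggest Async-TP only pays off around 6k+ tokens.
ASYNC_TP_MIN_TOKENS = 6144

class DualCompiledModel:
    """Keeps two end-to-end compiled variants of a model and picks one per forward pass."""

    def __init__(self, model_without_async_tp, model_with_async_tp):
        self.plain = torch.compile(model_without_async_tp)
        self.async_tp = torch.compile(model_with_async_tp)

    def __call__(self, inputs: torch.Tensor):
        # Async-TP adds CPU-side overhead, so only use it when the batch is large
        # enough for the overlapped communication savings to win overall.
        num_tokens = inputs.shape[0]
        model = self.async_tp if num_tokens >= ASYNC_TP_MIN_TOKENS else self.plain
        return model(inputs)

if __name__ == "__main__":
    # Stand-in module for illustration; a real engine would pass its two sharded variants.
    stub = torch.nn.Linear(32, 32)
    runner = DualCompiledModel(stub, stub)
    small_batch = torch.randn(128, 32)  # below the cutoff, so the plain variant runs
    print(runner(small_batch).shape)
```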

Try it Out

Tokasaurus started as an internal lab effort to run our inference experiments faster, and we’re excited to share it more broadly! You can check out the Tokasaurus code on GitHub and install the package from PyPI with:
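```
pip install tokasaurus
```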

Currently, we support models from the Llama-3 and Qwen-2 families, along with any combination of data, tensor, and pipeline parallelism within a single node.

Tokasaurus is written in pure Python (although we do use attention and sampling ops from the excellent FlashInfer package). We hope that this makes the engine easier to fork and hack on, à la GPT-fast.

Benchmarking Details

The commands for reproducing our benchmarks are available here. For each benchmark, we configure all engines with the same KV cache size and maximum number of running requests. We’ve made a best effort to tune each engine’s remaining parameters. We report the average throughput across runs after completing a warmup run. For each benchmark, all engines are run on the same machine.

We use this script from SGLang for our ShareGPT benchmarks and this custom script for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through each engine’s OpenAI-compatible API. We also experimented with vLLM’s Python API (i.e. LLM.generate()) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase (thanks to the vLLM team for the tip!).

Acknowledgements

Huge thanks to Prime Intellect and Together AI for providing us with compute for this project.

Also, we’re grateful to Dan Biderman, Simon Guo, Manat Kaur, and Avanika Narayan for beta testing the engine!

If you find Tokasaurus useful, please use the following citation:

@misc{juravsky2025tokasaurus,
  author = {Jordan Juravsky and Ayush Chakravarthy and Ryan Ehrlich and Sabri Eyuboglu and Bradley Brown and Joseph Shetaye and Christopher R{\'e} and Azalia Mirhoseini},
  title = {Tokasaurus: An LLM Inference Engine for High-Throughput Workloads},
  year = {2025},
  howpublished = {\url{https://scalingintelligence.stanford.edu/blogs/tokasaurus/}}
}

Comments

  • By refibrillator · 2025-06-06 02:05 (2 replies)

    The code has few comments but gotta love when you can tell someone was having fun!

    https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...

    I’m honestly impressed that a pure python implementation can beat out vLLM and SGLang. Granted they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Though dynamic shapes have still been a huge thorn in my side, I’ll need to look closer at how they pulled it off…

    • By bobrenjc93 · 2025-06-06 03:18 (1 reply)

      Hi! I work on dynamic shapes in pytorch and would love to hear more about the challenges you’ve run into. We’re always looking to improve the experience, so if you’re open to chatting, feel free to DM me on Twitter (@bobrenjc93) or email me at bobren@meta.com.

      • By gricardo99 · 2025-06-06 05:00 (1 reply)

        Since you work on pytorch, what would you say is the best place to ask questions about general usage and troubleshooting? I’ve been struggling with what I would consider a simple torchrun elastic training example, and haven’t found any good resources online. I’ve been spelunking through pytorch but have a feeling a little back and forth with someone familiar with these features would immensely clear things up.

        • By bobrenjc93 · 2025-06-06 05:10 (1 reply)

          PyTorch Dev Discuss is a fantastic forum where many core devs actively participate and answer questions: https://dev-discuss.pytorch.org

          In addition to Dev Discuss, a number of core contributors are also active on Twitter. Two particularly helpful and prolific voices are @ezyang and @cHHillee.

          Finally, don’t overlook GitHub issues—they’re a surprisingly effective way to start conversations. If you’ve found a bug or have ideas on how to improve the APIs, opening an issue is always welcome.

          • By almostgotcaught · 2025-06-06 11:26

            There's also the slack but you gotta know someone to get on that ;)

    • By chillee · 2025-06-06 04:49

      I mean, vllm and sglang are both "pure python" essentially as well. But yeah, in ML you rarely require C++ to get good performance for most of the systems people are writing.

  • By nabakin · 2025-06-05 22:36 (2 replies)

    > On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by up to 3x+.

    Looks like they don't compare to TensorRT-LLM throughput numbers which, last I checked, are SOTA in open source.

    • By andersa · 2025-06-06 00:53 (1 reply)

      TensorRT-LLM being open source is a lie, all the important kernels are loaded from cubins.

      • By nabakin · 2025-06-07 20:47

        Yeah you're right (although, they started to open source some of that recently iirc). I meant SOTA for inference engines we can actually download and use ourselves.

    • By qeternity · 2025-06-06 07:23

      It also appears that this was a sampling benchmark...which is not representative.

      Generation benchmark was 5% faster than SGLang.

  • By behnamoh · 2025-06-05 21:56 (3 replies)

    While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.

    • By YetAnotherNick · 2025-06-05 22:52 (1 reply)

      Depends on what production means for you. This is useful for batch production jobs.

      Also, this seems very useful for generating synthetic data or labelling a bunch of data. 6k batch size is small for data labelling.

      • By cpard · 2025-06-06 04:10 (1 reply)

        How big of a use case is synthetic data generation? I’m curious as I see a lot about it coming from academic projects but I haven’t seen much related to commercial use cases

        • By electroglyph · 2025-06-06 05:35 (1 reply)

          tiny NNs distilled from LLMs can produce some amazing results, i'm surprised it's not more common tbh

    • By bjt12345 · 2025-06-05 22:34 (1 reply)

      But surely next year’s production deployments will be very different from today’s, with different use cases, etc.

      • By jdiff · 2025-06-05 23:46

        Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.
