Why DeepSeek is cheap at scale but expensive to run locally

2025-06-01 · www.seangoedecke.com

Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond but fast once they get going?

AI inference providers often talk about a fundamental tradeoff between throughput and latency: for any given model, you can either serve it at high-throughput high-latency, or low-throughput low-latency. In fact, some models are so naturally GPU-inefficient that in practice they must be served at high-latency to have any workable throughput at all (for instance, DeepSeek-V3).

This tradeoff comes from the batch size the inference provider chooses for the model: not batching inference inside an individual request[1], but batching inference across tens or hundreds of concurrent user requests. It’s a peculiar feature of transformer-based LLMs that computing a batch of completions at the same time is almost as fast as computing a single completion. Why is that?

What is batch inference?

GPUs are good at doing big matrix multiplications (GEMMs, or “general matrix multiplications”). Say you have a single token that you want to pass through a model (i.e. by multiplying against all its weights - other architecture details aren’t relevant). You express that as a vector that matches the dimension (or hidden size) of the model (i.e. 1 x the width of its big weight matrices) and multiply it through. That’s 1 GEMM. But if you want to pass ten tokens through in a batch, that’s still only one GEMM, because you can stack the tokens into one matrix (10 x the model dimension). That’s a lot faster than doing ten slightly smaller GEMMs. So an inference server implementation might look something like this:

  1. A request comes in with a prompt
  2. That prompt is pre-filled (passed through attention - we’ll see later how that can be batched as well[2]), forming a KV cache and a token-sized matrix (1 x model-size) that will eventually become the predicted token[3]
  3. That token-sized matrix goes into a queue
  4. A GPU server pulls batches (e.g. of 128) off that queue, stacks them up into a 128 x model-size matrix, and multiplies them through the feed-forward model weights
  5. The end result is then split into 128 separate tokens
  6. The one for the original request is streamed back to the user
  7. Assuming that token isn’t an end-of-sequence token, return to step 2 to continue generating the next token in the response
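
A minimal sketch of steps 3-5, assuming hypothetical names and leaving out prefill, attention, and the KV cache: the server pulls queued per-request activations, stacks them, and runs one GEMM for the whole batch.

  import numpy as np

  d_model = 4096  # model hidden size (illustrative)
  W_ff = np.random.randn(d_model, d_model).astype(np.float32)  # stand-in for one FFN weight matrix

  def serve_batch(queue, batch_size=128):
      """Pull up to batch_size token vectors off the queue, stack them,
      and push them through the weights in a single GEMM."""
      batch = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
      stacked = np.stack(batch)   # shape: (len(batch), d_model)
      out = stacked @ W_ff        # one matrix multiply for the whole batch
      return list(out)            # split back into one row per request (step 5)

  # each queued item is one request's current 1 x model-size activation (step 3)
  queue = [np.random.randn(d_model).astype(np.float32) for _ in range(10)]
  outputs = serve_batch(queue)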

Note that the server decides how big a batch size to pull. It’s a tradeoff between throughput and latency. If you do no batching and just process tokens one by one, no user ever waits in a queue (step 3 above), so latency is low (assuming you have enough GPUs). However, if you do a lot of batching, latency is high because users will be waiting until the batch size fills up, but throughput will be much higher because the GPUs are being used more efficiently.

Why are GPUs faster at multiplying large matrices once than small matrices many times? Two reasons. First, there’s some overhead involved in issuing each command to the GPU, and one big multiplication can be launched with a single command. Second, each new GPU command involves fetching weights from memory, which can be expensive for large weights. If you run lots of small GEMMs, you can end up spending most of your time shipping weights in and out of memory instead of computing.
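
You can see the shape of this effect even on a CPU with NumPy (the gap is far larger on a GPU, where kernel-launch overhead and weight re-fetching dominate). A rough sketch, with illustrative sizes:

  import time
  import numpy as np

  d = 4096
  W = np.random.randn(d, d).astype(np.float32)     # the "weights"
  xs = np.random.randn(128, d).astype(np.float32)  # 128 token vectors

  start = time.perf_counter()
  for x in xs:              # 128 separate 1 x d GEMMs
      _ = x @ W
  one_by_one = time.perf_counter() - start

  start = time.perf_counter()
  _ = xs @ W                # a single 128 x d GEMM
  batched = time.perf_counter() - start

  print(f"one-by-one: {one_by_one*1e3:.1f}ms, batched: {batched*1e3:.1f}ms")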

Why are some models tuned for high batch sizes?

Typically an inference server will have a “collection window” where user requests come in and are queued. Chat servers typically aim for 5-10ms, but very high-batch backends might go as wide as 200ms. If a new request comes in at the start of the window, it might wait the entire window duration before being processed[4]. When the window closes, all the queued requests are batched up (i.e. all the 1 x model-size matrices are concatenated into a single 128 x model-size matrix) and that batch is sent through the pipeline. Running a batch like this is sometimes called a “tick”.
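
A minimal sketch of such a collection window, with hypothetical names (as footnote [4] notes, real stacks typically dispatch as soon as the batch fills rather than always waiting out the window):

  import time

  def collect_batch(queue, window_ms=200, max_batch=128):
      """Gather queued requests until the window expires or the batch is full."""
      deadline = time.monotonic() + window_ms / 1000
      batch = []
      while time.monotonic() < deadline and len(batch) < max_batch:
          if queue:
              batch.append(queue.pop(0))
          else:
              time.sleep(0.001)  # nothing waiting yet; poll again
      return batch               # one "tick" worth of requests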

As the explanation above suggests, you can run any model at any batch size. There’s nothing inherent in the batching process that would rule out some types of model. However, it is possible to build a model so GPU-inefficiently that it effectively needs batching in order to be practical.

Why mixture of experts requires higher batch sizes

For instance, take a mixture-of-experts model (like DeepSeek-V3 or supposedly the original GPT-4). You can get a strong model by training it to have hundreds and hundreds of “experts”: separate blocks of feed-forward weights, from which a routing layer picks a subset that’s used on each token. But a model like this is really GPU-inefficient. We can see why: GPUs want to do a small number of really big matrix multiplications, but if you have many experts you’re forced into many small multiplications. Unless you do your inference in batches, that’s going to mean low throughput.

Let’s think through how a “collection window” of 5ms versus one of 200ms would perform for a large mixture-of-experts model. Suppose you pick up ten user requests in that 5ms window. If you have many experts, some experts might end up only running against one or two tokens (i.e. the batch size for each expert will be much lower than the total number of requests you’ve picked up in your window). If, however, you wait for 200ms and pick up 4000 user requests, you are much more likely to saturate all your experts. At the cost of some latency, you’re making sure that your GEMMs are large and your GPUs are constantly utilized at their maximum capacity.
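
A toy illustration of that effect, assuming 256 experts with 8 routed per token (roughly DeepSeek-V3’s shape) and a random router standing in for the learned routing layer:

  import random
  from collections import Counter

  N_EXPERTS, TOP_K = 256, 8

  def tokens_per_expert(n_tokens):
      """Route each token to TOP_K randomly-chosen experts and count each expert's batch."""
      counts = Counter()
      for _ in range(n_tokens):
          for e in random.sample(range(N_EXPERTS), TOP_K):
              counts[e] += 1
      return counts

  for n_tokens in (10, 4000):  # the ~5ms window vs the ~200ms window
      counts = tokens_per_expert(n_tokens)
      print(f"{n_tokens} tokens: {len(counts)}/{N_EXPERTS} experts hit, "
            f"average batch per expert ~ {n_tokens * TOP_K / N_EXPERTS:.1f}")

With ten requests, most experts see zero or one token; with thousands, every expert gets a reasonably sized GEMM of its own.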

Why large pipelines require high batch sizes to avoid pipeline bubbles

For large models, it can be a challenge to keep the GPUs active at all. Large models typically have many transformer layers: i.e. hundreds of matrices of weights that make up the feed-forward network. The only way to do fast inference here is to pipeline those layers by having one GPU handle the first ten layers, another handle the next ten, and so on. Otherwise you just won’t be able to fit all the weights in a single GPU’s memory, so you’ll spend a ton of time swapping weights in and out of memory and it’ll end up being really slow. During inference, each token (typically in a “micro batch” of a few tens of tokens each) passes sequentially through that pipeline of GPUs.

How efficient your pipeline is depends on the number of layers you have and the size of your collection window. When you’re processing the tokens in a window during a “tick”, you’ll get some idle GPUs at the start (because GPUs in later layers won’t have anything to work on yet) and some more idle GPUs at the end (when there’s no more tokens in the queue, GPUs in early layers will have to wait for the next “tick”). These periods of idleness are sometimes called “warmup” and “drain”. If you have many small windows, you’re going to spend more GPU time in warmup and drain than if you have fewer large windows. By picking your window size, you’re thus directly trading off between throughput and latency.

If you have a ton of layers and your collection window is really short, you might sometimes end up with fewer tokens to process than layers. This is called a “pipeline bubble” - in effect the “drain” stage starts earlier than usual. You can’t eliminate warmup and drain (for reasons discussed below, inference has to operate in sequential “ticks”), but you can eliminate pipeline bubbles by making your collection window long enough. Pipeline bubbles can be absolutely brutal for model throughput, so inference providers always set their windows wide enough to avoid them. That adds noticeable latency for models with many layers.
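
You can put a rough number on this. If every stage and micro-batch takes about the same time, a tick that pushes m micro-batches through p pipeline stages keeps the GPUs busy for roughly m out of m + p - 1 stage-times. A sketch under that simplifying assumption:

  def pipeline_utilization(micro_batches: int, stages: int) -> float:
      """Fraction of stage-time slots doing useful work in one tick,
      assuming equal-cost stages and micro-batches."""
      return micro_batches / (micro_batches + stages - 1)

  for m in (4, 16, 64, 256):  # micro-batches per tick, i.e. how full the window was
      print(f"{m:4d} micro-batches through 16 stages: "
            f"{pipeline_utilization(m, 16):.0%} busy")

With only a handful of micro-batches per tick, most of the GPU-time goes to warmup and drain; with hundreds, utilization approaches 100%.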

Can’t you just keep the queue full?

Why couldn’t inference providers eliminate warmup and drain entirely by keeping the GPU queue full of tokens? In other words, couldn’t you do away with ticks altogether and just keep the token micro-batches flowing? Of course each user’s inference has to be sequential (since you can’t start generating the next token until the current token is done), but large inference providers should have enough concurrent traffic to keep the queue full of separate user requests.

I’ll confess I struggle to see why this shouldn’t be possible in theory. As far as I can tell the practical barrier is how the attention step is batched: if you want to batch up attention GEMMs, they need to all be the same shape (i.e. the same number of prior tokens in the sequence). So you have to run groups of the same shape at the same time, instead of being able to just maintain a single queue. There’s at least some public research on this front, but I wouldn’t be surprised if there were more clever tricks for doing this that I haven’t seen.
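
A sketch of that constraint, with hypothetical request objects: before the attention step, the scheduler has to bucket in-flight requests by current sequence length, because only same-shaped KV caches can be stacked into one attention call.

  from collections import defaultdict

  def group_for_attention(requests):
      """Bucket in-flight requests by current sequence length so each group's
      KV caches have the same shape and can be batched together."""
      groups = defaultdict(list)
      for req in requests:
          groups[req["seq_len"]].append(req)
      return groups

  requests = [{"id": i, "seq_len": n} for i, n in enumerate([4, 4, 7, 4, 9, 7])]
  for seq_len, group in sorted(group_for_attention(requests).items()):
      print(f"seq_len={seq_len}: attention batch of {len(group)} requests")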

Another idea: if you need ticks for the attention step, why not just have a tick-based attention inference system and a more efficient continuous system for the FFN? As I understand it, the reason is memory overhead:

  1. Since the attention output is needed for the FFN, you’d need to have some place in-memory to park it while it waits for its slot in the FFN queue, which would quickly become too expensive.
  2. Modern inference stacks are able to combine the attention and FFN step into a couple of large GEMMs in a single “operation”. If you’re doing these on different GPUs, you have to run different operations and shuttle the weights in and out of memory.

Summary

  • GPUs are most efficient on large GEMMs, so stacking many tokens into a single matrix multiply gives far higher token throughput than processing them one-by-one
  • During decoding, attention can only be batched for tokens at the same step, forcing schedulers to run in short “ticks”. How many tokens you pack into a single “tick” (i.e. how long you wait to collect tokens) is your batch size

    • These are tokens from different users. You can’t batch tokens from the same user because you need previous tokens to generate the next one, so batching requires a high volume of traffic from different users
  • Bigger batches raise latency because user tokens might be waiting up to 200ms before the batch is full enough to run, but they boost throughput by allowing larger (and thus more efficient) GEMMs in the feed-forward step
  • Models with many layers (i.e. long pipelines) need larger batches to avoid pipeline bubbles (by ensuring each tick contains more micro-batches than pipeline stages)
  • Mixture-of-experts models need to be served at high latency to be efficient: each expert sees only the tokens routed to it, so you need larger global batches to keep every expert busy.
  • Inference providers pick a batch size/window that clears pipeline bubbles and saturates experts. High batch sizes buy you more throughput at the cost of higher latency as tokens wait to fill up the tick
  • Some models (like DeepSeek’s) that are mixture-of-experts with many layers thus require large batch sizes and high latency, otherwise throughput drops off a cliff. That’s why it’s commonly said that you can’t easily run DeepSeek for personal use: because with a single user running one inference at a time, it runs at very low efficiency/throughput
  • The fact that OpenAI and Anthropic’s models are quick to respond suggests that either:

    • Their models have a more efficient architecture (non-MoE, fewer layers), or
    • OpenAI/Anthropic have some very clever tricks for serving inference, or
    • they’re paying through the nose for way more GPUs than they strictly need

[1] One commonly-observed strength of transformers is that they can batch prefill within a single user request. When you pass them a long prompt, they can process that prompt all at once because of how the attention mechanism works. Previous recurrent models had to go token-by-token, which was much slower (because it involved many more GEMMs). This has nothing to do with the kind of batching I’m talking about in this post. I’m talking about how you can efficiently batch inference across many different user requests once the prefilling is complete.

[2] This can also be batched, so long as you’re only batching attention operations with the same number of tokens in the sequence (i.e. every sequence predicting the fourth token can be batched together). Otherwise the sizes of the KV cache matrices are different, so you can’t easily combine them into a single batch. More on that later.

[3] Technically it’s not a token being generated, but the “logits” (a probability distribution across all possible tokens). I’ll say “token” here and later on to keep it simpler.

[4] Note that in practice modern inference stacks will use “continuous batching”, where a batch is sent off as soon as it’s full instead of waiting for the entire length of the fixed time window. However, the inference is still done in batches, so the core tradeoff between throughput and latency is the same.



Comments

  • By ryan_glass 2025-06-0118:5913 reply

    I run Deepseek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes GPU which in my opinion is not the best way to serve large models like this locally. I run a mid-range EPYC 9004 series based home server on a supermicro mobo which cost all-in around $4000. It's a single CPU machine with 384GB RAM (you could get 768GB using 64GB sticks but this costs more). No GPU means power draw is less than a gaming desktop. With the RAM limitation I run an Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original. It is around 270GB which leaves plenty of room for context - I run 16k context normally as I use the machine for other things too but can up it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. There are plenty of people running similar setups with 2 CPUs who run the full version at similar tokens/second.

    • By refibrillator 2025-06-0120:592 reply

      > Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original

      How close are we talking?

      I’m not calling you a liar OP, but in general I wish people perpetuating such broad claims would be more rigorous.

      Unsloth does amazing work, however as far as I’m aware even they themselves do not publish head to head evals with the original unquantized models.

      I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.

      However I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high complexity and long context tasks.

      • By danielhanchen 2025-06-0123:081 reply

        Oh hey :) Thanks for the kind words - we did provide benchmarks (MMLU, KLD, Perplexity) for Llama 4 Scout, Gemma 3 27B using our methodology - https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs and https://x.com/UnslothAI/status/1915476692786962441

        For R1 specifically, we did an internal benchmark on the original model - https://unsloth.ai/blog/deepseekr1-dynamic

        For R1-0528 specifically on evals - we're still running them :)) It's quite expensive to run, so we first do "vibe check" on some internal test cases, and they do pretty well!

        But we generally stress the bug fixes that we do, which objectively increase performance by +1 to sometimes +10% accuracy - for example Llama 4 bug fixes, Gemma bug fixes - https://news.ycombinator.com/item?id=39671146 etc are much more important :)

        We also provide Q8_0 and Q8_K_XL quants, which are mostly equivalent to FP8 - you can also use the magical `-ot ".ffn_.*_exps.=CPU"` incantation to offload MoE layers to RAM!

        • By saurik 2025-06-0216:36

          > All Distilled and the original R1 versions seem to have accidentally assigned the padding token to <|endofsentence|>, which is mostly not a good idea, especially if you want to further finetune on top of these reasoning models. This will cause endless infinite generations, since most frameworks will mask the EOS token out as -100.

          I couldn't tell if this was an error in the code running the model or in the model weights themselves; if/assuming the former, are these fixes being upstreamed to anywhere?

      • By ryan_glass 2025-06-0121:452 reply

        You are right that I haven't been rigorous - it's easy to benchmark tokens/second but quality of output is more difficult to nail down. I couldn't find any decent comparisons for Unsloth either. So I just tried a few of their models out, looking for something that was 'good enough' i.e. does all I need: coding, summarizing documents, troubleshooting anything and everything. I would like to see head to head comparisons too - maybe I will invest in more RAM at some stage but so far I have no need for it. I ran some comparisons between the smaller and larger versions of the Unsloth models and interestingly (for me anyway) didn't notice a huge amount of difference in quality between them. But, the smaller models didn't run significantly faster so I settled for the biggest model I could fit in RAM with a decent context. For more complex coding I use Deepseek R1 (again the Unsloth) but since it's a reasoning model it isn't real-time so no use as my daily driver.

        • By danielhanchen 2025-06-0123:091 reply

          Thanks for using our quants and appreciate it :) - We're still doing internal benchmarks since they're very slow to do - but they definitely pass our internal benchmarks :)

          • By ryan_glass 2025-06-028:21

            Thank you for making the dynamic quantisations! My setup wouldn't be possible without them and for my personal use, they do exactly what I need and are indeed excellent.

        • By ysosirius 2025-06-025:111 reply

          How do you find the quality of the output compares to that of, say, o3 or Sonnet 4?

          • By ryan_glass 2025-06-028:48

            To be honest I haven't used o3 or Sonnet as the code I work with is my own proprietary code which I like to keep private, which is one reason for the local setup. For troubleshooting day to day things I have found it at least as good as the free in-browser version of ChatGPT (not sure which model it uses).

    • By jeff_carr 2025-06-0119:145 reply

      I am impressed. Your personal website is down. HN doesn't allow private messages.

      I'm Jeff Carr. I co-founded digital ocean. I assume I can't post email addresses here, but I will try. lets see how smart things are from banning me. I am: wit AT wit com

      • By p12tic 2025-06-0120:511 reply

        State of the art of local models is even further.

        For example, look into https://github.com/kvcache-ai/ktransformers, which achieve >11 tokens/s on a relatively old two socket Xeon servers + retail RTX 4090 GPU. Even more interesting is prefill speed at more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.

        The above is achievable today. In the mean time Intel guys are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim that they achieve >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces over various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without GPU. Total cost of such setup will be around $10k once cheap Engineering samples hit eBay.

        • By qeternity 2025-06-0121:581 reply

          It's not impressive nor efficient when you consider batch sizes > 1.

          • By p12tic 2025-06-0122:001 reply

            All of this is for batch size 1.

            • By qeternity 2025-06-0223:341 reply

              I know. That was my point.

              Throughput doesn't scale on CPU as well as it does on GPU.

              • By p12tic 2025-06-0320:36

                We both agree. Batch size 1 is only relevant to people who want to run models on their own private machines. Which is the case of OP.

      • By saagarjha 2025-06-027:27

        Pretty sure you can post email addresses here, this is mine: saagar@saagarjha.com. It's more about avoiding spam.

      • By stavros 2025-06-0214:54

        You can post emails fine, you just might get spammed (because it's a public forum).

      • By adastra22 2025-06-020:03

        You can put your email in your profile

      • By trustinmenowpls 2025-06-020:061 reply

        fyi, your website is also down... wit.com doesn't resolve for me

        • By x______________ 2025-06-020:322 reply

          Bold of you to assume that an email domain needs a web server listening on port 80 for http packets..

          • By trustinmenowpls 2025-06-129:25

            I went to his linkedin which has a link to wit.com as his website

          • By simondotau 2025-06-025:20

            You don’t even need an A/AAAA record on the domain.

    • By twotwotwo 2025-06-0123:03

      The latest V3 strikes me as a really practical go-to among open-weights models. Lots of tasks don't need the reasoning tokens, and not having to wait for them is nice. (If something does need it you can always switch.) If you're not running it yourself a couple providers have it with full context, 80tps, and a promise not to use your data.

      9004 home server is awesome!

    • By platevoltage 2025-06-0120:02

      Impressive. I need to look more into this. I'm doing my best to limit my LLM usage to what I can run locally.

    • By nardi 2025-06-0119:123 reply

      What's your prompt processing speed? That’s more important in this situation than output TPS. If you have to wait minutes to start getting an answer, that makes it much worse than a cloud-hosted version.

      • By ryan_glass 2025-06-0121:09

        Prompt eval time varies a lot with context but it feels real-time for short prompts - approx 20 tokens per second but I haven't done much benchmarking of this. When there is a lot of re-prompting in a long back and forth it is still quite fast - I do use KV cache which I assume helps and also quantize the KV cache to Q8 if I am running contexts above 16k. However, if I want it to summarize a document of say 15,000 words it does take a long time - here I walk away and come back in about 20 minutes and it will be complete.

      • By ryao 2025-06-0119:24

        If he is doing multiturn conversations, he can reuse the kv cache from the last turn and skip the prompt processing on the history that would make time to first token too slow, by only doing prompt processing on his actual prompt for the current turn. This turns a quadratic amount of tokens to process into a linear number. I am not sure if this is what he is doing, but that is what I would do if I had his hardware.

      • By pclmulqdq 2025-06-0119:241 reply

        I assume KV caching makes this a non issue, but I'm also curious.

        • By idonotknowwhy 2025-06-0123:271 reply

          If you're just chatting with it starting with "Hi", that's correct. The conversation remains in the KV cache as it grows gradually.

          But if you're posting code, writing drafts, or even small snippets of articles, etc in there it becomes a huge problem.

          • By pclmulqdq 2025-06-020:14

            Usually, when people think about the prompt tokens for a chat model, the initial system prompt is the vast majority of the tokens and it's the same regardless for many usage modes. You might have a slightly different system prompt for code than you have for English or for chatting, but that is 3 prompts which you can permanently put in some sort of persistent KV cache. After that, only your specific request in that mode is uncached.

    • By mechagodzilla 2025-06-0123:021 reply

      I use a dual-socket 18-core (so 36 total) xeon with 768GB of DDR4, and get about 1.5-2 tokens/sec with a 4-bit quantized version of the full deepseek models. It really is wild to be able to run a model like that at home.

      • By stirfish 2025-06-020:271 reply

        Dumb question: would something like this have a graphics card too? I assume not

        • By mechagodzilla 2025-06-0223:32

          Yeah, it was just a giant HP workstation - I currently have 3 graphics cards in it (but only 40GB total of VRAM, so not very useful for deepseek models).

    • By jbellis 2025-06-0120:093 reply

      impressive, but that's 1/5 to 1/10 of the throughput that you'd get with a hosted provider, with 1/4 to 1/8 the supported context

      • By ryan_glass 2025-06-028:29

        It might be 5 to 10 times slower than a hosted provider but that doesn't really matter when the output is still faster than a person can read. Context wise, for troubleshooting I have never needed over 16k and for the rare occasion when I need to summarise a very large document I can change up the model to something smaller and get a huge context. I have never needed more than 32k though.

      • By michelsedgh 2025-06-0120:132 reply

        Dude he's running locally, and I think this setup is the best bang for the buck if you wanna run locally, we're not comparing to data centers, you gotta keep it in perspective. That's very impressive results for running local. Thanks for the numbers you saved me a chatgpt search :)

        • By carstenhag 2025-06-0120:271 reply

          Title says: locally it's expensive

          Other person says: I had to spend 4000$ and it's still slow

          • By justsid 2025-06-0123:05

            Not to mention that $4000 is in fact expensive. If anything the OP really makes the point of the articles title.

        • By BoorishBears 2025-06-0123:431 reply

          CPU-only is really terrible bang for your buck, and I wish people would stop pushing these impractical builds on people genuinely curious in local AI.

          The KV cache won't soften the blow the first time they paste a code sample into a chat and end up waiting 10 minutes with absolutely no interactivity before they even get first token.

          You'll get an infinitely more useful build out of a single 3090 and sticking to stuff like Gemma 27B than you will out of trying to run Deepseek off a CPU-only build. Even a GH200 struggles to run Deepseek at realistic speeds with bs=1, and there's an entire H100 attached to CPU there: there just isn't a magic way to get "affordable fast effective" AI out of a CPU offloaded model right now.

          • By ryan_glass 2025-06-028:351 reply

            The quality on Gemma 27B is nowhere near good enough for my needs. None of the smaller models are.

            • By BoorishBears 2025-06-029:03

              And that's fine, but the average person asking is already willing to give up some raw intelligence going local, and would not expect the kind of abysmal performance you're likely getting after describing it as "fast".

              I setup Deepseek bs=1 on a $41,000 GH200 and got double digit prompt processing speeds (~50 tk/s): you're definitely getting worse performance than the GH200 was, and that's already unacceptable for most users.

              They'd be much better served spending less money than you had to spend and getting an actually interactive experience, instead of having to send off prompts and wait several minutes to get an actual reply the moment the query involves any actual context.

    • By goldielox 2025-06-069:01

      So, in your opinion, hardware wise, as a general purpose tinkering/learning self lab hardware, how would you grade the decked out framework desktop for 2.7k?

    • By blindriver 2025-06-0123:582 reply

      I thought GPUs with a lot of extremely fast memory was required for inference. Are you saying that we can accomplish inference with just a large amount of system memory that is non-unified and no GPU? How is that possible?

      • By ryan_glass 2025-06-0213:43

        Basically it comes down to memory bandwidth of server CPUs being decent. A bit of oversimplification here but... The model and context have to be pulled through RAM (or VRAM) every time a new token is generated. CPUs that are designed for servers with lots of cores have decent bandwidth - up to 480GB/s with the EPYC 9 series and they can use 16 channels simultaneously to process memory. So, in theory they can pull 480GB through the system every second. GPUs are faster but you also have to fit the entire model and context into RAM (or VRAM) so for larger models they are extremely expensive because a decent consumer GPU only has 24GB of VRAM and costs silly money, if you need 20 of them. Whereas you get a lot of RDIMM RAM for a couple thousand bucks so you can run bigger models and 480GB/s gives output faster than most people can read.

      • By adastra22 2025-06-020:021 reply

        I’m confused as to why you think a GPU is necessary? It’s just linear algebra.

        • By oreoftw 2025-06-020:132 reply

          most likely he was referring to the fact that you need plenty of GPU-fast memory to keep the model, and GPU cards have it.

          • By adastra22 2025-06-0214:13

            There is nothing magical about GPU memory though. It’s just faster. But people have been doing CPU inference since the first llama code came out.

    • By 3eb7988a1663 2025-06-0121:291 reply

      Do you have hard numbers on the idle/average/max power draw? I assumed that server machines are built as if they are going to red-lined constantly so put less effort into low-utilization optimizations.

      • By ryan_glass 2025-06-0121:591 reply

        No hard numbers I'm afraid in that I don't monitor the power draw. But the machine uses a standard ATX power supply: a Corsair RM750e 750W PSU and the default TDP of the CPU is 280W - I have my TDP set at 300W. It is basically built like a desktop - ATX form factor, fans spin down at idle etc.

        • By 3eb7988a1663 2025-06-024:49

          Approximation is still better than I was expecting. You said supermicro and I was assuming a pizza box with dual power supplies sucking down 1kw at idle. That it can run with a large, but not unreasonable PSU says enough.

    • By 6Az4Mj4D 2025-06-020:18

      Can we run Deepseek using Ollama or something similar for code generation like Github copilot on a 40 core CPU with about 256GB RAM say 200 GB usable for the model?

    • By dotancohen 2025-06-0122:322 reply

      Just curious what your use cases are? What type of texts are you producing?

      Thank you.

      • By ysosirius 2025-06-025:13

        I've always wondered this as well, and never seem to get an answer. Why would someone want to do this when they can get a better result either renting in the cloud, or just using a subscription?

        Obviously I see the value in having something local from a control and privacy perspective, but it's surely always a net loss in terms of quality and capability of output, right?

      • By ryan_glass 2025-06-0213:501 reply

        Coding, my own proprietary code hence my desire for local hosting, a decent amount of legacy code. General troubleshooting of anything and everything from running Linux servers to fixing my car. Summarizing and translation of large documents occasionally. Also, image generation and other automations but obviously not LLMs for this.

        • By dotancohen 2025-06-0219:47

          Terrific, thank you.

          If you don't mind another question, how do you adapt the LLM to your codebase? Keep the whole thing in context? Fine tune on your own code? Fine tune on lots of code in whatever language you're using (e.g. Python, Rust)? Just rely on the original model training?

          Thank you very much!

    • By pclmulqdq 2025-06-0119:231 reply

      CPUs are quietly becoming very well-balanced machines for BS 1 inference. The latest Intel Xeons should be at ~20 TPS.

      • By Spooky23 2025-06-0120:581 reply

        A base Mac Mini is ~20 :)

        • By pclmulqdq 2025-06-0122:43

          Oh yeah, I did that math not assuming any quantization. I think if you can get a 3-4 bit quant working + int8 math, ~80 might be achievable.

  • By ipieter 2025-06-0112:414 reply

    This is an interesting blogpost. While the general conclusion ("We need batching") is true, inference of mixture of experts (MoE) models is actually a bit more nuanced.

    The main reason we want big batches is because LLM inference is not limited by the compute, but by loading every single weight out of VRAM. Just compare the number of TFLOPS of an H100 with the memory bandwidth, there's basically room for 300 FLOP per byte loaded. So that's why we want big batches: we can perform a lot of operations per parameter/weight that we load from memory. This limit is often referred to as the "roofline model".

    As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.

    So what MoE allows is expert parallelism, where different nodes keep different experts in memory and don't need to communicate as much between nodes. This only works if there are enough nodes to keep all experts in VRAM and have enough overhead for other stuff (KV cache, other weights, etc). So naturally the possible batch size becomes quite large. And of course you want to maximize this to make sure all GPUs are actually working.

    • By zozbot234 2025-06-0115:001 reply

      You could load different "experts" in a round-robin way on a single node and only aggregate "batches" opportunistically, when you just have multiple requests in-flight that all happen to rely on the same "expert". The difference being that instead of "batches", you would only really have queues. Of course this would come with a sizeable increase in latency, but that's acceptable for many applications (such as for "deep research" workflows)

      • By jchrisa 2025-06-0115:09

        This is very much like Erlang's actor model. The same compute can be run in parallel, or managed via queues. With Erlang's strong support for FFI and process control, I wonder if it's being used as a dispatcher for these sorts of workloads.

    • By ryao 2025-06-0119:472 reply

      > As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.

      Inference works by computing layers and then having a very small vector that you send to the next layer as input. When a model does not fit in a single GPU, you just divide it into layers and send the vector over a fabric to the GPU holding the next layer. The transfer happens so quickly that there is a negligible amount of idle time and then the next layer can be computed. The fastest inference on the planet at Cerebras uses this technique to do 2500T/sec on Llama 4 Maverick.

      • By jimmySixDOF 2025-06-0120:321 reply

        Groq and Cerebras both take a big chip approach to architecture and, at least in the case of Groq, they only make economic sense under high batch loads.

        https://x.com/swyx/status/1760065636410274162?s=46

        • By ryao 2025-06-023:37

          There is nothing big about Groq’s chips. Their individual chips have only 230 MB RAM. Unlike Cerebras, which can load multiple layers into a single chip, Groq must divide a layer across many chips.

      • By ipieter 2025-06-0222:441 reply

        Distributing inference per layer, instead of splitting each layer across gpus, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only 1 gpu at a time), so inference is slower. In addition, the orchestration of starting the next batch on gpu #0 while gpu #1 starts is quite tricky. For this reason, tensor parallelism as I described is way more common in LLM inference.

        • By ryao 2025-06-0223:35

          In what software? llama.cpp and others divide things by layers.

    • By cyptus 2025-06-0116:482 reply

      could such a network with all its nodes and weights be deployed to an analog circuit and be superfast?

      • By TuringNYC 2025-06-0122:53

        Do you mean something like this? https://www.etched.com/

      • By rpmisms 2025-06-0118:501 reply

        Please go into more detail about this proposal, this piqued my interest in a really strange way.

        • By cyptus 2025-06-0119:011 reply

          The idea is to replicate the weights of the network in the electronics. Somewhat like how our brains work? This way an analog input signal could lead to a neural-network-processed output signal without the digital emulation on a GPU. As this is very much simplified, the question is whether this could work for modern LLMs?

          • By koiueo 2025-06-0120:351 reply

            Suddenly "temperature" parameter starts making sense

            (If you ever tried fine-tuning an analog circuit, you'll know how finicky the process is due to the environment, including temperature)

            • By cyptus 2025-06-0122:10

              haha very true!

    • By iwontberude 2025-06-0115:172 reply

      And this is the investment case for AMD, models fit entirely in a single chassis, and side benefit: less tariffed network equipment to interconnect compute. Map/reduce instead of clustered compute.

      Edit: when downvoting, please offer some insight why you disagree

      • By dragonwriter 2025-06-0116:151 reply

        How is that a unique advantage for AMD?

        • By latchkey 2025-06-0116:543 reply

          AMD is consistently stacking more HBM.

            H100 80GB HBM3
            H200 141GB HBM3e
            B200 192GB HBM3e
          
            MI300x 192GB HBM3
            MI325x 256GB HBM3e
            MI355x 288GB HBM3e
          
          This means that you can fit larger and larger models into a single node, without having to go out over the network. The memory bandwidth on AMD is also quite good.

          • By ryao 2025-06-0119:322 reply

            It really does not matter how much memory AMD has if the drivers and firmware are unstable. To give one example from last year:

            https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

            They are currently developing their own drivers for AMD hardware because of the headaches that they had with ROCm.

            • By latchkey 2025-06-0119:391 reply

              "driver" is such a generic word. tinygrad works on mi300x. If you want to use it, you can. Negates your point.

              Additionally, ROCm is a giant collection of a whole bunch of libraries. Certainly there are issues, as with any large collection of software, but the critical thing is whether or not AMD is responsive towards getting things fixed.

              In the past, it was a huge issue, AMD would routinely ignore developers and bugs would never get fixed. But, after that SA article, Lisa lit a fire under Anush's butt and he's taking ownership. It is a major shift in the entire culture at the company. They are extremely responsive and getting things fixed. You can literally tweet your GH issue to him and someone will respond.

              What is true a year ago isn't today. If you're paying attention like I am, and experiencing it first hand, things are changing, fast.

              • By ryao 2025-06-0120:362 reply

                I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that it is fixed, only for new evidence to come out that they are not. I have no reason to believe it is fixed given the history.

                Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.

                • By DiabloD3 2025-06-024:361 reply

                  You're conflating two different things.

                  ROCm isn't part of AMD drivers, it's a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

                  The part of ROCm you're interested in is HIP; HIP is the part that does legacy CUDA emulation. HIP will never be complete because Nvidia keeps adding new things, documents things wrong, and also the "cool" stuff people do on Nvidia cards aren't CUDA and it is out of scope for HIP to emulate PTX (since that is strongly tied to how historical Nvidia architectures worked, and would be entirely inappropriate for AMD architectures).

                  The whole thing with Tinygrad's "driver" isn't a driver at all, it's the infrastructure to handle card to card ccNUMA on PCI-E-based systems, which AMD does not support: if you want that, you buy into the big boy systems that have GPUs that communicate using Infinity Fabric (which is, itself, the HyperTransport protocol over PCI-E PHY instead of over HyperTransport PHY; PCI over PCI-E has no ability to handle ccNUMA meaningfully).

                  Extremely few customers, AMD's or not, want to share VRAM directly over PCI-E across GPUs since most PCI-E GPU customers are single GPU. Customers that have massive multi-GPU deployments have bought into the ecosystem of their preferred vendor (ie, Nvidia's Mellanox-powered fabrics, or AMD's wall-to-wall Infinity Fabric).

                  That said, AMD does want to support it if they can, and Tinygrad isn't interested in waiting for an engineer at AMD to add it, so they're pushing ahead and adding it themselves.

                  Also, part of Tinygrad's problem is they want it available from ROCm/HIP instead of a standards compliant modern API. ROCm/HIP still has not been ported to the modern shader compiler that the AMD driver uses (ie, the one you use for OpenGL, Vulkan, and Direct family APIs), since it originally came from an unrelated engineering team that isn't part of the driver team.

                  The big push in AMD currently is to unify efforts so that ROCm/HIP is massively simplified and all the redundant parts are axed, so it is purely a SPIR-V code generator or similar. This would probably help projects like Tinygrad someday, but not today.

                  • By ryao 2025-06-025:232 reply

                    > ROCm isn't part of AMD drivers, its a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

                    AMD says otherwise:

                    > AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

                    https://www.amd.com/en/products/software/rocm.html

                    The issues involving AMD hardware not only applied to the drivers, but to the firmware below the drivers:

                    https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

                    Tinygrad’s software looks like a userland driver:

                    https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...

                    It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.

                    • By DiabloD3 2025-06-026:091 reply

                      AMD is extremely bad at communications. The driver already contains everything ROCm requires to talk to the GPU, and ROCm itself is only a SDK that contains runtimes, libraries, and compilers.

                      This part of TinyGrad is not a driver, however it tries to hijack the process to do part of that task. You cannot boot the system with this, and it does not replace any part of the Mesa/DRI/DRM/KMS/etc stack. It does reinitialize the hardware with a different firmware, which might be why you think this is a driver.

                      • By ryao 2025-06-027:451 reply

                        I consider it to be a driver, or at least part of one. Userspace drivers exist. Graphic drivers originally were entirely in userspace, until portions of them were moved into the kernel for kernel mode setting and DRM. These days, graphics drivers themselves have both kernel mode and user mode components. The shader compiler for example would be a user mode component.

                        • By DiabloD3 2025-06-039:341 reply

                          I'm aware. One of the biggest things in fixing the Linux desktop was no longer needing drivers in the Xserver and needing it to be suid root.

                          What was linked is written in Python. Nothing in Python is ever going to be a userland driver.

                          • By ryao 2025-06-089:39

                            There is no reason people cannot write userland drivers in Python.

                    • By latchkey 2025-06-026:081 reply

                      https://community.amd.com/t5/ai/what-s-new-in-amd-rocm-6-4-b...

                      ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from ROCm user space.

                      • By DiabloD3 2025-06-026:20

                        They were doing this before, the difference with this is, the version of ROCm you use is locked to the driver versions that are supported, which is a very narrow range.

                        With this new thing, the backend API is now formalized and easier to support wider range of difference.

                • By latchkey 2025-06-0121:131 reply

                  We have all been hearing things for decades. Things are noticeably different now. Live in the present, not in the past.

                  Tinygrad isn’t a driver. It is a framework. It is being developed by George however he wants. If he wants to build something that gives him more direct control over things, fine. Others might write PTX instead of using higher level abstractions.

                  Fact is that tinygrad runs not only on AMD, but also Nvidia and others. You might want to reassess your beliefs because you’re reading into things and coming up with the wrong conclusions.

                  • By ryao 2025-06-023:461 reply

                    I read tinygrad’s website:

                    https://tinygrad.org/#tinygrad

                    Under driver quality for AMD, they say “developing” and point to their git repository. If AMD had fixed the issues, they would instead say the driver quality is great and get more sales.

                    They can still get sales even if they are honest about the state of AMD hardware, since they sell Nvidia hardware too, while your company would risk 0 sales if you say anything other than “everything is fine”, since your business is based on leasing AMD GPUs:

                    https://hotaisle.xyz/pricing/

                    Given your enormous conflict of interest, I will listen to what George Hotz and others are saying over what you say on this matter.

                    • By latchkey 2025-06-024:351 reply

                      Exactly, it is not a driver.

                      Appreciate you diving more into my business. Yes, we are one of the few that publishes transparent pricing.

                      When we started, we got zero sales, for a long time. Nobody knew if these things performed or not. So we donated hardware and people like ChipsAndCheese started to benchmark and write blog posts.

                      We knew the hardware was good, but the software sucked. 16 or so months later, things have changed and sufficiently improved that now we are at capacity. My deep involvement in this business is exactly how I know what’s going on.

                      Yes, I have a business to run, but at the same time, I was willing to take the risk, when no-one else would, and deploy this compute. To insinuate that I have some sort of conflict of interest is unfair, especially without knowing the full story.

                      At this juncture, I don’t know what point you’re trying to make. We agree the software sucked. Tinygrad now runs on mi300x. Whatever George’s motivations were a year ago are no longer true today.

                      If you feel rocm sucks so badly, go the tinygrad route. Same if you don’t want to be tied to cuda. Choice is a good thing. At the end of the day though, this isn’t a reflection on the hardware at all.

                      • By ryao 2025-06-024:551 reply

                        I hope your business works out for you and I am willing to believe that AMD has improved somewhat, but I do not believe AMD has improved enough to be worth people’s time when Nvidia is an option. I have heard too many nightmares and it is going to take many people, including people who reported those nightmares, reporting improvements for me to think otherwise. It is not just George Hotz who reported issues. Eric Hartford has been quiet lately, but one of the last comments he made on his blog was not very inspiring:

                        > Know that you are in for rough waters. And even when you arrive - There are lots of optimizations tailored for nVidia GPUs so, even though the hardware may be just as strong spec-wise, in my experience so far, it still may take 2-3 times as long to train on equivalient AMD hardware. (though if you are a super hacker maybe you can fix it!)

                        https://erichartford.com/from-zero-to-fineturning-with-axolo...

                        There has been no follow-up “it works great now”.

                        That said, as for saying you have a conflict of interest, let us consider what a conflict of interest is:

                        https://en.wikipedia.org/wiki/Conflict_of_interest

                        > A conflict of interest (COI) is a situation in which a person or organization is involved in multiple interests, financial or otherwise, and serving one interest could involve working against another.

                        You run a company whose business is dependent entirely on leasing AMD GPUs. Here, you want to say that AMD’s hardware is useful for that purpose and no longer has the deluge of problems others reported last year. If it has not improved, saying such could materially negatively impact your business. This by definition is a conflict of interest.

                        That is quite a large conflict of interest, given that it involves your livelihood. You are incentivized to make things look better than they are, which affects your credibility when you say that things are fine after there has been ample evidence in the recent past that they have not been. In AMD’s case, poor driver quality is something that they inherited from ATI and the issues goes back decades. While it is believable that AMD has improved their drivers, I find it difficult to believe that they have improved them enough that things are fine now, given history. Viewing your words as being less credible because of these things might be unfair, but there have been plenty of people whose livelihoods depended on things working before you that outright lied about the fitness of products. They even lied when people’s lives were at risk:

                        https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...

                        You could be correct in everything you say, but I have good reason to be skeptical until there has been information from others corroborating it. Blame all of the people who were in similar positions to yours that lied in the past for my skepticism. That said, I will keep my ears open for good news from others who use AMD hardware in this space, but I have low expectations given history.

                        • By latchkey 2025-06-025:441 reply

                          Funny to see you quoting Eric, he’s a friend and was just running on one of our systems. AMD bought credits from us and donated compute time to him as part of the big internal changes they’re pushing. That kind of thing wouldn’t have happened a year ago. And from his experience, the software has come a long way. Stuff is moving so fast, that you aren't even keeping up, but I am the one driving it forward.

                          https://x.com/cognitivecompai/status/1929260789208142049

                          https://news.ycombinator.com/item?id=44154174

                          And sigh, here we are again with the conflict of interest comments, as if I don’t get it. As I said, you don’t know the full story, so let me spell it out. I’m not doing this for money, status, or fame. I’m fortunate enough that I don’t need a job, this isn’t about livelihood or personal gain.

                          I’m doing this because I genuinely care about the future of this industry. I believe AI is as transformational as the early Internet. I’ve been online since 1991 (BBS before that), and I’ve seen how monopolies can strangle innovation. A world where one company controls all AI hardware and software is a terrible outcome. Imagine if Cisco made every router or Windows was the only OS. That’s where we’re headed with Nvidia, and I refuse to accept that.

                          Look at my history and who my investor is, this isn’t some VC land grab. We truly care about decentralizing and democratizing compute. Our priority is getting this previously locked up behind supercomps HPC into the hands of as many developers as possible. My cofounder and I are lifelong nerds and developers, doing this because it matters.

                          Right now, only two companies are truly competing in this space. You’ve fairly pointed out failures of Cerebras and Groq. AMD is the only one with a real shot at breaking the monopoly. They’re behind, yes. But they were behind in CPUs too, and look where that went. If AMD continues on the path they’re on now, they can absolutely become a viable alternative. Make no mistake, humanity needs an alternative and I'll do my best to make that a reality.

                          • By ryao 2025-06-026:531 reply

                            Ask Eric to consider writing a new blog post discussing the state of LLM training on AMD hardware. I would be very interested in reading what he has to say.

                            AMD catching up in CPUs required that they become competent at hardware development. AMD catching up in the GPGPU space would require that they become competent at software development. They have a long history of incompetence when it comes to software development. Here are a number of things Nvidia has done right contrasted with what AMD has done wrong:

                              * Nvidia aggressively hires talent. It is known for hiring freshly minted PhDs in areas relevant to them. I heard this firsthand from a CS professor whose specialty was in compilers who had many former students working for Nvidia. AMD is not known for aggressive hiring. Thus, they have fewer software engineers to put on tasks.
                            
                              * Nvidia has a unified driver, which reduces duplication of effort, such that their software engineers can focus on improving things. AMD maintains separate drivers for each platform. AMD tried doing partial unification with vulkan, but it took too long to develop, so the Linux community developed its own driver and almost nobody uses AMD’s unified Vulkan driver on Linux. Instead of killing their effort and adopting the community driver for both Linux and Windows, they continued developing their driver that is mostly only used on Windows.
                            
                              * Nvidia has a unified architecture, which further deduplicates work. AMD split their architecture into RDNA and CDNA, and thus must implement the same things for each where the two overlap. They realized their mistake and are making UDNA, but the damage is done and they are behind because of their RDNA+CDNA misadventures. It will not be until 2026 that UDNA fixes this.
                            
                              * Nvidia proactively uses static analysis tools on their driver, such as Coverity. This became public when Nvidia open sourced the kernel part of their Linux driver. I recall a Linux kernel developer who works on static analysis begging the amdgpu kernel driver developers to use static analysis tools on their driver, since many obvious issues that were being caught by static analysis tools were going unaddressed.
                            
                            There are big differences between how Nvidia and AMD do engineering that make AMD’s chances of catching up slim. That is likely to be the case until they start behaving more like Nvidia in how they do engineering. They are slowly moving in that direction, but so far, it has been too little, too late.

                            By the way, AMD’s software development incompetence applies to the CPU side of their business too. They had numerous USB issues on the AM4 platform due to bugs in AGESA/UEFI. There were other glitches too, such as memory incompatibilities. End users generally had to put up with them, although AMD, in conjunction with some motherboard vendors, eventually managed to fix the issues. I had an AM4 machine that would not boot reliably with 128GB of RAM, and this persisted until, after suffering for years, I replaced the motherboard with one of the last AM4 motherboards made. Then there was this incompetence that even affected AM5:

                            https://blog.desdelinux.net/en/Entrysign-a-vulnerability-aff...

                            AMD needs to change a great deal before they have any hope of competing with Nvidia GPUs in HPC. The only thing going for them in HPC for GPUs is that they have relatively competent GPU hardware design. Everything else about their GPUs has been a disaster. I would not be surprised if Intel manages to become a major player in the GPU market before AMD manages to write good drivers. Intel, unlike AMD, has a history of competent software development. The major black mark on their history would be the initial Windows ARC drivers, but they were able to fix a remarkable number of issues in the time since their discrete GPU launch, and have fairly good drivers on Windows now. Unlike AMD, they did not have a history of incompetence, so the idea that they fixed the vast majority of issues is not hard to believe. Intel will likely continue to have good drivers after they have made competitive hardware to pair with them, provided that they have not laid off their driver developers.

                            I have more hope in Intel than I have in AMD and I say that despite knowing how bad Intel is at doing anything other than CPUs. No matter how bad Intel is at branching into new areas, AMD is even worse at software development. On the bright side, Intel’s GPU IP has a dual role, since it is needed for their CPU’s iGPUs, so Intel must do the one thing they almost never do when branching into new areas, which is to iterate. The cost of R&D is thus mostly handled by their iGPUs and they can continue iterating on their discrete graphics until it is a real contender in the market. I hope that they merge Gaudi into their GPU development effort, since iterating on ARC is the right way forward. I think Intel having an “AMD moment” in GPUs is less of a longshot than AMD’s recovery from the AM3 fiasco and less of a long shot than AMD becoming competent at driver development before Intel either becomes good at GPGPU or goes out of business.

                            • By latchkey 2025-06-027:051 reply

                              Trying to find fault over UDNA is hilarious, they literally can't win with you.

                              My business model is to support viable alternatives. If someone else comes along and develops something that looks viable and there is customer demand for it, I'll deploy it.

                              You totally lost me at having more hope with Intel. I'm not seeing it. Gaudi 3 release was a nothing burger and is only recently deployed on IBM Cloud. Software is the critical component and if developers can't get access to the hardware, nobody is going to write software for it.

                              • By ryao 2025-06-027:27

                                I fixed some autocorrect typos that were in my comment. I do not find fault with UDNA and I have no idea why you think I do. I find fault with the CDNA/RDNA split. UDNA is what AMD should have done in the first place.

                                As for Gaudi 3, I think it needs to be scrapped and used as an organ donor for ARC. In particular, the interconnect should be reused in ARC. That would be Intel’s best chance of becoming competitive with Nvidia.

                                As for AMD becoming competitive with Nvidia, their incompetence at software engineering makes me skeptical. They do not have enough people. The people they do have are divided across too many redundant efforts. They do not have their people following good software engineering practices such as static analysis. They also work the people they do have long hours (or so I have read), which of course is going to result in more bugs. They need a complete culture change to have any chance of catching up to Nvidia on the software side of things.

                                As for Intel, they have a good software engineering culture. They just need to fix the hardware side of things, and I consider that to be much less of a stretch than AMD becoming good at software engineering. Their recent Battlematrix announcement is a step in the right direction. They just need to keep improving their GPUs and add an interconnect to fulfill the role of NVLink.

            • By faldore 2025-06-0122:201 reply

              That was last year. MI300X firmware and software have gotten much better since then.

              • By ryao 2025-06-024:282 reply

                Unfortunately, AMD and ATI before it have had driver quality issues for decades; and both they and their fans have claimed that they have solved the problems every year since.

                Even if they have made progress, I doubt that they have reached parity with Nvidia. I have had enough false hope from them that I am convinced the only way they will ever improve their drivers is if they let another group write the drivers for them.

                Coincidentally, Valve has been developing the Vulkan driver used by SteamOS and other Linux distributions, which is how SteamOS is so much better than Windows. If AMD could get someone else to work on improving their GPGPU support, we would likely see it become quite good too. Until then, I have very low expectations.

                • By latchkey 2025-06-024:421 reply

                  Gaming != HPC

                  • By ryao 2025-06-025:161 reply

                    GPUs were originally designed for gaming. Their ability to be used in HPC grew out of that. The history of issues goes back rather far.

                    • By latchkey 2025-06-025:201 reply

                      Thanks, I totally had no idea what the G stood for. /s

                      Seriously though, you’re clearly stuck in the past. This is tech. It evolves fast.

                      Holding onto grudges just slows you down.

                      • By ryao 2025-06-025:301 reply

                        The G stood for graphics.

                        As for being stuck in the past, I got fed up in 2006 after 8 years of nothing but ATI graphics. I spent years hoping that the issues would be fixed after the latest update, but they never were. I had a fairly problem-free experience after switching to Nvidia. When issues did occur, Nvidia fixed them within months. While enjoying the relatively problem-free experience on Nvidia, I would hear people claim everything was fixed on ATI (and later AMD), only to hear waves of people complaining about issues. Then Valve got involved with the driver development for AMD graphics and made the Steam Deck. I bought one and it has been fantastic. I still hear about numerous problems involving drivers AMD wrote (especially their Windows drivers), but I am using drivers that were in part authored by Valve, and Valve fixed the issues AMD was incapable of fixing themselves.

                        You claim that things are fine for HPC on AMD graphics hardware, but I have reason to be skeptical given that numerous people have reported severe problems just last year with no follow up that the various headaches have been fixed.

                        Also, you have repeatedly claimed that tinygrad’s software is not a driver, yet I see a userland driver here:

                        https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...

                        As I have said elsewhere: It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.

                        I am going to listen to others and my own eyes over you on these matters.

          • By krapht 2025-06-0118:441 reply

            So the MI300x has 8 different memory domains, and although you can treat it as one flat memory space, if you want to reach their advertised peak memory bandwidth you have to work with it like an 8-socket board.

          • By dragonwriter 2025-06-022:071 reply

            MI355X isn't out yet, and the upcoming B300 also has 288GB HBM3e

            • By latchkey 2025-06-022:58

              June 12th.

              B300 is Q4 2025.

              Yes, they keep leapfrogging each other. AMD is still ahead in vram.

      • By rixed 2025-06-025:11

        > when downvoting, please offer some insight why you disagree

        And a reminder that (down)voting is not for (dis)agreement.

  • By perching_aix 2025-06-0110:407 reply

    For those looking to save time, the answer is batched inference. Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.

    This is also why you may experience a variance in replies when using these services, even when you set the temperature to 0 and the seed to a fixed value. It's cause you don't control the other prompts yours get batched with. Could this be a data exfiltration attack vector? Probably, I didn't "research" that far.
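
    A minimal sketch of what that looks like at the matmul level (PyTorch; the weight matrix and sizes here are made up for illustration):

      import torch

      hidden = 4096                          # model dimension
      w_ff = torch.randn(hidden, hidden)     # stand-in for one feed-forward weight matrix

      # One decode step for a single request: a (1 x hidden) @ (hidden x hidden) GEMM.
      one_token = torch.randn(1, hidden)
      _ = one_token @ w_ff

      # One decode step for 128 concurrent requests: stack their current token
      # states into a (128 x hidden) matrix and do ONE big GEMM instead of 128 small ones.
      batch = torch.randn(128, hidden)
      out = batch @ w_ff                     # on a GPU this is nearly as fast as the single-token case
      # out[i] is then split back out and streamed to request i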

    • By yjftsjthsd-h 2025-06-0111:541 reply

      > Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.

      I naively assumed providers did that with all models. Or does it only work for this (family of?) model(s)?

      • By hansvm 2025-06-0114:27

        It works for a lot of families but not all. You need a high enough degree of sharing of model weights between different queries for that to make sense (memory access being the usual bottleneck nowadays, though smaller models see something similar with matmul batch efficiencies for CPU related reasons).

        Fully connected transformers trivially work (every weight is used for every query). MoE works beyond a certain size or with certain types of mixing (still using every weight, or using a high enough fraction that there's some sharing with batches of 20+ queries). As you push further in that direction, though (lots of techniques, but the key point being accessing less of the model at once and bypassing some of it for each query), you need larger and larger batches for those efficiency gains to materialize. At some point it becomes untenable because of latency waiting for batches of data, and past that it becomes untenable because of the volume of query data.
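
        A rough way to see the sharing argument for MoE (a toy sketch with a hypothetical top-2 router over 64 experts; the router here is just random scores):

          import torch

          num_experts, top_k = 64, 2

          def experts_touched(batch_size: int) -> int:
              # Fake router: each token picks its top-2 experts by score.
              scores = torch.randn(batch_size, num_experts)
              return scores.topk(top_k, dim=-1).indices.unique().numel()

          # With 1 query you load 2 experts' weights to do 2 experts' worth of work.
          # With hundreds of queries nearly every expert is touched, but each expert's
          # weights are reused across several tokens, so memory traffic per token drops.
          for bs in (1, 8, 64, 256):
              print(bs, experts_touched(bs))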

    • By larodi 2025-06-0114:46

      Batching. Yes.

      And one thing it can help with locally is when you rate certain content and want to make sure the model didn’t hallucinate: you toss the same prompt in 3 or 5 times, or… batch_size times. :)

      Curious that batched inference has been there from day one, but it takes a while for people to see/grasp/grok it.

    • By pcwelder 2025-06-0110:505 reply

      > other prompts yours get batched with

      Why would batching lead to variance?

      • By kouteiheika 2025-06-0112:213 reply

        > Why would batching lead to variance?

        Depending on the shape of the data, a slightly different kernel implementation (for e.g. matrix multiplication) will be the fastest choice, and those kernels give slightly different results. There could also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic because they use tricks to go faster).
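
        The underlying mechanism is floating-point non-associativity: different kernels (chosen based on batch shape) reduce in different orders, and the rounding differs. A toy illustration of order-dependence (not any specific vendor kernel):

          import torch

          torch.manual_seed(0)
          x = torch.randn(10_000, dtype=torch.float32)

          sequential = torch.tensor(0.0)
          for v in x:                                   # strict left-to-right reduction
              sequential = sequential + v

          blocked = x.view(100, 100).sum(dim=1).sum()   # blocked / tree-style reduction

          # Usually a tiny but nonzero difference; the exact value depends on the backend.
          print((sequential - blocked).abs().item())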

        • By zxexz 2025-06-0112:52

          Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.

          SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community wide effort to trace down the sources of indeterminism.

          [0] https://docs.sglang.ai/references/faq.html

        • By delusional 2025-06-0119:28

          > not entirely deterministic

          There's a Nobel prize waiting for you if that's the case. I'll assume you meant theoretically consistent or accurate.

        • By bhickey 2025-06-0113:511 reply

          Some of the non-determinism mentioned above manifests as sensitivity to _where_ data falls within a batch.

          • By tough 2025-06-0118:11

            In my experience with other regular models, once the context starts to fill up, quality starts to degrade.

            wouldn't getting batched at the end of a batch have a similar -effect- on the results, where your prompt might receive overall less attention focused on it, if the context window is almost full?

            Idk just going by the vibes

      • By imtringued 2025-06-0113:46

        Attention doesn't get batched, and the runtime of attention for a given user's token depends on the total context length. Hence even in the ideal scenario of you getting a dedicated attention-calculating GPU, the MLP-calculating GPU doing batching will have to wait for the slowest user.

        In the worst case scenario you are sharing a single attention calculating GPU with someone who has a super long context window, then that guy will be hogging most of the memory bandwidth of the GPU, even though you both are generating the same quantity of tokens.

        This means that in the distributed setting, you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts.
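
        Whatever the serving layout, the underlying asymmetry is easy to put rough numbers on: per new token, attention work grows with that request's context length while the feed-forward work is fixed. A back-of-the-envelope sketch (ignores heads, GQA, the QKV/output projections, etc.; the sizes are illustrative):

          hidden, ff_mult = 4096, 4

          def flops_per_new_token(context_len: int) -> tuple[int, int]:
              # Attention: the new token's query is dotted against every cached key,
              # then the values are summed, so cost scales with this request's context.
              attn = 2 * context_len * hidden + 2 * context_len * hidden
              # Feed-forward: two big projections, independent of context length,
              # which is why it batches so cleanly across users.
              ffn = 2 * hidden * (ff_mult * hidden) * 2
              return attn, ffn

          for ctx in (1_000, 10_000, 100_000):
              attn, ffn = flops_per_new_token(ctx)
              print(ctx, f"attention/FFN ratio ~ {attn / ffn:.2f}")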

      • By jerpint 2025-06-0111:291 reply

        Batching can lead to variance with things like batchnorm but most transformers use layer norm to avoid this problem

        • By amelius 2025-06-0113:44

          Batchnorm can only have an effect between batches during training, not inference.
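
          Both points are easy to check directly (a PyTorch sketch): LayerNorm is per-sample no matter what, and BatchNorm only mixes samples when it is using batch statistics, i.e. in training mode.

            import torch
            import torch.nn as nn

            torch.manual_seed(0)
            x = torch.randn(8, 16)            # 8 samples, 16 features

            ln = nn.LayerNorm(16)
            bn = nn.BatchNorm1d(16).train()   # training mode: uses batch statistics

            # LayerNorm: sample 0 comes out the same whether it is alone or in a batch.
            print(torch.allclose(ln(x)[0], ln(x[:1])[0]))   # True

            # BatchNorm in training mode: sample 0's output depends on its batchmates.
            print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # False

            # BatchNorm in eval mode: running statistics, no cross-sample mixing.
            bn.eval()
            print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # True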

      • By Hendrikto 2025-06-0111:132 reply

        Because these models are context-sensitive. Every token can influence the output.

        • By immibis 2025-06-0112:37

          But not the tokens that don't even feed into your output because they're feeding into someone else's output. Separate items in batches don't get mixed up with each other - they just run the model separately on each item at the same time, like SIMD.
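
          That independence is easy to sanity-check for a plain matmul (a sketch, ignoring the tiny numerical wiggle discussed elsewhere in the thread):

            import torch

            torch.manual_seed(0)
            w = torch.randn(512, 512)
            mine = torch.randn(1, 512)       # "my" token state
            others = torch.randn(127, 512)   # 127 other users' token states

            alone = mine @ w
            batched = torch.cat([mine, others]) @ w

            # My row of the output does not depend on what the other 127 rows contain,
            # up to floating-point noise from kernel/shape selection.
            print(torch.allclose(alone[0], batched[0], atol=1e-5))   # True on typical backends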

        • By simianwords 2025-06-0111:181 reply

          I believe they are talking about latency variance. Batching can increase variance because you may have to wait for enough prompts to get to the batch size.

          • By perching_aix 2025-06-0111:221 reply

            No, I meant that the responses will be different run-to-run. [0]

            [0] https://152334h.github.io/blog/non-determinism-in-gpt-4/

            • By exe34 2025-06-0112:511 reply

              Variance based on actual randomness would be one thing, but to me variance based on what other people are running seems concerning, for reasons I can't quite articulate. I don't want the model to reply to a question in one domain based on what a large group of other people are thinking in a different domain (e.g. if they're discussing the news with chatgpt).

              • By zackangelo 2025-06-0115:511 reply

                This definitely happens, and I'm surprised it's not talked about more often. Some attention kernels are more susceptible to this than others (I've found that paged attention is better than just naive attention, for example).

                • By exe34 2025-06-0117:25

                  To be fair, I suppose people do it too - if you ask me a question about A, often as not the answer will be coloured by the fact that I just learnt about B.

      • By empiko 2025-06-0112:56

        In some mixture-of-experts approaches, samples or tokens are being distributed among experts. The experts are selected by trying to predict what is a good expert-sample match. Depending on your neighbors in the batch, you might be assigned different experts.
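
        One hypothetical way this plays out is capacity-limited routing: each expert can only take so many tokens per batch, so a token can get bumped to a lower-choice expert depending on who it shares the batch with. A toy sketch (not any specific MoE implementation):

          import torch

          def route(scores: torch.Tensor, capacity: int) -> list[int]:
              """Greedy routing: each token takes its best-scoring expert that still has room."""
              load = [0] * scores.shape[1]
              assignment = []
              for token_scores in scores:
                  for e in token_scores.argsort(descending=True).tolist():
                      if load[e] < capacity:
                          load[e] += 1
                          assignment.append(e)
                          break
              return assignment

          torch.manual_seed(0)
          my_scores = torch.randn(1, 4)       # my token's affinity for 4 experts
          batchmates_a = torch.randn(7, 4)
          batchmates_b = torch.randn(7, 4)

          # Same token, same scores, different batchmates: the expert my token lands on
          # (the last entry) can change because its preferred experts may already be full.
          print(route(torch.cat([batchmates_a, my_scores]), capacity=2))
          print(route(torch.cat([batchmates_b, my_scores]), capacity=2))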

    • By VectorLock 2025-06-0112:29

      Sounds like an amazing attack vector if your prompts get mixed with other's.

    • By energy123 2025-06-0114:38

      What's the average batch size?

    • By taneq 2025-06-0112:582 reply

      Wow, almost like Deepseek’s impressive performance is the result of optimisation by smart engineers.

      • By perching_aix 2025-06-0113:341 reply

        Not sure why the snarky tone, didn't say or imply otherwise, nor did anyone else in the thread so far that I could see.

        • By taneq 2025-06-031:28

          It wasn't meant to come across that snarky. Sorry about that. :/

      • By draw_down 2025-06-0114:27

        [dead]
