Executing programs inside transformers with exponentially faster inference

2026-03-12 9:17 · percepta.ai

We build a computer inside a transformer — executing arbitrary C programs for millions of steps with exponentially faster inference via 2D attention heads.




Comments

  • By btown 2026-03-13 9:55 · 1 reply

    This seems way cooler than just computation (which is easy to hand off to a tool, and arguably more predictable that way). The broader point here is that you can have your model switch dynamically to/from a kind of attention that scales with the log of the token count, by only exploring the convex hull in a 2D space. A less capable version of attention, to be sure, but one capable of tracing a program’s execution with text representations of registers and stack - which is a meaningful level of flexibility, and one many humans would find difficult to do reliably!
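    A rough sketch of where that log-style scaling can come from, assuming average-hard attention that attends to the single best-scoring key (my illustration, not necessarily the paper's exact mechanism): for 2D keys, the key maximizing q·k always lies on the convex hull of the key set, so a query only needs to score the hull vertices rather than all n keys.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertex indices in CCW order."""
    idx = sorted(range(len(points)), key=lambda i: (points[i][0], points[i][1]))
    def half(order):
        out = []
        for i in order:
            while len(out) >= 2:
                o, a = points[out[-2]], points[out[-1]]
                cross = (a[0] - o[0]) * (points[i][1] - o[1]) \
                      - (a[1] - o[1]) * (points[i][0] - o[0])
                if cross <= 0:      # drop non-left turns
                    out.pop()
                else:
                    break
            out.append(i)
        return out
    lower = half(idx)
    upper = half(idx[::-1])
    return lower[:-1] + upper[:-1]

rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 2))       # 2D keys
values = rng.normal(size=(1000, 4))
hull = convex_hull(keys.tolist())       # O(n log n), computed once

q = rng.normal(size=2)
# Hard attention attends only to the key maximizing q . k; a linear
# functional over a finite point set is maximized at a hull vertex,
# so we score |hull| << n candidates instead of all keys.
best_hull = max(hull, key=lambda i: q @ keys[i])
best_full = int(np.argmax(keys @ q))    # brute-force check over all keys
assert np.isclose(q @ keys[best_hull], q @ keys[best_full])
out = values[best_hull]
```

    With 1,000 Gaussian keys the hull typically has only a few dozen vertices, so the per-query work collapses accordingly.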

    What could you do with an LLM that can go into “focus mode” and generate tokens extremely rapidly? How much more powerful would a reasoning-token-generation phase be that can explore and cull large numbers of paths/hypotheses, so long as they are well defined? Does this have implications for multi-modal models and spatial reasoning?

    As the paper suggests:

    > These models could be useful in several modes: as a dedicated fast path paired with a slower, more general model; as part of a fast/slow hybrid architecture inside a single system; or as a speculative execution model that proposes tokens quickly while a regular-attention model verifies and accepts them. Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
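    The speculative-execution mode from that quote can be sketched with toy deterministic stand-ins (`draft` and `target` below are invented functions, not the paper's models): the fast model proposes a block of tokens, the slow model keeps the longest verified prefix, and the final output is identical to decoding with the slow model alone.

```python
def speculative_decode(draft, target, prompt, n_tokens, k=4):
    """Greedy speculative decoding: the fast draft model proposes k tokens,
    the slow target model verifies them and keeps the matching prefix."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        ctx, proposed = list(seq), []
        for _ in range(k):                   # cheap draft passes
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        accepted = 0
        for i, t in enumerate(proposed):     # one (parallelizable) verify pass
            if target(seq + proposed[:i]) == t:
                accepted += 1
            else:
                break
        seq += proposed[:accepted]
        if accepted < k:
            seq.append(target(seq))          # target fixes the first mismatch
    return seq[len(prompt):][:n_tokens]

# Toy stand-ins: the "target" is exact; the "draft" is right most of the time.
target = lambda ctx: sum(ctx) % 7
draft = lambda ctx: target(ctx) if len(ctx) % 5 else (target(ctx) + 1) % 7

tokens = speculative_decode(draft, target, [1, 2], 10)
```

    The better the draft model, the longer the accepted prefixes and the fewer slow-model invocations per generated token.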

  • By derangedHorse 2026-03-13 16:32 · 2 replies

    I initially agreed with a lot of the sentiment that asks "why," but have reframed my opinion. Instead of seeing this as a way to run programs via inference, I'm now seeing this as a way to bootstrap training. Think about the task of classification. If I have an expert system that classifies correctly 80% of the time, now I can embed it into a model and train the model to try to raise the success rate. The lower we can make the cost of training on various tasks, the better it levels the playing field of who can compete in the AI landscape.

    • By yorwba 2026-03-13 17:24

      The approach here is very bad for training though, because unlike softmax attention, average-hard attention is not differentiable with respect to the keys and queries, and if you try to fix that e.g. with straight-through estimation, the backward pass cannot be sped up in the same way as the forward pass.
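      A tiny numerical illustration of the non-differentiability (taking average-hard attention to mean a uniform average over the argmax-scoring keys, which is my reading): the hard output is locally constant in the query, so finite differences give exactly zero, while softmax attention has a usable gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
keys = rng.normal(size=(8, 2))
values = rng.normal(size=(8, 3))

def soft_attn(q):
    w = np.exp(keys @ q)
    w /= w.sum()
    return w @ values

def hard_attn(q):
    # Average-hard attention: uniform average over argmax-scoring keys.
    s = keys @ q
    return values[np.isclose(s, s.max())].mean(axis=0)

q = rng.normal(size=2)
eps = 1e-5
d_soft = (soft_attn(q + [eps, 0]) - soft_attn(q)) / eps
d_hard = (hard_attn(q + [eps, 0]) - hard_attn(q)) / eps
# Softmax attention moves smoothly with q; hard attention is piecewise
# constant (the argmax set doesn't change locally), so its "gradient" is 0.
# A straight-through estimator would substitute the soft gradient on the
# backward pass - at which point the fast hard-attention structure is gone.
assert np.allclose(d_hard, 0)
assert not np.allclose(d_soft, 0)
```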

    • By refulgentis 2026-03-13 17:55

      Training is ruled out (see peer comment), however you may find this fascinating, somewhat rhymes: https://arxiv.org/abs/2603.10055

  • By teiferer 2026-03-13 16:06 · 4 replies

    Why would that be desirable?

    If we take the human brain as an example, it's pretty bad at computation. Multiplying two 10-digit numbers takes forever, despite the enormous size of its neural network. It's not the right tool for the job - a few deterministic logic gates could do it much more efficiently. That same circuit can't do much else, but multiplying, oh boy, is it good at that! Why do we think that artificial neural nets would be the right tool for that job? What's wrong with letting the LLM reach out to an ALU to do the calculation, just like a human would? It's surely going to be quicker and require less energy.
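    For concreteness, the hand-off described here can be as small as this sketch (the JSON call format and the ALU table are invented for illustration, not any real tool-calling API):

```python
# The model emits a structured tool call instead of computing digit by
# digit; a deterministic "ALU" does the arithmetic exactly.
import json
import operator

ALU = {"mul": operator.mul, "add": operator.add}

def handle(model_output: str) -> str:
    """Dispatch a tool call emitted by the model and return the result
    as text, ready to be fed back into the context."""
    call = json.loads(model_output)
    result = ALU[call["op"]](*call["args"])
    return str(result)

# Instead of multiplying inside its weights, the model emits:
print(handle('{"op": "mul", "args": [8642097531, 1357924680]}'))
```

    The overhead is one round trip through the dispatcher, which is the non-trivial cost pegasus mentions downthread.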

    • By soerxpso 2026-03-13 16:14

      The embedded programs can be connected to the other weights during training, in whatever way the training process finds useful. It doesn't just have to be arithmetic calculation. You can put any hard-coded algorithm in there, make the weights for that algorithm static, and let the training process figure out how to connect the other trillion weights to it.
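      In miniature, with a linear toy model rather than a transformer (my construction, not the paper's): freeze the "embedded program" weight and let gradient descent fit everything around it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

# Pretend w[0] encodes a hard-coded algorithm: fix it and never update it.
w = np.zeros(4)
w[0] = 1.0                                   # "embedded program" weight
trainable = np.array([False, True, True, True])

lr = 0.05
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(X)    # MSE gradient
    w -= lr * grad * trainable               # mask: frozen weights get no update

assert w[0] == 1.0                           # the embedded part never moved
assert np.allclose(w, true_w, atol=1e-2)     # the rest learned around it
```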

    • By OneDeuxTriSeiGo 2026-03-13 16:48

      One of the big appeals of this is that it gives a mechanism for "teaching" models a geometric intuition and better spatial reasoning.

      Not necessarily for pure number crunching, but for the boundary between rote algorithms and the fuzzy, intuition-based reasoning that humans in particular excel at.

    • By pegasus 2026-03-13 16:20 · 1 reply

      > Why would that be desirable?

      If we never try, we'll never know. I wouldn't be surprised if there is something to gain from a form of deterministic computation which is still integrated with the NN architecture. After all, tool calls have their own non-trivial overhead.

      • By teiferer 2026-03-13 16:22

        Trying, sure. That's what hackers do.

        I'm asking whether it's a desirable end state.

    • By itigges22 2026-03-13 18:44

      [flagged]

HackerNews