I don't know how you get here from “predict the next word”

2026-02-26 · www.grumpy-economist.com

I recently tried refine, an AI tool for refining academic articles, developed by Yann Calvó López and Ben Golub. I sent it the current draft of my booklet on inflation, to see what it can offer. I just used it once so far, with the free trial mode. I will be a regular user forever.

The results are stunning. The comments it offered were on a par with the best comments I’ve received on a paper in my entire academic career. And more concise and organized than the best ones. They aren’t perfect, but the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”

Here are its main comments:

Operationalizing the “fiscal news” narrative

The core episode claim in Ch. 2.3 is that inflation “started” when agents learned in early 2021 that a material share of deficits would not be repaid, and “ended” in mid-2022 when expectations shifted regarding future unfunded programs. As currently written, the evidentiary support risks circularity: the text leans on the occurrence of inflation to infer that debt was perceived as “unbacked,” while offering mostly qualitative narrative signals.

To carry the identification burden, this narrative requires operationalizing. Ch. 4 explicitly claims recent history offers unusually clean “experiments.” To substantiate this, the text needs a tighter set of dated, externally visible observables—specific legislative hurdles, “debt doesn’t matter” rhetoric, or election probabilities—tied to sign and timing predictions for inflation expectations and long-bond prices. Without pinning the “news” to specific dates distinct from the inflation series itself, the episode reads as a post hoc overlay rather than a discriminating application of the FTPL mechanism.

The program found the core episode and summarized the main argument of an 80-page paper. I did write a disclaimer that finding separate measures of the present value of surpluses is not possible, and that my goal was to establish plausibility rather than proof. But I was lazy, and the program suggests the obvious way in which I should tighten up the presentation, within the realm of the possible. Back to work.

Clarifying the fiscal regime distinction in the “completeness” claim

The manuscript repeatedly asserts completeness, noting in Ch. 3 (“Now, stop and admire what we have accomplished. We have a full, complete… theory…”) and Ch. 4 (“the only known, complete, economic theory… consistent with current institutions”). However, Ch. 4.5 also correctly notes that “All of the equations… are in fact the same” between FTPL and New Keynesian (NK) models.

For the academic audience, the claim that the standard NK model “Doesn’t Work” (Ch. 6.1) or is incomplete creates immediate friction unless the distinction regarding regime closure is explicit early on. The substantive disagreement is not the presence or absence of the valuation equation, but the fiscal regime: does the equation determine prices (FTPL), or does it determine future surpluses via a “passive” fiscal assumption (NK)? Reframing the critique would strengthen the argument. Rather than stating the NK model fails on internal logic, the text should argue it requires an empirically untenable assumption of passive fiscal behavior during the 2020–2022 episode.

The point is fair, and a good case of how the program improves writing. There is a tension between the observational equivalence theorem and my claim that recent experiments tell us which model is right. The answer is in the paper — observational equivalence is only about observed time series, and we can use what we know about how central banks behave from other sources to think about what regime is plausible. Central banks don’t threaten explosive behavior to select equilibria. But that’s spread out in the paper. I didn’t really stress the causal interpretation of the valuation equation issue — does the valuation equation express how inflation causes surpluses, or how surpluses cause inflation? (Like the 50-year-old debate over whether MV=PY expresses how money supply causes inflation or how inflation causes money demand accommodated by supply.) The model picked this up on its own, likely by being trained on previous FTPL controversies. Back to work.

Resolving ambiguity in the transmission mechanism

There is a tension in how the text describes the transmission of interest rate hikes. In Ch. 3.1 and Fig. 3.2, the argument is that rate hikes can produce short-run disinflation by lowering the nominal market value of outstanding bonds (the numerator in the valuation equation). However, Ch. 2.2 and Ch. 4.5 emphasize that discount rates and interest costs are central to the present value of surpluses (the denominator).

A rise in real rates that compresses bond prices also generically lowers the present value of surpluses via discounting or raises interest costs. This can push the price level upward, creating a “stepping on a rake” effect where the central bank shifts inflation across time. The policy conclusions (“should promptly raise rates,” Ch. 3.1–3.2) need to be reconciled with this mechanism. The text should explicitly partition what the central bank can achieve alone (via the valuation effect on the numerator) versus what requires fiscal adjustment, and clarify when the disinflationary channel is generic versus conditional on duration.

Here too, the model caught my habit of introducing an idea and then later refining it. The long-term debt mechanism is a way that higher future interest rates can result in lower inflation today. And the typical persistent shock links higher current and future interest rates. But sticky prices work against that mechanism. With sticky prices, higher nominal rates mean higher real rates, which mean higher interest costs on the debt, which raise inflation. I sort of dribbled that out in several places. This may or may not be a good idea. I don’t like to overwhelm readers right away. But the program makes me think about that choice.
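
(For readers not steeped in FTPL: the object in dispute is the valuation equation, which in one standard discrete-time form (my notation, not necessarily the booklet’s) reads

    P_t = \frac{\sum_j Q_t^{(j)} B_{t-1}^{(j)}}{E_t \sum_{k=0}^{\infty} \beta^k s_{t+k}},

with the nominal market value of outstanding debt on top, the “numerator,” and the expected present value of real primary surpluses below, the “denominator.”)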

And more, which I won’t bore you with.

It also found algebra errors, such as a negative sign that slipped in the solution of a differential equation.

And that’s the free mode! I’ll rush back in paid mode (much more extensive) after I finish incorporating these comments.

****

This is the first time I’ve seen AI at work in something I do daily, and it really is revolutionary.

Refereeing and evaluating papers is one of the more unpleasant and time-consuming tasks in our profession. I’ve read a lot of referee reports in my 40 years as an economist, and this is top 5% for sure. Most referee reports do not identify the major point of the paper, and do not assess whether the paper backs up that point. They do not notice glaring gaps of logic, basic theorems violated, econometrics 101 advice ignored. Editors are lucky if one out of three reports is vaguely useful. Clearly, this task is going to be radically impacted by AI. If I were an editor, I’d feed every paper to refine on receipt. Or I would require the author to spend the $50 and send in the last refine report! I will surely get refine’s opinion before any referee report I write in the future.

Will all the referees be out of jobs? No! You still have to read and evaluate what refine offers. But the speed, accuracy, and quality of reports will jump. And economists will save a lot of time.

Before anyone asks me for comments on a paper, I’m going to ask “did you submit it to refine?” That will save a lot of time too.

And it should produce much better written papers, which will also save a lot of time.

It does feel weird writing defensive prose to an AI program, as I do to the thoughtful humans who offer me comments. The last comment brings this up:

Strengthening the discrimination against the Monetarist alternative

Ch. 4.2’s flagship discrimination claim is that Monetarism predicts QE should be inflationarily equivalent to helicopter transfers, and that the observed contrast falsifies Monetarism. However, the manuscript emphasizes the institutional reality of Interest on Reserves (IOR) and “ample reserves.” A sophisticated “money view” would invoke precisely these facts to argue that reserves and T-bills are near-perfect substitutes under IOR, making QE neutral while transfers are not.

If the Monetarist benchmark—invoked here to be knocked down—is not the strongest version consistent with the institutional realism the book champions, the adjudication in Ch. 4 loses credibility. To persuade the readers most likely to scrutinize this section, the text should address why the FTPL explanation dominates even a sophisticated Monetarist view that accounts for IOR, rather than only the primitive version that ignores it.

Yes, I did leave out the liquidity value of treasury debt. But is there really a serious school of thought that thinks treasury debt is a perfect substitute for money, so that we should think of the whole stock of treasury debt as the monetary base? That would mean open market operations are completely irrelevant, undoing a lot of standard monetarism. And the famous “stability” of the M2-nominal GDP ratio is totally absent for total debt, which has fluctuated from 25% of GDP in the 1970s to 100% now. Velocity shocks? Perhaps the paid version will send me some cites so I know if I really have to deal with it.

But one way or another it is astonishing that a computer program came up with this logical possibility, which is at least a worthy whatabout seminar comment from the back of the room.

In the meantime, I also tried Claude to update some graphs. My prompt was just “write a matlab program that fetches data series xyz from Fred using the API, and make a graph that..” with a pretty detailed description of the graph. It ran right out of the box, even doing a decent job of “put text labels on the graph in a way that doesn’t conflict with the plotted time series.” Claude did not do a good job of finding which Fred data series would work, but that was a small task. And it produced code using a lot of commands I don’t recognize. Making sure programs do what you think they do will be a new challenge. It went on and did things I didn’t ask for, like offering summary statistics! Still, an hour job took 5 minutes.
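
For the curious, the flavor of such a program is roughly the following, in a Python rather than MATLAB sketch of my own. The series, dates, and styling are placeholders, not the code Claude actually produced, and you need your own FRED API key.

    import requests
    import matplotlib.pyplot as plt

    API_KEY = "your_fred_api_key"   # placeholder
    URL = "https://api.stlouisfed.org/fred/series/observations"
    params = {"series_id": "CPIAUCSL",            # placeholder series
              "api_key": API_KEY,
              "file_type": "json",
              "observation_start": "2019-01-01"}

    # Fetch the series, dropping FRED's "." markers for missing observations
    obs = [o for o in requests.get(URL, params=params).json()["observations"]
           if o["value"] != "."]
    dates = [o["date"] for o in obs]
    values = [float(o["value"]) for o in obs]

    plt.plot(dates, values)
    plt.title("CPI, all urban consumers (FRED: CPIAUCSL)")
    plt.show()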

This is all old news to most of my colleagues, who are integrating AI into workflows with great speed. But if you’re not using these tools, the time to start is now.

(I used no AI to write this substack, and have not done so in the past. I’ll acknowledge it when I do, as I included refine in the thanks of the inflation booklet.)

Update

On reflection I have started to worry again. In 10 to 20 years nobody will read anything any more; they will just read LLM digests. So the single most important task of a writer, starting right now, is to get your efforts wired in to the LLMs. Nothing you write will matter if it is not quickly absorbed into the training dataset. As the art of pushing your results to the top of the Google search was the 1990s game, getting your ideas into the LLMs is today’s. Refine is no different. It’s so good, everyone will use it. So whether refine and its cousins take an FTPL or new Keynesian view in evaluating papers is now all-determining for where the consensus of the profession goes.

Communicating with a journal editor, I can see the usefulness of a version tailored for evaluating papers. For example, “did the author incorporate and respond to the referee’s comments?” “do the referee’s comments make any sense?” “which of the referee’s comments address the paper’s correctness or importance, and which are merely suggestions for further work?” and so forth. However, capture of the LLM still strikes me as a potential problem, either by a wing —economists love nothing more than a methodological fight, “this paper uses outdated structural/reduced form methodology,” “this paper ignores important behavioral/general equilibrium analysis” etc — or by the “settled science” — imagine the LLM reaction in recent years to climate, gender, public health, inequality, etc. papers if trained on the “consensus.” I should test refine on some controversial topics. Also how it does on the bullshit benchmark, a really important problem in economics academia. We need a quantitative bullshit evaluation — bs wrapped up in fancy equations.

Comments

  • By wavemode 2026-02-266:203 reply

    > the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”

    You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

    All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, and it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).

    It's neat. I wonder about what other kinds of tasks we could improve AI performance at by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).
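
    If I had to guess at the shape of the pipeline, it is something like this sketch, where call_llm() is a hypothetical stand-in for whatever model client they actually use:

        def call_llm(prompt: str) -> str:
            """Hypothetical stand-in for an actual model client."""
            raise NotImplementedError

        def chunks(text: str, size: int = 4000, overlap: int = 500) -> list[str]:
            # Overlapping windows so arguments spanning section breaks aren't lost
            step = size - overlap
            return [text[i:i + size] for i in range(0, len(text), step)]

        def careful_review(manuscript: str) -> list[str]:
            notes = []
            for i, section in enumerate(chunks(manuscript)):
                notes.append(call_llm(
                    f"You are a referee. Here is section {i} of a manuscript. "
                    f"List substantive problems only, with locations:\n\n{section}"))
            return notes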

    • By jjmarr 2026-02-267:043 reply

      I created a code review pipeline at work with a similar tradeoff and we found the cost is worth it. Time is a non-issue.

      We could run Claude on our code and call it a day, but we have hundreds of style, safety, etc rules on a very large C++ codebase with intricate behaviour (cooperative multitasking be fun).

      So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

      "scaling time" on the other hand is useless. You can just divide the problem with subagents until it's time within a few minutes because that also increases quality due to less context/more focus.

      • By aktau 2026-02-2613:393 reply

        Any LLM-based code review tooling I've tried has been lackluster (most comments not too helpful). Prose review is usually better.

        > So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

        Sure, you could make multiple LLM invocations (different temperature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who summarizes the approach?

        [1]: I suppose you could shard that out for as much compute you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.
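
        To make [1] concrete, a minimal sketch, with a hypothetical call_llm() stand-in rather than any particular vendor's client:

            from concurrent.futures import ThreadPoolExecutor

            def call_llm(prompt: str) -> str:
                """Hypothetical stand-in for a real model client."""
                raise NotImplementedError

            def one_review(diff: str, i: int) -> str:
                return call_llm(f"Reviewer #{i}: list concrete problems in this diff:\n{diff}")

            def judged_review(diff: str, n: int = 10) -> str:
                with ThreadPoolExecutor(max_workers=n) as pool:
                    drafts = list(pool.map(lambda i: one_review(diff, i), range(n)))
                # Second pass judges/collates: keep only comments supported by the diff
                return call_llm("Merge these draft reviews, dropping anything not clearly "
                                "supported by the diff:\n\n" + "\n---\n".join(drafts))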

        • By DustinKlent 2026-02-2614:04

          I have attempted to replicate the "workflow" LLM process where several LLMs come up with different variations of a way to solve a problem, a "judge" LLM reviews them, and they go through different verification processes, to see if this workflow increased the accuracy of the LLM's ability to solve the problem. For me, in my experiments, it didn't really make much difference, but at the time I was using LLMs significantly dumber than current frontier models. HOWEVER... when I enable "Thinking Mode" on frontier LLMs like ChatGPT, it DOES tend to solve problems that the non-thinking mode isn't able to solve, so perhaps it's just a matter of throwing enough iterations at it for the LLM to be able to solve a particular complex problem.

        • By ivansavz 2026-02-2614:53

          > But how does one separate the good comments from the bad comments?

          One thing that works very well for me (in a different context) is to ask to return two lists:

          - Things that I must absolutely fix (bugs, typos, logic mistakes, etc.)

          - Lesser fixes and other stylistic improvements

          Then I look only at the first list.
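
          As a concrete prompt skeleton (the wording here is just an example, adjust to taste):

              draft_text = "..."  # the document under review

              prompt = (
                  "Review the attached draft and return exactly two lists.\n"
                  "1. MUST FIX: bugs, typos, logic mistakes.\n"
                  "2. OPTIONAL: lesser fixes and other stylistic improvements.\n\n"
                  + draft_text
              )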

        • By jjmarr 2026-02-2618:21

          You need human alignment on what constitutes a "good" comment. That means consistent rules.

          Otherwise, some people feel review is too harsh, other people feel it is not harsh enough. AI does not fix inconsistent expectations.

          > But how does one separate the good comments from the bad comments?

          If the AI took a valid interpretation of the coding guidelines, it is a legitimate comment. If the AI is being overly pedantic, it is a documentation bug and we change the rules.

      • By smallpipe 2026-02-268:461 reply

        > This has completely replaced human code review for anything that isn't functional correctness

        Isn’t functional correctness pretty much the only thing that matters though?

        • By grey-area 2026-02-2610:111 reply

          Well no, style is important too for humans when they read a codebase, so the LLMs the parent is running clearly have some value for them.

          They're not claiming LLMs solved every problem, just that they made life easier by taking care of busywork that humans would otherwise be doing. I think personally this is quite a good use for them - offering suggestions on PRs say, as long as humans still review them as well.

          • By 1718627440 2026-02-2611:041 reply

            But isn't style already achievable by running e.g. GNU indent?

            • By jjmarr 2026-02-2618:26

              Some examples of complex transformations linters can't catch:

              * Function names must start with a verb.

              * Use standard algorithms instead of for loops.

              * Refactor your code to use IIFEs to make variables constexpr.

              The verb one is the best example. Since we work adjacent to hardware, people like creating functions on structs representing register state called "REGISTER_XYZ_FIELD_BIT_1()" and you can't tell if this gets the value of the first field bit or sets something called field bit to 1.

              If you rename it to `getRegisterXyzFieldBit1()` or `setRegisterXyzFieldBitTo1()` at least it becomes clear what they're doing.

    • By Kim_Bruning 2026-02-268:082 reply

      > You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

      I made a cursed CPU in the game 'Turing Complete'; and had an older version of claude build me an assembler for it?

      Good luck finding THAT in the training data. :-P

      (just to be sure, I then had it write actual programs in that new assembly language)

      • By withinboredom 2026-02-268:321 reply

        But the ideas are not 'new'. A benchmark that I use to tell me if an AI is overfitted is to present the AI with a recent paper (especially one like a paxos variant) and have it build that. If it writes general paxos instead of what the paper specified, it's overfitted.

        Claude 4.5: not overfitted too much -- does the right thing 6/10 times.

        Claude 4.6: overfitted -- does the right thing 2/10 times.

        OpenAI 5.3: overfitted -- does the right thing 3/10 times.

        These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.

        My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.

        • By Kim_Bruning 2026-02-269:331 reply

          Could also be that the model has stronger priors wrt Paxos (and thus has Opinions on what good Paxos should look like)

          At any rate, with an assembler, you end up with a lot of random letter-salad mnemonics with odd use cases, so that is very likely to tokenize in interesting ways at the very least.

          • By withinboredom 2026-02-2610:40

            I was just using paxos as an example. Any paper will do.

    • By selridge 2026-02-266:272 reply

      >You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data.

      This is just as stuck in a moment in time as "they only do next word prediction." What does this even mean anymore? Are we supposed to believe that a review of this paper, which didn't exist when that model (it's putatively not an "LLM", but IDK enough about it to be pushy there) was trained, was somehow sitting in the training data? Does that even make sense? We're not in the regime of regurgitating training data (if we really ever were). We need to let go of these frames, which were barely true when they took hold. Some new shit is afoot.

      • By wavemode 2026-02-266:347 reply

        Statistical models generalize. If you train a model on f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

        Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since, the author himself admits that he doesn't agree with all the LLM's comments.)

        I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.
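
        The toy version of that f(x) example is a couple of lines of numpy; the fitted line returns 7 for an input it never saw:

            import numpy as np

            x_train = np.array([0.0, 1.0])   # inputs for "f(x)" and "f(x+1)"
            y_train = np.array([5.0, 6.0])   # observed outputs 5 and 6

            slope, intercept = np.polyfit(x_train, y_train, deg=1)
            print(slope * 2.0 + intercept)   # f(x+2) -> 7.0, a value absent from training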

        • By kristiandupont 2026-02-266:531 reply

          I had Claude help me get a program written for Linux to compile on macOS. The program is written in a programming language the author invented for the project, a pretty unusual one (for example, it allows spaces in variable names).

          Claude figured out how the language worked and debugged segfaults until the compiler compiled, and then until the program did. That might not be magic, but it shows a level of sophistication where referring to “statistics” is about as meaningful as describing a person as the statistics of electrical impulses between neurons.

          • By compass_copium 2026-02-267:012 reply

            But the programming language has explicitly laid out rules. It was not trained on those sets of rules, but it was trained on many trillions of lines of code. It has a map of how programs work, and an explanation of this new language. It's using training data and data it's fed to generate that result.

            • By selridge 2026-02-267:093 reply

              What doesn't that explain tho?

              What behavior would you need to see for that explanation to no longer hold? Because it seems like it explains too much.

              • By BobaFloutist 2026-02-2617:351 reply

                I don't know how you'd prompt this, but if there was a clean example of an A.I. coming up with an idea that's completely novel in more than details, it would be compelling evidence that these next-token predictors have some weird emergent properties that don't necessarily follow from intricate, sophisticated webs of token-prediction.

                E.g. "What might be a room-temperature superconductor" -> "some plausible iteration on existing high-temperature superconductors based on our current understanding of the underlying physics" would not be outside how we currently understand them.

                "What might be a room-temperature superconductor?" -> "some completely outlandish material that nobody has studied before and, when examined, seems to have higher temperature superconducting than we would predict" would provoke some serious questions.

                A fun experiment I've heard suggested is training a model on all scientific understanding just up to some counterintuitive quantum leap in scientific understanding, say, Einstein's theory of relativity, and then seeing if you can prompt it to "discover" or "invent" said leap, without explicitly telling it what to look for. This would of course be pretty hard to prove, but if you could get it to work on a local model, publish the training set and parameters so that anyone can replicate it on their own machine, that could be pretty darn compelling.

                • By selridge 2026-02-2618:42

                  Why would it matter whether or not the robot looks something up if it makes a novel discovery?

                  Why would it matter that the discovery wasn't just novel but felt like an unconventional one to me, someone who is probably a total outsider to that field?

                  Both of those feel subjective or at least hard to sustain.

                  Look. What I'm trying to tell people is that the easy explanations for how these models worked circa GPT-2 is just not cutting it anymore. Neither is setting some subjective and needlessly high bar for...what exactly? What? Do we decide to pay attention to AI after it does all the above? That seems a bit late to the party for cheering on or resisting it.

                  Some new shit is afoot. Folk need to pay attention, not think they got it figured out already.

              • By compass_copium 2026-02-2618:081 reply

                Programs are fundamentally lists of instructions. LLMs are very good at building these lists. That it performs well when you say "Build a list you've seen before, but do it in a slightly different way this time. Here's the exact way I want you to do it." is not surprising. I would honestly be surprised if it couldn't do it.

                As the other commenter suggested, a genuinely novel scientific idea would be surprising. A new style of art (think Picasso or Pollock coming along), not just an iteration on Ghibli, would be surprising. That's actual creativity.

                • By selridge 2026-02-2618:23

                  >I would honestly be surprised if it couldn't do it.

                  You'd be surprised if an LLM couldn't write *any* program?

            • By orf 2026-02-268:12

              That’s still over-general to the point of being useless.

              What you wrote would apply to a human approaching this task as well, sans the “many trillion lines of code”.

        • By c22 2026-02-266:491 reply

          > If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

          This is an interesting claim to me. Are there any models that exist that have been trained with a (single digit) number omitted from the training data?

          If such a model does exist, how does it represent the answer? (What symbol does it use for the '7'?)

          • By wavemode 2026-02-266:531 reply

            When I say "model" here I'm referring to any statistical model (in this example, probably linear regression). Not specifically large language models / neural networks.

            • By c22 2026-02-266:592 reply

              Gotcha, I don't think I know enough about it. What constitutes training data for a (non neural network) statistical model? Is this something I could play around with myself with pen and paper?

              • By nairboon 2026-02-268:06

                Just the raw numbers? You list the y's and the x's and the model is approximating y=f(x) from the above example. You can totally do it with pen and paper. This is what it'd look like (for linear regression): https://observablehq.com/@yizhe-ang/interactive-visualizatio...

              • By heavyset_go 2026-02-268:01

                You can write an f(x) and record the input and output and that can be your training data. Or just download some time-series data or something.

        • By Kim_Bruning 2026-02-267:27

          If you run an LLM in an autoregressive loop you can get it to emulate a turing machine though. That sort of changes the complexity class of the system just a touch. 'Just predicts the next word' hits different when the loop is doing general computation.

          Took me a bit of messing around, but try to write out each state sequentially, with a check step between each.

        • By ainch 2026-02-269:00

          Sorry but this is famously not true! There is no guarantee that statistical models generalise. In your example, whether or not your model generalises depends entirely on what f(x) you use - depending on the complexity of your function class f(x+2) could be 7, 8, or -500.

          One of the surprises of deep learning is that it can, sometimes, defy prior statistical learning theory to generalise, but this is still poorly understood. Concepts like grokking, double descent, and the implicit bias of gradient descent are driving a lot of new research into the underlying dynamics of deep learning. But I'd say it is pretty ahistoric to claim that this is obvious or trivial - decades of work studied "overfitting" and related problems where statistical models fail to generalise or even interpolate within the support of their training data.

        • By arkh 2026-02-268:101 reply

          I expected (and still expect) a lot from LLM with cross disciplinary research.

          I think they should be the perfect tool to find methods or results in a field which look like it could be used in another field.

          • By WithinReason 2026-02-268:35

            This might actually be a limitation of the "predict next word" approach since the network is never trained to predict a result in one field from a result in another. It might still make the connection though, but not as easily.

        • By red75prime 2026-02-266:511 reply

          I think the relevant question is: can a statistical model (or a transformer, in particular) generalize to general reasoning ability?

        • By selridge 2026-02-266:451 reply

          Ok cool cool. Instead of pretending you need to teach me, you could engage with what I'm saying or even the OP!

          "I don't know how you get here from "predict the next word"" is not really so much a statement of ignorance where someone needs you to step in but a reflection that perhaps the tech is not so easily explained as that. No magic needs to be present for that to be the case.

          • By wavemode 2026-02-267:221 reply

            If you disagree with someone on the internet, you can just say "I disagree, and here's why". You don't have to aggressively accuse them of "not engaging" with the text.

            I engaged. You just don't like what I wrote. That's okay.

      • By anon7725 2026-02-266:431 reply

        “Represented in the training data” does not mean “represented as a whole in the training data”. If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space.

        • By selridge 2026-02-266:521 reply

          Yes. I’m saying that “it’s just in the training data” is a cognitive containment of these models which is incomplete. You can insist that’s what’s happening, but you’ll be left unable to explain what’s going on beyond truisms.

            • By selridge 2026-02-2618:34

              >"If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space."

              This statement (The one I was replying to) is fundamentally unbounded. There's nothing that can't be explained as a combination of "A" and "B" in "training data" because practically speaking we can express anything as such where the combination only needs to be convex along some high-dimensional semantic surface. Add on to that my scare quotes around "training data" because very few people have any practical idea of what is or isn't in there, so we can just make claims strategically. Do we need to explain a success? It was in the training data. A failure, probably not in the training data. Will anyone call us on this transparent farce? Not usually, no.

              If a statement can--at will--explain everything and nothing, what's it worth?

  • By sasjaws 2026-02-267:114 reply

    A while ago I did the nanogpt tutorial. I went through some math with pen and paper and noticed the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical.

    That was a bit of a shock to me, so I wanted to share this thought. Basically I think it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.

    Hope this is useful to someone.

    • By 317070 2026-02-267:422 reply

      As an expert in the field: this is exactly right.

      LLMs are trained to do whole book prediction; at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.

      • By justinator 2026-02-268:014 reply

        where do you get these books?

        honking intensifies

        WHERE DO YOU GET THESE BOOKS?!

        • By tasuki 2026-02-268:13

          The local library.

        • By benterix 2026-02-268:23

          We do things, but it doesn't feel right

        • By fc417fc802 2026-02-268:14

          Can anyone even say what a book really is at the end of the day? It's such an abstract concept. /s

      • By TuringTest 2026-02-268:082 reply

        Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?

        • By 317070 2026-02-268:31

          There are many ways to model how the model works in simpler terms. Next-word prediction is useful to characterize how you do inference with the model. Maximizing mutual information, compressing, gradient descent, ... are all useful characterisations of the training process.

          But as stated above, next token prediction is a misleading frame for the training process. While the sampling is indeed happening 1 token at a time, due to the training process, much more is going on in the latent space where the model has its internal stream of information.

        • By margalabargala 2026-02-2618:20

          Everything is the same as everything else. It's all just hydrogen and time mixed together.

    • By apexalpha 2026-02-269:461 reply

      Are you referring to this one?: https://github.com/karpathy/build-nanogpt

      • By sasjaws 2026-02-288:47

        Thats the one, lots of fun and a great entrypoint for experimentation.

    • By croon 2026-02-268:542 reply

      Isn't that why noise was introduced (seed rolling/temperature/high p/low p/etc)? I mean it is still deterministic given the same parameters.

      But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion.

      Not suggesting you did.

      • By throw310822 2026-02-269:021 reply

        > this might be misleadingly interpreted as an LLM having "thought out an answer"

        I'm convinced that that is exactly what happens. Anthropic confirms it:

        "Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so."

        https://www.anthropic.com/research/tracing-thoughts-language...

        • By sasjaws 2026-02-289:161 reply

          This is about reasoning tokens right? I didnt mean that, nanogpt doesnt do that. Nanogpt inference just outputs letters directly, no intermediate tokens.

          • By throw310822 2026-03-018:09

            No, this is about normal tokens. While a SOTA LLM outputs a token at a time, it already has a high level plan of what it is going to say many tokens ahead. This is in reply to the GP who thinks that an LLM can somehow produce coherent and thoughtful sentences while never seeing more than one token ahead.

      • By sasjaws 2026-02-289:11

        That's actually an interesting way to look at it. However, I just posted that because I often see articles expressing amazement at how training an LLM at next token prediction can take it so far, seemingly contrasting the simplicity of the training task with the complexity of the outcome. The insight is that the training task was in fact 'predict the next book', just as much as it is 'predict the next token'. So every time I see that 'predict the next token' representation of the training task it rubs me the wrong way. It's not wrong, but misleading.

        I didn't mean to suggest that is how it 'thinks ahead' but I believe you can see it like that in a way. Because it has been trained to 'predict all the following tokens'. So it learned to guess the end of a phrase just as much as the beginning. I consider the mechanism of feeding each output token back in to be an implementation detail that distracts from what it actually learned to do.

        I hope this makes sense. FYI, I'm no expert in any way, just dabbling.

    • By sputknick 2026-02-267:153 reply

      I'd like to explore this idea, did you make a blog post about it? is it simple enough to post in the reply?

      • By sasjaws 2026-02-289:31

        No blog post; my LLM expert friend told me this was kinda obvious when I shared it with him, so I didn't think it was worth it.

        I can tell you how I got there: I did nanogpt, then tried to be smart and train a model with a loss function that targets the 2 next tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training.

        Sibling commenter also mentions:

        > the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation."

        Hope that helps.

      • By WithinReason 2026-02-268:29

        Look up attention masks

      • By krackers 2026-02-280:111 reply

        Unless I've misunderstood the math myself, I don't think GPs comment is quite right if taken literally since "predict the next 2 tokens" would literally mean predict index t+1, t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction and not classic LLM autoregressive training.

        Instead what GP likely means is the observation that the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground-truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.

        Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due to the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer, but _all_ of the final-layer residuals for previous tokens will encode predictions for their following index.

        So attention on a block of text doesn't give you just the "next token prediction" but the simultaneous predictions for each prefix which makes training quite nice. You can just dump in a bunch of text and it's like you trained for the "next token" objective on all its prefixes. (This is convenient for training, but wasted work for inference which is what leads to KV caching).

        Many people also know by now that attention is "quadratic" in nature (hidden state of token i attends to states of tokens 1...i-1), but they don't fully grasp the implication that even though this means for forward inference you only predict the "next token", for backward training this means that error for token i can backpropagate to tokens 1...i-1. This is despite the causal masking, since token 1 doesn't attend to token i directly but the hidden state of token 1 is involved in the computation of the residual stream for token i.

        When it comes to the statement

        >it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.

        You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of the ground truth sequence, but this is not the same as maximizing the probability that the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens is not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful, since rollouts do in fact involve sampling, so you provide rewards at the "sampled sequence" level, which mirrors how you do inference.

        It would be right to say that they're trained to ensure the most likely next book is assigned the highest joint probability (not just the most likely next token is assigned highest probability).
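
        A toy numpy check of that bookkeeping, with made-up per-position distributions over a 3-token vocabulary:

            import numpy as np

            # P(next token | prefix) for the prefixes <start>, "a", "a b" (each row sums to 1)
            p_next = [np.array([0.5, 0.3, 0.2]),   # P(. | <start>)
                      np.array([0.1, 0.7, 0.2]),   # P(. | a)
                      np.array([0.2, 0.2, 0.6])]   # P(. | a, b)
            seq = [0, 1, 2]                        # ground-truth token ids for "a b c"

            per_position_loss = [-np.log(p[t]) for p, t in zip(p_next, seq)]
            joint = np.prod([p[t] for p, t in zip(p_next, seq)])

            print(sum(per_position_loss))   # sum of next-token cross-entropies ...
            print(-np.log(joint))           # ... equals the NLL of the whole sequence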

        • By sasjaws 2026-02-289:49

          The idea I tried to express was purely the loss function thing you mentioned, and how both tasks (1 vs 2 vs n) lead to identical training runs. At least with nanogpt. I don't know if that extrapolates well to current LLM internals and current training.

  • By pushedx 2026-02-265:482 reply

    Yes, most people (including myself) do not understand how modern LLMs work (especially if we consider the most recent architectural and training improvements).

    There's the 3b1b video series which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the first models that we interacted with.

    The novel insights that these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.

    • By auraham 2026-02-266:081 reply

      I highly recommend Build a Large Language Model (From Scratch) [1] by Sebastian Raschka. It provides a clear explanation of the building blocks used in the first versions of ChatGPT (GPT-2, if I recall correctly). The output of the model is a huge vector of n elements, where n is the number of tokens in the vocabulary. We use that huge vector as a probability distribution to sample the next token given an input sequence (i.e., a prompt). Under the hood, the model has several building blocks like tokenization, skip connections, self attention, masking, etc. The author does a great job of explaining all the concepts. It is very useful for understanding how LLMs work.

      [1] https://www.manning.com/books/build-a-large-language-model-f...
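
      In miniature, that last sampling step looks roughly like this (a toy 4-token vocabulary with made-up logits):

          import numpy as np

          rng = np.random.default_rng(0)

          logits = np.array([2.0, 1.0, 0.1, -1.0])        # one raw score per vocabulary token
          probs = np.exp(logits) / np.exp(logits).sum()   # softmax turns scores into a distribution
          next_token = rng.choice(len(probs), p=probs)    # sample the next token id from it
          print(next_token, probs)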

    • By measurablefunc 2026-02-265:581 reply

      What's the latest novel insight you have encountered?

      • By brookst 2026-02-266:362 reply

        Not the person you asked, and “novel” is a minefield. What’s the last novel anything, in the sense you can’t trace a precursor or reference?

        But.. I recently had a LLM suggest an approach to negative mold-making that was novel to me. Long story, but basically isolating the gross geometry and using NURBS booleans for that, plus mesh addition/subtraction for details.

        I’m sure there’s prior art out there, but that’s true for pretty much everything.

        • By measurablefunc 2026-02-266:434 reply

          I don't know, that's why I asked b/c I always see a lot of empty platitudes when it comes to LLM praise so I'm curious to see if people can actually back up their claims.

          I haven't done any 3D modeling so I'll take your word for it but I can tell you that I am working on a very simple interpreter & bytecode compiler for a subset of Erlang & I have yet to see anything novel or even useful from any of the coding assistants. One might naively think that there is enough literature on interpreters & compilers for coding agents to pretty much accomplish the task in one go but that's not what happens in practice.

          • By brookst 2026-02-267:081 reply

            It’s taken me a while to get good at using them.

            My advice: ask for more than what you think it can do. #1 mistake is failing to give enough context about goals, constraints, priorities.

            Don’t ask “complete this one small task”, ask “hey I’m working on this big project, docs are here, source is there, I’m not sure how to do that, come up with a plan”

            • By measurablefunc 2026-02-267:22

              The specification is linked in another comment in this thread & you can decide whether it is ambitious enough or not but what I can tell you is that none of the existing coding agents can complete the task even w/ all the details. If you do try it you will eventually get something that will mostly work on simple tests but fail miserably on slightly more complicated test cases.

          • By pushedx 2026-02-266:532 reply

            Which agents are you using, and are you using them in an agent mode (Codex, Claude Code etc.)?

            The difference in quality of output between Claude Sonnet and Claude Opus is around an order of magnitude.

            The results that you can get from agent mode vs using a chat bot are around two orders of magnitude.

            • By measurablefunc 2026-02-267:122 reply

              The workflow is not the issue. You are welcome to try the same challenge yourself if you want. Extra test cases (https://drive.proton.me/urls/6Z6557R2WG#n83c6DP6mDfc) & specification (https://claude.ai/public/artifacts/5581b499-a471-4d58-8e05-1...). I know enough about compilers, bytecode VMs, parsers, & interpreters to know that this is well within the capabilities of any reasonably good software engineer but the implementation from Gemini 3.1 Pro (high & low) & Claude Opus 4.6 (thinking) have been less than impressive.

              • By pushedx 2026-02-267:54

                sorry, needed to edit this comment to ask the same question as the sibling:

                have you run these models in an agent mode that allows for executing the tests, the agent views the output, and iterates on its own for a while? up to an hour or so?

                you will get vastly different output if you ask the agent to write 200 of its own test cases, and then have it iterate from there

              • By Kim_Bruning 2026-02-267:521 reply

                Possibly a dumb question: but are you running this in claude code, or an ide, or basically what are you using to allow for iteration?

                • By measurablefunc 2026-02-269:041 reply

                  I'm using Google's antigravity IDE. I initially had it configured to run allowed commands (cargo add|build|check|run, testing shell scripts, performance profiling shell scripts, etc.) so that it would iterate & fix bugs w/ as little intervention from me as possible but all it did was burn through the daily allotted tokens so I switched to more "manual" guidance & made a lot more progress w/o burning through the daily limits.

                  What I've learned from this experiment is that the hype does not actually live up to the reality. Maybe the next iteration will manage the task better than the current one but it's obvious that basic compiler & bytecode virtual machine design in a language like Rust is still beyond the capabilities of the current coding agents & whoever thinks I'm wrong is welcome to implement the linked specification to see how far they can get by just "vibing".

                  • By Kim_Bruning 2026-02-269:281 reply

                    That's roughly where I'm at too. I have seen people have some more success after having practices though. Possibly the actual workflows needed for full auto are still kind of tacit. Smaller green-field projecs do work for me already though.

                    • By measurablefunc 2026-02-2620:42

                      In my experience a few hundred lines w/ a few crates w/ well-defined scopes & a detailed specification is within current capabilities, e.g. compressing wav files w/ wavelets & arithmetic coding. But it's obvious that a correct parser, compiler, & bytecode VM is still beyond current agents even if the specification is detailed enough to cover basically everything.

            • By kmaitreys 2026-02-267:45

              Can you clarify a bit more about these two orders of magnitude? In what context? Sure, they have "agency" and can do more than outputting text, but I would like to see a proper example of this claim.

          • By joquarky 2026-02-2622:031 reply

            Most humans can't force themselves to come up with something novel immediately upon demand.

            • By measurablefunc 2026-02-2622:18

              Completely unrelated to the topic or any of the points I was making so did you get confused & respond to the wrong thread?

        • By kennyloginz 2026-02-267:461 reply

          There is prior art, so it’s not novel.

          • By brookst 2026-02-2615:481 reply

            Great. Can you point to anything at all that is truly novel, no prior art?

            • By rsync 2026-02-2618:43

              Sliding down handrails on a skateboard.

HackerNews