Where Is GPT in the Chomsky Hierarchy?

2025-12-14 22:37 | fi-le.net

A mathematical argument showing why GPT, even with infinite context, cannot be Turing complete—and why that’s actually a feature.

14th of December, 2025

The Chomsky hierarchy is a way to classify sets of strings (formally called languages) by how expressive the machinery needed to generate them must be. Since generative pretrained transformers, GPTs, are getting quite a bit of attention these days, one might wonder where in the hierarchy they fall. To give a classic example of one category in the hierarchy, context-free languages are those whose strings are produced by iteratively applying rules that locally expand grammar symbols into words. This covers practically all programming languages, because they conform to a syntax tree in which each branch is an application of such a production rule. A subcategory of the context-free languages are the n-gram languages, where the possible continuations of a string depend only on a bounded number of words that came before in the sentence (whereas the syntax tree of a context-free language can have unboundedly many branches).
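To make the contrast concrete, here is a minimal sketch (with hypothetical helper names) of a context-free rule applied recursively. Deciding membership in the resulting balanced-parentheses language requires an unbounded counter, which no fixed-window n-gram model can supply:

```python
import random

# Toy context-free grammar for balanced parentheses:
#   S -> ( S ) S  |  epsilon
# Each call locally expands the nonterminal S; each expansion is
# one production rule, i.e. one branch of the syntax tree.
def generate(depth_limit=4):
    if depth_limit == 0 or random.random() < 0.5:
        return ""
    return "(" + generate(depth_limit - 1) + ")" + generate(depth_limit - 1)

# Membership needs a counter that can grow without bound -- more
# memory than any fixed n-gram window provides.
def is_balanced(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

assert is_balanced(generate())  # the grammar only produces balanced strings
assert is_balanced("(())()")
assert not is_balanced("(()")
```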

Human language is of course much more complicated than these simple classes of languages. One argument for this could be that humans are generally intelligent, and can therefore simulate a Turing machine, which at least in principle gives us the power to understand any computable language, a strictly more expressive category than all the languages discussed above. A more prosaic argument would describe concrete kinds of sentences that provably cannot be produced by, say, context-free grammars, which I will leave to linguistics textbooks.

Surely then, a system like GPT, which understands human language so well that it can write perfect sonnets and win math competitions, must sit on the same rung of the Chomsky hierarchy as humans, by being Turing complete?

First, we could be cheeky and say that, because GPT has a finite context window with finitely many possible tokens at each position, it can trivially only model n-gram languages. This is true, but not really in the spirit of the question. By the same logic, humans would be finite-state machines, because the molecules in our brains within a finite space can only occupy finitely many combinations of Planck volumes. Let's grant our GPT an infinite context window.

Conversely, some assumptions are too generous to describe real-world GPTs. For example, Pérez, Barceló, and Marinkovic (2021) prove that transformers are in fact Turing complete if we allow adding new layers as the input sequence length grows.

There is a simple argument showing that GPTs, even with an infinite context window, cannot be Turing complete. Since the vocabulary size is finite, the rows of the embedding matrix form a bounded set, so each embedding lives in a compact subset of R^n. By Tychonoff's theorem, the product of the compact spaces the input embeddings live in is also compact. Because the transformer architecture performs a bounded number of continuous operations, the output probabilities are continuous functions on this compact set; since the softmax makes every token's probability strictly positive, each probability attains a positive minimum. In particular, the probability of emitting the end-of-sequence token (the special token that halts the completion) is bounded from below by a positive constant, so the probability of stopping in finite time is 1.
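As a sanity check on the last step, a sketch: if the end-of-sequence probability is at least some ε > 0 at every position (the lower bound from the compactness argument), the probability of still running after n steps decays geometrically. The value of ε below is made up purely for illustration.

```python
import random

eps = 0.01  # hypothetical lower bound on P(end-of-sequence) per step

def p_still_running(n, eps=eps):
    # Worst case: EOS probability is exactly eps at every step,
    # so the survival probability after n steps is (1 - eps)^n -> 0.
    return (1 - eps) ** n

assert p_still_running(1_000) < 5e-5
assert p_still_running(10_000) < 1e-43

# The corresponding sampling loop therefore halts with probability 1:
random.seed(0)
steps = 0
while random.random() >= eps:  # each iteration emits a non-EOS token
    steps += 1
```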

Some readers will have noticed that the above argument describes a slightly different GPT from the one actually deployed, but the intuition does apply to transformers in general. While other architectures like RNNs can perform an unbounded number of computations during inference, transformers cannot, even in theory. In practice, transformers rely on strange crutches like thinking tokens (e.g. DeepSeek-R1 likes to output: "But wait, …") to do what might be called approximating Turing completeness. Similar but more involved arguments prove that commonly used GPT architectures without chain-of-thought cannot learn a generalizable algorithm for even the simplest problems, such as telling apart odd numbers of symbols ("yyy") from even ones ("yyyy") (Hahn 2020).
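For reference, this PARITY problem is trivial for any machine that can carry a single bit of state across unboundedly many sequential steps, as the sketch below shows; it is the fixed-depth, parallel structure of the transformer that makes it hard, not the problem itself.

```python
# PARITY: accept strings of "y" whose length is odd.
# One bit of state, updated once per symbol, decides it -- exactly the
# unbounded sequential computation a fixed-depth transformer lacks.
def odd_parity(s: str) -> bool:
    state = 0
    for ch in s:
        assert ch == "y"
        state ^= 1  # flip the bit for every symbol seen
    return state == 1

assert odd_parity("yyy")       # odd count: accepted
assert not odd_parity("yyyy")  # even count: rejected
```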

But please do not get any ideas about going back to recurrent networks. This limitation of transformers is precisely the reason why GPTs are so easy to train; the transformer's lesser expressiveness is a feature, not a bug. The fact that the length of the computational path between the input tokens and the output probabilities is bounded is what allows parallelized training (see also Merrill and Sabharwal 2023).
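A rough NumPy sketch of why the bounded computation path pays off in training: one causal self-attention pass produces the outputs for all T positions at once, so every next-token loss term comes out of a single forward pass, whereas an RNN would need T sequential steps. Shapes and weights here are arbitrary stand-ins, not any deployed architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                           # sequence length, model width
x = rng.normal(size=(T, d))            # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# One matrix multiply scores every pair of positions simultaneously.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask

# Row-wise softmax; each position only attends to itself and earlier ones.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ v  # outputs for all T positions, computed in parallel
assert out.shape == (T, d)
```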

I claim this awkwardness is central to the disagreement between those who believe GPTs are sufficient for automating most of current human labor, and those who believe a fundamentally different architecture is needed. Maybe the Chomsky hierarchy is just a bad way to think about computation, GPTs are fine, and we humans also only ever do fixed-depth approximations to Turing completeness. But maybe unbounded computation path lengths are needed to learn algorithms that generalize like we can. It could be the parallel peril that trumpets the end of the scaling paradigm.

    (1) Reader correction: Programming languages also use a lot of context-sensitive syntactic sugar. The point is that there exists a still somewhat idiomatic subset of each language that uses only a context-free grammar.



Comments

  • By PaulHoule 2025-12-19 18:40 (6 replies)

    The Chomsky Hierarchy isn't the right way to think about this; in fact, the Chomsky paradigm probably held back progress in language technology over the past 70 years.

    In the Kuhnian sense, Syntactic Structures was the vanguard of a paradigm shift that let linguists do "normal science" in terms of positing a range of problems and writing papers about them. It was also useful in computer science for formally defining computer languages that are close enough to natural languages that people find them usable.

    On the other hand, attempts to use the Chomsky theory to develop language technology failed miserably outside of very narrow domains and in the real world Markov models and conditional random fields often outperformed when the function was "understanding", "segmentation", etc.

    From an engineering standpoint, the function that tells whether a production is in the grammar, which is central to the theory, is not so interesting for language technology. I mean, if LLMs were grammar-centric then an LLM would go on strike if you got the spelling or grammar wrong, or correct you in a schoolmarmish William Safire way; instead it is more useful for LLMs to silently correct obvious mistakes the same way that real language users do.

    The other interesting thing is the mixing of syntax and semantics. For a long time I believed the "Language Instinct" idea of Pinker in which the ability to serialize and deserialize language is a peripheral that is attached to an animal, and you need the rest of the animal and even the surrounding world to have "human" competence. Now LLMs come around and manage to show a lot of linguistic competence without the rest of the animal, and boy do we have egg on our faces; coming out of the shadow of the Chomsky theory, structuralism will rear its head again. (e.g. "X is structured like a language" got hung up on the unspoken truth that "language is not structured like a language")

    • By nabla9 2025-12-19 20:44 (4 replies)

      You saw the name "Noam Chomsky" and that started a process in your mind that generated the standard spiel about Syntactic Structures.

      Chomsky Hierarchy is his more fundamental work that joins computer science and linguistics. It was published in IRE Transactions on Information Theory. Chomsky, Noam (1956). "Three models for the description of language" https://chomsky.info/wp-content/uploads/195609-.pdf

      Type-3 grammar ≡ finite-state automaton

      Type-2 grammar ≡ Non-deterministic pushdown automaton

      Type-1 grammar ≡ Linear-bounded non-deterministic Turing machine

      Type-0 grammar ≡ Turing machine

      ps. Chomsky was already aware of Finite State Automata and Turing Machines and understood that they match Type-3 and Type-0. Pushdown Automata were invented later, and the connection between Type-1 grammars and Linear Bounded Automata was made a few years later.

      • By PaulHoule 2025-12-20 16:38 (1 reply)

        An LLM is not type 0. It always finishes in finite time so it is not Turing complete.

        I asked Copilot

           answer yes or no,  is the following Java method syntactically well formed
           
           static int aMethod() { return "5"; }
        
        and got what I thought was the wrong answer

           No.
           It’s syntactically valid as Java code, but it will not compile because it returns a String where 
           an int is required.
        
        because I hadn't specified clearly that I was talking about the type 2 CFG of the parser as opposed to the type 1 behavior of the compiler as a whole. [1] I had a good conversation with copilot about it and I'm sure I'd get better results with a better prompt... It would make a good arXiv paper to pose an LLM grammatical recognition problems by prompting

           here is a set of rules: ...  is the production ... in the grammar?
        
        with a wide range of cases. Somebody who just tries a few examples might be impressed by its capability but if you were rigorous about it you would conclude that an LLM pretends to be able to recognize grammars but can't actually do it.

        And that's true about everything they do; one could argue that "in an exact sense, an LLM can't do anything at all". They'll make a try at a weird question like

           what percent of North Americans would recognize the kitsune hand gesture?
        
        which is a public opinion research question similar in character to

           what is lowest mass eigenstate of the neutrino?
        
        in that it could be answered rigorously (but still in terms of probability, even hep results have p-values)

        [1] javac implements type 1 behavior in the java language which is a substrate

        • By lostmsu 2026-01-02 3:40

          > It always finishes in finite time so it is not Turing complete.

          This is incorrect. The simplest LLM inference is basically a `do { choose_word(...); } until (word == '<special stop word>');` loop. That loop may never stop.

      • By xg15 2025-12-19 21:00 (1 reply)

        I think it could be useful to combine the two paradigms to maybe get a better understanding of what transformers can and cannot learn.

        E.g. would it be possible to create an algorithm that takes a grammar (and maybe a desired context window size) as input and constructs a transformer network that generates sentences exactly from that grammar?

        ("Construct" meaning directly setting the weights, without any iterative training process)

        • By nabla9 2025-12-19 21:08 (1 reply)

          They are combined. Chomsky Hierarchies are the core of modern Computer science because they map perfectly into automata theory. They are always taught together in computer science.

          >E.g. would it be possible to create an algorithm that takes a grammar (and maybe a desired context window size) as input and constructs a transformer network that generates sentences exactly from that grammar?

          You don't need transformers for what you describe. That's 101 theory of computation class where you learn about automata, grammars, parsers, and generators.

          • By xg15 2025-12-19 21:18 (1 reply)

            Yeah, I know the theory of formal grammars and automata that the Chomsky hierarchy is part of. What I meant is that language models and specifically transformer networks are usually entirely separate from that theory, so it would be useful to build a bridge between "modern" language processing using GPTs/LLMs and the classical formal theory.

            The most obvious overlap in usage is with programming languages: LLMs can parse and generate code in formal languages, but their processing model is completely different from syntax trees and parsers. So the question is, how do they store the formal structure of a programming language and could this be mapped back in any way to a grammar or automaton?

            • By PaulHoule 2025-12-20 1:05

              The way I see it is that attention is graph structured -- this token here is connected to that token there and so forth by the attention lighting up or also in the sense that there are a bunch of places in the document where people are talking about "Noam" or "Chomsky" or "Noam Chomsky" or "The Author" or "him", etc.

              Alternately, if you were looking at it from a semantic web perspective, the knowledge expressed in a document is a graph, and that graph structure is more fundamental than the tree structure of a text because you could express the same knowledge in different orders. Serialization fundamentally requires putting things in some specific order which might specifically be chronological (work from 2005 to the present as a play with many acts) or could be organized around some conceptual hierarchy (kitsune legends, self psychology, character acting, animal and human behavior and physiology, ...) or the minimization or elimination of backward references (whatever it is that the C spec does a touch wrong but post-Common Lisp specs do right), etc. Ultimately the graph is pruned away into a tree where the remaining links are denoted by syntactic features in the local scale of the document; you're kinda left filling in the rest of the links with some combination of pragmatics, logical inference, something like SAT solving, etc.

              A conventional parsing point of view sees a Java program as a tree but for ordinary purposes it does not matter what order you put the fields and methods in and even though procedural programs are allegedly a sequence of operations done in a certain order it is frequently the case that it does not matter at all if you run line 71 or line 75 first so it is often the case that the graph is the real thing and the trees that we're so comfortable with are the shadows on the walls of the cave.

      • By yongjik 2025-12-19 22:37 (1 reply)

        Chomsky Hierarchy pertains to "languages" defined in the theory of computation: i.e., it is a subset of the set of all finite sequence of alphabets (for some fixed notion of "alphabets"). If a sentence (a particular finite sequence of alphabets) is in the subset, then it is a "valid" sentence of the language. Otherwise it is invalid.

        It should be already clear from this that this notion of language is rather different from natural languages. For example, if there is a formal language that contains "Good morning" and "My hovercraft is full of eels" as valid sentences, then nothing distinguishes these sentences any more. (Of course you could add annotations and build semantic values but they are not essential to the discussion of formal languages.)

        It gets a bit more ridiculous when you try to connect LLMs to the Chomsky hierarchy. Modern LLMs do not really operate on the principle of "is this a valid sentence?" yet provide vastly superior results when it comes to generating naturally sounding sentences.

        I think LLMs have put an end to any hope that formal language theory (in the style of Chomsky Hierarchy) will be relevant to understanding human languages.

        • By littlestymaar 2025-12-20 6:19

          > For example, if there is a formal language that contains "Good morning" and "My hovercraft is full of eels" as valid sentences, then nothing distinguishes these sentences any more.

          Mind explaining a bit? Because I've no idea what you mean.

      • By Legend2440 2025-12-19 21:15 (1 reply)

        The trouble is that english doesn’t fit neatly into any of these categories. It has features that make it at least a context-free language, but can’t handle other features of context-free languages like unlimited nesting.

        Ultimately these are categories of formal languages, and natural language is an entirely different kind of thing.

        • By nabla9 2025-12-19 21:42 (1 reply)

          Strictly speaking natural languages fit into Context-Sensitive (Type-1) in Chomsky Hierarchy, but that's too broad to be useful.

          In practice they are classified into MCSL (Mildly Context-Sensitive) subcategory defined by Aravind K. Joshi.

          • By krnlclnl 2025-12-19 22:09 (2 replies)

            Sure, if you accept and agree with Joshi.

            No reason to do that though, except to validate some random person's perspective on language. The sky will not open and smash us with a giant foot if we reject such an obligation.

            • By nabla9 2025-12-20 8:48 (1 reply)

              Natural languages being in MCSL (Mildly Context-Sensitive) is the consensus among linguists, not some random individual's viewpoint.

              • By krnlclnl 2025-12-21 2:38

                OK? What concrete problems that human biology faces are resolved by this group's consensus? Obsession with notation does little to improve crop yields, or improve working conditions for the child labor these academic geniuses rely on.

                Sure, linguists, glad you found some semantics that fit your obsession. Happy for you!

                Most people will never encounter their work and live their lives never knowing such an event happened.

            • By suddenlybananas 2025-12-20 11:20 (1 reply)

              You can also reject quantum physics and the sky will not open and smash us with a giant foot. However, to do so without serious knowledge of physics would be quite dumb.

              • By krnlclnl 2025-12-21 2:30 (1 reply)

                Apples and oranges. Language emerges from human biology which emerges from the physical realm. In the end language emerges then from the physical realm. Trying to de-couple it from physical nature and make it an abstract thought bubble is akin to bike shedding in programming.

    • By czzr 2025-12-19 20:03

      I don’t think LLMs contradict the Pinker description - it just turns out that the output stream embodies a lot of information, and you can construct a model (of some reasonable fidelity) of the original sources from sufficient examples of the output stream.

    • By ogogmad 2025-12-19 19:14 (1 reply)

      I think most PL parsers use recursive descent, which doesn't require any formal language theory. That said, it's possible that formal language theory has enabled regular expressions to enter widespread use - first via theory; then to make tokenisers; then included in text editors; and then to make GREP.

      • By Maxatar 2025-12-19 19:27 (1 reply)

        You're going to have a very hard time parsing a language using recursive descent if you don't understand some formal language theory. Recursive descent algorithms have a lot of trouble with left-recursion, and can result in arbitrary backtracking if you're not careful, which can blow up memory and time. To fix left-recursion you need to rewrite your grammar, which means familiarity with formal language theory. To avoid backtracking you typically use an LL(N) or variations thereof, which once again require having a pretty good understanding of formal language theory.

        • By FeepingCreature 2025-12-19 19:59 (2 replies)

          Nah. Source: have done it without formal language theory. You can do most things by printf debugging and kludging.

          (Left-recursion? Use a while loop! Mutable state is a pathway to many abilities functional programmers consider unnatural.)

          • By PaulHoule 2025-12-19 20:48

            You can definitely write parsers and lexers knowing just enough to be dangerous.

            I remember being in the PHP culture where people wrote regex-based hacks instead of the character-at-a-time or token-at-a-time state machines or state machine + a stack kind of things which would be too slow in a language like that.

            The less you know the faster you get in trouble when things get tough.

          • By Maxatar 2025-12-19 21:10 (1 reply)

            My statement was specific to recursive descent parsing.

            • By FeepingCreature 2025-12-20 1:22

              But recursive descent parsing is implemented with functions operating on parser state. The whole point of it is that it's very easy to write a parser, because the abstraction exactly matches function calling in programming languages. So you can use the whole range of PL primitives too.

    • By suddenlybananas 2025-12-19 20:35

      > For a long time I believed the "Language Instinct" idea of Pinker in which the ability to serialize and deserialize language is a peripheral that is attached to an animal and you need the rest of the animal and even the surrounding world to have "human" competence.

      Yeah you just need 10 trillion tokens of data.

    • By jesuslop 2025-12-20 2:12

      In a complementary story, from light reading I gather that Zellig Harris, who advised Chomsky's dissertation, contributed decisively to distributional semantics, a forerunner of word embeddings, in turn a key to the door of neural NLP. The ML guys had a shiny toolbox ready to use for decades. Had it been more fashionable...

      Sahlgren, Distributional Legacy: The Unreasonable Effectiveness of Harris's Distributional Program. https://doi.org/10.1080%2F00437956.2024.2414515

    • By DiscourseFan 2025-12-19 19:43 (1 reply)

      Don't worry, the LLMs are already post-structuralist (at least the ones working on them are often familiar with those thinkers). Which is incredibly disturbing, I think, for many linguists who have spent their entire careers fighting against the French, only to have such philosophical and intellectual work become the most important in the development of this new technology. Hey, at least I'm employed! (And they said it "wasn't serious," that Chomskian Linguistics was the "real science"!)

      • By lukev 2025-12-19 20:03 (2 replies)

        Well, technically, they're more structuralist than post-structuralist. Arguably they even prove that Saussure's original structuralist hypothesis was correct.

        • By PaulHoule 2025-12-20 1:10

          It frustrates me to no end because I can't help but think Saussure is a charlatan, and that's coming from a person who loves the early Foucault and Badiou and Baudrillard [1], but I'm not so sure about Debord and Barthes and Derrida.

          [1] ... as a guilty pleasure not in the "slumming" sense of Backstabbed in a Backwater Dungeon but that I feel that he, like Erving Goffman, brings out the spiritual desolation in me

        • By DiscourseFan 2025-12-19 20:50 (1 reply)

          I really do not like posting anything about my work on this website. Not only am I extremely constrained in what I can say, many engineers seem to vilify any attempt to legitimate the business. If I could speak freely there'd be no end to what I would tell you to the contrary.

          • By lukev 2025-12-19 22:29 (1 reply)

            Well, I'm very interested in this, and if you have things you wish you could speak freely about I can be reached directly via several mediums. My name is not hard to find and I'm on Bluesky & LinkedIn if you want to reach out to me there. I'm one of the authors of this book if you want my full name to search: https://www.amazon.com/Practical-Clojure-Experts-Voice-Sourc...

            • By DiscourseFan 2025-12-20 23:07 (1 reply)

              Nobody who works on these things will be able to say anything for at least a decade, but trust me there will be textbooks.

              • By lukev 2025-12-29 22:17

                Well since you're intent on being so mysterious, would you at least drop a link to where someone with interest in training in philosophy, linguistics & LLMs might apply those skills together?

                Unless it's Palantir or somesuch being grandiose about how much they're going to manipulate society, in which case fuck all the way off into the sun.

  • By pohl 2025-12-19 19:09 (1 reply)

    Regular... albeit with an astronomically large number of states (unless we're granting idealizations like infinite context, etc.)

    • By Netcob 2025-12-19 20:16 (1 reply)

      Exactly, same as all real-world computers.

      Although to be fair, nothing above regular (that I'm aware of, it's been a while) requires infinite space, just unlimited space... you can always make a real Turing machine as long as you keep adding more tape whenever it needs it.

      • By pohl 2025-12-19 21:14

        Yeah, meant to say unbounded

  • By guerrilla 2025-12-19 19:21

    This is a trivial category mistake.

HackerNews