
I polished a Markov chain generator and trained it on an article by Uri Alon et al. (https://pmc.ncbi.nlm.nih.gov/articles/PMC7963340/).
It generates text that seems to me at least on par with tiny LLMs, such as those demonstrated by NanoGPT. Here is an example:
jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$
./SLM10b_train UriAlon.txt 3
Training model with order 3...
Skip-gram detection: DISABLED (order < 5)
Pruning is disabled
Calculating model size for JSON export...
Will export 29832 model entries
Exporting vocabulary (1727 entries)...
Vocabulary export complete.
Exporting model entries...
Processed 12000 contexts, written 28765 entries (96.4%)...
JSON export complete: 29832 entries written to model.json
Model trained and saved to model.json
Vocabulary size: 1727
jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$ ./SLM9_gen model.json
Aging cell model requires comprehensive incidence data. To obtain such a large medical database of the joints are risk factors. Therefore, the theory might be extended to describe the evolution of atherosclerosis and metabolic syndrome. For example, late‐stage type 2 diabetes is associated with collapse of beta‐cell function. This collapse has two parameters: the fraction of the senescent cells are predicted to affect disease threshold . For each individual, one simulates senescent‐cell abundance using the SR model has an approximately exponential incidence curve with a decline at old ages In this section, we simulated a wide range of age‐related incidence curves. The next sections provide examples of classes of diseases, which show improvement upon senolytic treatment tends to qualitatively support such a prediction. model different disease thresholds as values of the disease occurs when a physiological parameter ϕ increases due to the disease. Increasing susceptibility parameter s, which varies about 3‐fold between BMI below 25 (male) and 54 (female) are at least mildly age‐related and 25 (male) and 28 (female) are strongly age‐related, as defined above. Of these, we find that 66 are well described by the model as a wide range of feedback mechanisms that can provide homeostasis to a half‐life of days in young mice, but their removal rate slows down in old mice to a given type of cancer have strong risk factors should increase the removal rates of the joint that bears the most common biological process of aging that governs the onset of pathology in the records of at least 104 people, totaling 877 disease category codes (See SI section 9), increasing the range of 6–8% per year. The two‐parameter model describes well the strongly age‐related ICD9 codes: 90% of the codes show R 2 > 0.9) (Figure 4c). This agreement is similar to that of the previously proposed IMII model for cancer, major fibrotic diseases, and hundreds of other age‐related disease states obtained from 10−4 to lower cancer incidence. A better fit is achieved when allowing to exceed its threshold mechanism for classes of disease, providing putative etiologies for diseases with unknown origin, such as bone marrow and skin. Thus, the sudden collapse of the alveoli at the outer parts of the immune removal capacity of cancer. For example, NK cells remove senescent cells also to other forms of age‐related damage and decline contribute (De Bourcy et al., 2017). There may be described as a first‐passage‐time problem, asking when mutated, impair particle removal by the bronchi and increase damage to alveolar cells (Yang et al., 2019; Xu et al., 2018), and immune therapy that causes T cells to target senescent cells (Amor et al., 2020). Since these treatments are predicted to have an exponential incidence curve that slows at very old ages. Interestingly, the main effects are opposite to the case of cancer growth rate to removal rate We next consider the case of frontline tissues discussed above.
A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material. There just isn't enough variation in sentences.
But then, Markov Chains fall apart when the source material is very large. Try training a chain on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of turning the output into simple regurgitation.
In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov Chains are discrete in their next-token prediction while an LLM is fuzzier. LLMs have a latent space where the meaning of a word basically exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov Chains will ONLY generate sequences that existed in the source.
This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source means that it will never come up with anything creative.
>Markov Chains will ONLY generate sequences that existed in the source.
A markov chain of order N will only generate sequences of length N+1 that were in the training corpus, but it is likely to generate sequences of length N+2 that weren't (unless N was too large for the training corpus and it's degenerate).
Well yeah, but in generating that +2 word the chain has already lost the first part of the N-word context.
If you use a context window of 2, then yes, you might know that word C can follow words A and B, and D can follow words B and C, and therefore generate ABCD even if ABCD never existed.
But it could be that ABCD is incoherent.
For example, if A = whales, B = are, C = mammals, D = reptiles.
"Whales are mammals" is fine, "are mammals reptiles" is fine, but "Whales are mammals reptiles" is incoherent.
The longer you allow the chain to get, the more incoherent it becomes.
"Whales are mammals that are reptiles that are vegetables too".
Any 3-word fragment of that sentence is fine. But put it together, and it's an incoherent mess.
That are reptiles!
Right, you can generate long sentences from a first-order Markov model, and all of the transitions from one word to the next can be in the training set, but the full generated sentence may not be.
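A minimal sketch of that point in Python (toy two-fragment corpus, not the SLM trainer from the top of the thread): every 3-word window the chain emits exists in its training data, yet the full 4-word output never does.

    import random
    from collections import defaultdict

    # Toy corpus: each fragment is fine on its own.
    corpus = ["whales are mammals", "are mammals reptiles"]

    # Order-2 chain: last two words -> possible next words.
    transitions = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - 2):
            transitions[(words[i], words[i + 1])].append(words[i + 2])

    state = ("whales", "are")
    output = list(state)
    while state in transitions:
        nxt = random.choice(transitions[state])
        output.append(nxt)
        state = (state[1], nxt)

    print(" ".join(output))  # "whales are mammals reptiles" -- never in the corpus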
> A markov chain of order N will only generate sequences of length N+1 that were in the training corpus
Depends on how you trained it; an LLM is also a Markov chain.
> The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.
I have seen the argument that an LLM can only give you what it's been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".
I am quite confused right now. Could you please help me with this?
Somewhat related: I like the work of David Hume, and he explains quite well how we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them. We know what dragons typically look like, and we know what a pig looks like, and so we can imagine (through our creativity and the combination of these two ideas) what a pig with a dragon head would look like. I wonder how this applies to LLMs, if it applies at all.
Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that are not in their training data already. Is this really true or not?
Well, there's kind of two answers here:
1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The commonly exposed temperature settings control how often the system takes the most likely candidate tokens (see the sketch after this list).
2. A markov chain model will literally have a matrix entry for every possible combination of inputs. So an order-2 chain will have N^2 possible contexts, where N is the number of possible tokens. In that situation "pig with" can never be completed with a brand new sentence, because those continuations have literal 0's in the probability table. In contrast, transformers consider huge context windows, and start with random weights in huge neural network matrices. What people hope happens is that the NN begins to represent ideas, and connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.
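Re point 1, a minimal sketch of what the temperature knob does to a next-token distribution before sampling (the logits here are made up for illustration; real decoders add top-k/top-p and other details):

    import math, random

    def sample_with_temperature(logits, temperature=1.0):
        """Softmax over logits/T, then sample one token. T=0 means greedy."""
        if temperature == 0:
            return max(logits, key=logits.get)          # always the top token
        m = max(logits.values())                        # for numerical stability
        exps = {t: math.exp((l - m) / temperature) for t, l in logits.items()}
        total = sum(exps.values())
        r, acc = random.random() * total, 0.0
        for tok, e in exps.items():
            acc += e
            if r <= acc:
                return tok
        return tok                                      # guard against rounding

    # Hypothetical logits for completing "pig with ...":
    logits = {"a": 2.0, "the": 1.2, "dragon": -1.5}
    print(sample_with_temperature(logits, 0.0))   # deterministic: "a"
    print(sample_with_temperature(logits, 1.5))   # higher T: "dragon" shows up sometimes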
> A markov chain model will literally have a matrix entry for every possible combination of inputs.
The less frequent prefixes are usually pruned away, and there is a penalty score added when backing off to a shorter prefix. In the end, all words are included in the model's prediction, and a typical n-gram SRILM model is able to generate "the pig with dragon head", also with small probability.
Even if you think about the Markov Chain information as a tensor (not a matrix), the computation of probabilities is not a single lookup, but a series of folds.
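A rough sketch of that backoff idea, loosely in the spirit of "stupid backoff" rather than SRILM's actual Katz or Kneser-Ney machinery (the counts below are made up): if the long prefix was pruned or never seen, fall back to a shorter prefix and pay a penalty, so every word ends up with a small nonzero score.

    def backoff_score(counts, context, word, alpha=0.4):
        """Score `word` after `context`, backing off to shorter prefixes.
        Each backoff step multiplies the score by the penalty `alpha`."""
        penalty, context = 1.0, list(context)
        while context:
            table = counts.get(tuple(context))
            if table and word in table:
                return penalty * table[word] / sum(table.values())
            context = context[1:]        # drop the oldest word
            penalty *= alpha             # pay the backoff penalty
        unigrams = counts.get((), {})
        return penalty * unigrams.get(word, 0) / (sum(unigrams.values()) or 1)

    # Hypothetical pruned counts: context tuple -> {next word: count}.
    counts = {
        ("the", "pig"): {"with": 3},
        ("pig",):       {"with": 5, "is": 2},
        ():             {"the": 10, "pig": 7, "with": 8, "dragon": 1, "head": 1},
    }
    print(backoff_score(counts, ("the", "pig"), "with"))    # seen: large score
    print(backoff_score(counts, ("the", "pig"), "dragon"))  # unseen: small but nonzero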
A markov chain model does not specify the implementation details of the function that takes a previous input (and only a previous input) and outputs a probability distribution. You could put all possible inputs into an LLM (there are finitely many) and record the resulting output from each input in a table. "Temperature" is applied to the final output, not inside the function.
Re point 1: no, "temperature" is not an inherent property of LLMs.
The big cloud providers use the "temperature" setting because having the assistant repeat to you the exact same output sequence exposes the man behind the curtain and breaks suspension of disbelief.
But if you run the LLM yourself and you want the best quality output, then turning off "temperature" entirely makes sense. That's what I do.
(The downside is that the LLM can then, rarely, get stuck in infinite loops. Again, this isn't a big deal unless you really want to persist with the delusion that an LLM is a human-like assistant.)
I mostly agree with your intuition, but I’d phrase it a bit differently.
Temperature 0 does not inherently improve “quality”. It just means you always pick the highest probability token at each step, so if you run the same prompt n times you will essentially get the same answer every time. That is great for predictability and some tasks like strict data extraction or boilerplate code, but “highest probability” is not always “best” for every task.
If you use a higher temperature and sample multiple times, you get a set of diverse answers. You can then combine them, for example by taking the most common answer, cross checking details, or using one sample to critique another. This kind of self-ensemble can actually reduce hallucinations and boost accuracy for reasoning or open ended questions. In that sense, somewhat counterintuitively, always using temperature 0 can lead to lower quality results if you care about that ensemble style robustness.
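A minimal sketch of that self-ensemble idea. `ask_model` is a hypothetical stand-in for whatever API or local runner you use; the point is only the sample-then-vote structure:

    from collections import Counter

    def majority_vote(ask_model, prompt, n=5, temperature=0.8):
        """Sample the model n times at a nonzero temperature and return the
        most common answer plus the fraction of samples that agreed with it."""
        answers = [ask_model(prompt, temperature=temperature) for _ in range(n)]
        best, votes = Counter(answers).most_common(1)[0]
        return best, votes / n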
One small technical nit: even with temperature 0, decoding on a GPU is not guaranteed to be bit identical every run. Large numbers of floating point ops in parallel can change the order of additions and multiplications, and floating point arithmetic is not associative. Different kernel schedules or thread interleavings can give tiny numeric differences that sometimes shift an argmax choice. To make it fully deterministic you often have to disable some GPU optimizations or run on CPU only, which has a performance cost.
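A tiny illustration of the non-associativity point, no GPU required:

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- same numbers, different grouping, different result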
I’m working on a new type of database. There are parts an LLM can help with, because they are common to other databases or software. Then there are parts it can’t help with; if I try, it just totally fails in subtle ways. I’ve provided it with the algorithm, but it can’t understand that it is a close variation of another algorithm and that it shouldn’t implement the other algorithm. A practical example is a variation of Paxos that only exists in a paper, but it will consistently implement Paxos instead of this variation, no matter what you tell it.
Even if you point out that it implemented vanilla Paxos, it will just go “oh, you’re right, but the paper is wrong; so I did it like this instead”… the paper isn’t wrong, and instead of discussing the deviation before writing, it just writes the wrong thing.
> I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".
People who claim this usually don’t bother to precisely (mathematically) define what they actually mean by those terms, so I doubt you will get a straight answer.
How can anyone "mathematically" define "revolutionary"?
LLMs have the ability to learn certain classes of algorithms from their datasets in order to reduce errors when compressing their pretraining data. If you are technically inclined, read the reference: https://arxiv.org/abs/2208.01066 (optionally the follow-up work) to see how LLMs can pick up complicated algorithms from training on examples that could have been generated by such algorithms (in one of the cases the LLM is better than anything we know; in the rest it is simply just as good as our best algos). Learning such functions from data would not work with Markov chains at any level of training. The LLMs in this study are tiny. They are not really learning a language, but rather how to perform regression.
Transformers are performing a (soft, continuous) beam search inside themselves, with the beam width being no bigger than the number of k-v pairs in the attention mechanism.
In my experience, equipping a Markov Chain with beam search greatly improves its predictive power, even if the Markov Chain is a heavily pruned ARPA 3-gram model.
What is more, Markov Chains are not restricted to immediate prefixes; you can use skip-grams as well. How to use them and how to mix them into a list of probabilities is shown in the paper on Sparse Non-negative Matrix Language Modeling [1].
[1] https://aclanthology.org/Q16-1024/
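A toy sketch of mixing an ordinary trigram context with one simple skip-gram context (skip the word immediately before the target). This is only meant to illustrate the idea of combining several context types into one list of probabilities; it is not the SNM model from the paper.

    from collections import defaultdict

    def train(sentences, n=3):
        """Count regular (n-1)-word contexts and a skip-gram variant that
        drops the word right before the target."""
        regular = defaultdict(lambda: defaultdict(int))
        skip = defaultdict(lambda: defaultdict(int))
        for s in sentences:
            w = s.split()
            for i in range(n - 1, len(w)):
                regular[tuple(w[i - n + 1:i])][w[i]] += 1
                skip[tuple(w[i - n + 1:i - 1])][w[i]] += 1   # skip w[i-1]
        return regular, skip

    def predict(regular, skip, context, lam=0.7):
        """Linearly mix the two context types into one next-word distribution."""
        scores = defaultdict(float)
        for table, weight, key in [(regular, lam, tuple(context)),
                                   (skip, 1 - lam, tuple(context[:-1]))]:
            counts = table.get(key, {})
            total = sum(counts.values())
            for word, c in counts.items():
                scores[word] += weight * c / total
        return dict(scores)

    reg, sk = train(["the pig with a dragon head", "the pig with a curly tail"])
    print(predict(reg, sk, ["with", "a"]))   # mixes both evidence sources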
I think I should look into that link of yours later. Having skimmed it, I should say it... smells interesting in some places. For one example, decision tree learning is performed with a greedy algorithm which, I believe, does not use oblique splits, whereas transformers inherently learn oblique splits.
> LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus
That's not true. Or at least it's only as true as it would be for a human that read all the books in the world. That human has only seen that training data. But somehow they can come up with the Higgs boson, or whatever.
well the people who did the Higgs boson theory worked and re-worked for years all the prior work about elementary particles and arguably did a bunch of re-mixing of all the previous “there might be a new elementary particle here!” work until they hit on something that convinced enough peers that it could be validated in a real-world experiment.
by which i mean to say that it doesn’t seem completely implausible that an llm could generate the first tentative papers in that general direction. perhaps one could go back and compute the likelihood of the first papers on the boson given only the corpus to date before it as researchers seem to be trying to do with the special relativity paper which is viewed as a big break with physics beforehand.
Here's how I see it, but I'm not sure how valid my mental model is.
Imagine a source corpus that consists of:
Cows are big. Big animals are happy. Some other big animals include pigs, horses, and whales.
A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".
An LLM can get a sense of meaning in these words and can return ideas expressed in the input corpus. So in this case it might say "Pigs and horses are happy". It's not limited to responding with verbatim sequences. It can be seen as a bit more creative.
However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.
> However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.
There's absolutely no evidence to support this claim. It'd require humans to exceed the Turing computable, and we have no evidence that is possible.
> A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".
Just for my own edification, do you mean "Are big animals are happy"? "animals happy" never shows up in the source text so "happy" would not be a possible successor to "animals", correct?
Please forgive me. I am not a Markov chain.
> However, LLMs will not be able to represent ideas that it has not encountered before.
Sure they do. We call them hallucinations and complain that they're not true, however.
Hmmm. Didn't think about that.
In people there is a difference between unconscious hallucinations vs. intentional creativity. However, there might be situations where they're not distinguishable. In LLMs, it's hard to talk about intentionality.
I love where you took this.
Hallucinations are not novel ideas. They are novel combinations of tokens constrained by learned probability distributions.
I have mentioned Hume before, and will do so again. You can combine "golden" and "mountain" without seeing a golden mountain, but you cannot conjure "golden" without having encountered something that gave you the concept.
LLMs may generate strings they have not seen, but those strings are still composed entirely from training-derived representations. The model can output "quantum telepathic blockchain" but each token's semantic content comes from training data. It is recombination, not creation. The model has not built representations of concepts it never encountered in training; it is just sampling poorly constrained combinations.
Can you distinguish between a false hallucination and a genuinely novel conceptual representation?
Or, 10,000,000s of times a day while coding all over the world, it hallucinates something it never saw before which turns out to be exactly the thing needed.
It's not quite that they cannot do anything not in the training data. They can also interpolate the training data. They're just fairly bad at extrapolating.
> we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them together.
Funny choice of combination, pig and dragon, since Leonardo Da Vinci famously imagined dragons themselves by combining lizards and cats: https://i.pinimg.com/originals/03/59/ee/0359ee84595586206be6...
Hah, interesting. Pig and dragon just sort of came to mind as I was writing the comment. :D But we can pretty much imagine anything, can't we? :)
I should totally try to generate images using AI with some of these prompts!
FWIW, the results should be "good enough", considering they most likely have "pig" and "dragon" in their training data. I elaborated here on this: https://news.ycombinator.com/item?id=46006535.
That little quip from Hume has influenced my thinking so much that I'm happy to see it again.
I agree, I love him and he has been a very influential person in my life. I started reading him from a very young age in my own language because his works in English were too difficult for me at the time. It is always nice to see someone mention him.
FWIW I do not think he used the "pig with dragon head" example, it just came to my mind, but he did use an example similar to it when he was talking about creativity and the combining of ideas where there was a lack of impression (i.e. we have not actually seen one anywhere [yet we can imagine it]).
> Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?
That is not true and those people are dumb. You may be on Bluesky too much.
If your training data is a bunch of integer additions and you lossily compress this into a model which rediscovers integer addition, it can now add other numbers. Was that in the training data?
It was in the training data. There is implicit information in the way you present each addition. The context provided in the training data is what allows relationships to be perceived and modelled.
If you don't have that in your data you don't have the results.
I am not on Bluesky AT ALL. I have seen this argument here on HN, which is the only "social media" website I use.
I mean, you just said it was.
It wasn't necessarily. You could redefine the "true meaning" of the training data such that it wasn't an addition operation but was actually some other one, with the same data, and then the generalization would be wrong.
Creativity needs to be better defined. And the rest is a learning problem. If you keep on training, learning what you see ...
I think it's more about multidimensionality than anything
> I have seen the argument that LLMs can only give you what its been trained
There's confusing terminology here, and without clarification people talk past one another. "What it's been trained on" is a distribution. It can produce things from that distribution and only things from that distribution. If you train on multiple distributions, you get the union of those distributions, which is again a distribution.
This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind bogglingly impressive machine!)
A distribution is more than its samples. It is the things in between, too. Does the LLM perfectly capture the distribution? Of course not. But it's a compression machine, so it compresses the distribution. Again, different from compressing the samples, like one does with a zip file.
So distributionally, can it produce anything novel? No, of course not. How could it? It's not magic. But sample-wise can it produce novel things? Absolutely!! It would be an incredibly unimpressive machine if it couldn't, and it's pretty trivial to prove that it can do this. Hallucinations are good indications that this happens, but it's impossible to demonstrate on anything but small LLMs since you can't prove any given output isn't in the samples it was trained on (they're just trained on too much data).
> people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?
Up until very recently most LLMs have struggled with the prompt Solve:
5.9 = x + 5.11
This is certainly in their training distribution and has been for years, so I wouldn't even conclude that they can solve problems "in their training data". But that's why I said it's not a perfect model of the distribution.
> a pig with a dragon head
One needs to be quite careful with examples, as you'll have to make the unverifiable assumption that such a sample does not exist in the training data. With the size of training data this is effectively unverifiable.
But I would also argue that humans can do more than that. Yes, we can combine concepts, but this is a lower level of intelligence that is not unique to humans. A variation of this is applying a skill from one domain to another. You might see how that's pretty critical to most animals' survival. But humans created things that are entirely outside nature, things that require more than a highly sophisticated cut-and-paste operation. Language, music, mathematics, and so much more are beyond that. We could be daft and claim music is simply a cut and paste of songs which can all naturally be reproduced, but that will never explain away the feelings or emotion that it produces. Or how we formulated the sounds in our heads long before giving them voice. There is rich depth to our experiences if you look. But doing that is odd and easily dismissed, as our own familiarity deceives us.
The limit of an LLM's "distribution" is effectively only at the token level, though, once the model has consumed enough language. Which is why those out-of-distribution tokens are so problematic.
From that point on the model can infer linguistics even for newly encountered words and concepts. I would even propose it infers meaning in context, just like you would.
It builds conceptual abstractions of MANY levels and all interrelated.
So imagine giving it a task like "design a car for a penguin to drive". The LLM can infer what kind of inputs a car needs and what anatomy a penguin has, and it can wire the two up descriptively. It is an easy task for an LLM. When you think about the other capabilities, like introspection and external state through observation (any external input), there really are not many fundamental limits on what they can do.
(Ignore image generation; how an image is made is an important distinction: end-to-end sequence vs. pure diffusion vs. hybrid.)
> This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind bogglingly impressive machine!)
You could create one of those using both a Markov chain and an LLM.
Though I enjoyed that paper, it's not quite the same thing. There's a bit more subtlety to what I'm saying. To do a surgical patching you'd have to actually have a rich understanding of language but just not have the actual tools to produce words themselves. Think of the sci-fi style robots that pull together clips or recordings to speak. Bumblebee from Transformers might be the most well-known example. But think hard about that, because it requires a weird set of conditions and a high level of intelligence to perform the search and stitching.
But speaking of Markov, we get that in LLMs through generation. We don't have conversations with them. Each chat is unique since you pass it the entire conversation. There's no memory. So the longer your conversations go the larger the token counts. That's Markovian ;)
> I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".
LLMs can absolutely create things that are creative, at least for some definition of "creative".
For example, I can ask an LLM to create a speech about cross-site scripting in the style of Donald Trump:
> Okay, folks, we're talking about Cross-Site Scripting, alright? I have to say, it's a bit confusing, but let's try to understand it. They call it XSS, which is a fancy term. I don't really know what it means, but I hear it's a big deal in the tech world. People are talking about it, a lot of people, very smart people. So, Cross-Site Scripting. It's got the word "scripting" in it, which sounds like it's about writing, maybe like a script for a movie or something. But it's on the internet, on these websites, okay? And apparently, it's not good. I don't know exactly why, but it's not good. Bad things happen, they tell me. Maybe it makes the website look different, I don't know. Maybe it makes things pop up where they shouldn't. Could be anything! But here's what I do know. We need to do something about it. We need to get the best people, the smartest people, to look into it. We'll figure it out, folks. We'll make our websites safe, and we'll do it better than anyone else. Trust me, it'll be tremendous. Thank you.
Certainly there's no text out there that contains a speech about XSS from Trump. There's some snippets here and there that likely sound like Trump, but a Markov Chain simply is incapable of producing anything like this.
Sure that specific text does not exist, but the discrete tokens that went into it would have been.
If you similarly trained a Markov chain at the token level on an LLM-sized corpus, it could produce the same. Lacking an attention mechanism, the token probabilities would be terribly non-constructive for the effort, but it is not impossible.
Let's assume three things here:
1. The corpus contains every Trump speech.
2. The corpus contains everything ever written about XSS.
3. The corpus does NOT contain Trump talking about XSS, nor really anything that puts "Trump" and "XSS" within the same page.
A Markov Chain could not produce a speech about XSS in the style of Trump. The greatest tuning factor for a Markov Chain is the context length. A short length (like 2-4 words) produces incoherent results because it only looks at the last 2-4 words when predicting the next word. This means if you prompted the chain with "Create a speech about cross-site scripting in the style of Donald Trump", then even with a 4-word context, all the model processes is "style of Donald Trump". By the time it reaches the end of the prompt, it has already forgotten the beginning of it.
If you increase the context to 15, then the chain would produce nothing because "Create a speech about cross-site scripting in the style of Donald Trump" has never appeared in its corpus, so there's no data for what to generate next.
The matching in a Markov Chain is discrete. It's purely a mapping of (series of tokens) -> (list of possible next tokens). If you pass in a series of tokens that was never seen in the training set, then the list of possible next tokens is an empty set.
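A minimal sketch of that discrete lookup coming up empty (toy table, made-up contexts):

    # Exact context tuples seen in training -> possible next tokens.
    model = {
        ("the", "best", "people"):  ["believe", "say"],
        ("style", "of", "Donald"):  ["Trump"],
    }

    prompt = ("cross-site", "scripting", "in")   # never seen in training
    print(model.get(prompt, []))                 # [] -- nothing to generate from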
Oh, of course, what I want answered did not have much to do with Markov Chains, but with LLMs, because I saw this argument often made against LLMs.
>> The fact that they only generate sequences that existed in the source
> I am quite confused right now. Could you please help me with this?
This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.
I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.
> This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.
Or, in other words, a Markov Chain won't hallucinate. Having a system that only repeats sentences from its source material and doesn't create anything new on its own is quite useful in some scenarios.
> Or, in other words, a Markov Chain won't hallucinate.
It very much can. Remember, the context windows used for Markov Chains are very short, usually in the single digits of words. If you use a context length of 5, then when asking it what the next word should be, it has no idea what the words were before the current context of 5 words. This results in incoherence, which can certainly mean hallucinations.
A Markov chain certainly will not hallucinate, because we define hallucinations as garbage within otherwise correct output. A Markov chain doesn't have enough correct output to consider the mistakes "hallucinations"; in a sense, nothing is a hallucination when everything is one.
You can very easily inject wrong information into the state transition function. And machine learning can and regularly does do so. That is not a difference between an LLM and a markov chain.
Building a static table of seen inputs is just one way of building a state transition table for a markov chain. It is just an implementation detail of a function (in the mathematical sense of the word; no side effects) that takes in some input and outputs a probability distribution.
You could make the table bigger and fill in the rest yourself. Or you could use machine learning to do it. But then the table would be too huge to actually store due to combinatorial explosion, so we find a way to reduce that memory cost. How about we don't precompute the whole table and instead lazily evaluate individual cells as and when they are needed? You achieve that by passing the input through the machine-learned function (a trained network is a fixed function with no side effects). You might say: but that's not the same thing!!! But remember that the learned network will always output the same distribution if given the same input, because it is a function. Let's say you have a context size of 1000 and 10 possible tokens. There are 10^1000 possible inputs. A huge number, but most importantly, a finite number. So you could in theory feed them all in one at a time and record the result in a table. We can't really do this in practice, because the resulting table would be huge, but it is, for all mathematical purposes, equivalent, and you could, in theory, freely transform one into the other.
Et voila! You have built a markov chain anyway. Previous input goes in, magic happens inside (whichever implementation you used to implement the function doesn't matter), and a probability distribution comes out. It's a markov chain. It doesn't quack and walk like a duck. It IS an actual duck.
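A toy version of that argument, with a deliberately tiny vocabulary and context so the table actually fits in memory. The "network" here is just a hypothetical stand-in function; the point is that any fixed context-to-distribution function can be dumped into a transition table.

    from itertools import product

    VOCAB = ["a", "b"]      # 2 possible tokens
    CONTEXT = 3             # context window of 3

    def next_token_distribution(context):
        """Stand-in for a trained network: any fixed, side-effect-free
        function of the context will do. This one repeats the last token
        80% of the time."""
        last = context[-1]
        other = "b" if last == "a" else "a"
        return {last: 0.8, other: 0.2}

    # Enumerate every possible input and record the output: the result is a
    # plain Markov transition table, equivalent to the original function.
    table = {ctx: next_token_distribution(ctx)
             for ctx in product(VOCAB, repeat=CONTEXT)}

    print(len(table))              # 8 rows: 2^3 possible contexts
    print(table[("a", "b", "b")])  # {'b': 0.8, 'a': 0.2}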
"This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative."
That reads like anything useful needs to be creative? I would disagree here. A digital assistant in control of an automatic door, for example, should not be creative, but stupidly do exactly as told. "Open the door." "Close the door." And the "creativity" of AI agents I rather see as a danger here.
Most physical problems require some level of creativity: your door opening robot should be able to handle some level of dust on the handle, some level of slipperiness of the floor, some amount of packages blocking where it wants to stand, and crucially: some level of all of those things where it gives up rather than causing damage.
You can't open-loop everything, and the edge cases in a closed loop absolutely explode.
Yes, a “digital assistant” responsible only for handling a door is manageable, but even “get a pot, fill it with water, and boil it” gets _remarkably_ complicated if you need to reduce all the edge cases to known regions of behaviour that you can pre-program responses to.
Yes, but I want those sensor inputs and failure modes handled with deterministic classical algorithms without (LLM) creativity, otherwise I envision doors that refuse to open on certain dates, because somewhere in the training corpus was a satirical sci-fi story about depressed smart doors.
(And no, I was not talking about a robot butler, but an automatic door. Infrared triggered, but enhanced with a text mode for control.)
> A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material.
Strictly speaking, this is true of one particular way (the most straightforward) to derive a Markov chain from a body of text; a Markov chain is just a probabilistic model of state transitions where the probability of each possible next state depends only on the current state. Having the states be word sequences of some number of words, overlapping by all but one word, and having the probabilities be simply the frequency with which the added word in the target state follows the sequence in the source state in the training corpus is one way you can derive a Markov chain from a body of text, but not the only one.
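A sketch of exactly that most straightforward derivation (toy text; not the SLM trainer from the top of the thread): states are overlapping word windows, and the transition probabilities are just relative frequencies.

    from collections import Counter, defaultdict

    def derive_chain(text, order=2):
        """States are `order`-word windows; each transition probability is the
        relative frequency of the word that extends the window."""
        words = text.split()
        counts = defaultdict(Counter)
        for i in range(len(words) - order):
            counts[tuple(words[i:i + order])][words[i + order]] += 1
        return {state: {w: c / sum(ctr.values()) for w, c in ctr.items()}
                for state, ctr in counts.items()}

    chain = derive_chain("the cat sat on the mat and the cat slept")
    print(chain[("the", "cat")])   # {'sat': 0.5, 'slept': 0.5}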
> A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material.
I see this as a strength; try training an LLM on a 42KB text to see if it can produce a coherent output.
I've had a lot of fun training Markov chains using Simple English Wikipedia. I'm guessing the restricted vocabulary leads to more overlapping sentences in the training data. Anything too advanced or technical has too many unique phrases and the output degrades almost immediately.
> You'll find that the resulting output becomes incoherent garbage.
I also do that kind of thing with LLMs. The other day, I don't remember the prompt (something casual really, not trying to trigger any issue), but Le Chat (Mistral) started to regurgitate "the the the the the...".
And this morning I was trying some local models, trying to see if they could output some Esperanto. Well, that was really a mess of random morphs thrown together. Not syntactically wrong, but so out of touch with any possible meaningful sentence.
Yeah, some of the failure modes are the same. This one in particular is fun because even a human, given "the the the" and asked to predict what's next, will probably still answer "the". How a Markov chain gets onto the "the the the" train and how the LLM does are pretty different, though.
I wonder if the "X is not Y - it's Z" LLM shibboleth is just an artifact of "is not" being the third most common bigram starting with "is", just after "is a" and "is the" [0]. It doesn't follow as simply as it does with Markov chains, but maybe this is where the tendency originated, and it was later trained and RLHFed into a shape that kind of makes sense instead of getting eliminated.
I never saw any human starting to loop "the" as a reaction to any utterance though.
Personally my concern is more about the narrative that LLMs are making "chains of thought", can "hallucinate", and that people should become "AI complements". They are definitely making nice inferences most of the time, but they are also a totally different thing compared to human thoughts.
If you learn with Baum-Welch you can get nonzero out-of-distribution probabilities.
Something like a Markov Random Field is much better.
Not sure if anyone has managed to create latent hierarchies from chars to words to concepts. Learning NNs is far more tinkery than the brutality of probabilistic graphical models.
Uhhh... the above comment has a bunch of loose assertions that are not quite true, but with enough truthiness to make them hard to refute. So I'll point to my other comment for a more nuanced comparison of Markov models with tiny LLMs: https://news.ycombinator.com/item?id=45996794
To add to this, the system offering text generation, i.e. the loop that builds the response one token at a time from the LLM's output (and at the same time feeds the LLM the text generated so far), is a Markov model where the transition matrix is replaced by the LLM and the state space is the space of all texts.
>This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.
It's funny because they say the same thing about LLMs (sort of).
Markov Chains could be applied on top of embeddings just as well though
Markov chains of order n are essentially (n+1)-gram models - and this is what language models used to be for a very long time. They are quite good. As a matter of fact, they were so good that more sophisticated models often couldn't beat them.
But then came deep-learning models - think transformers. Here, you don't represent your inputs and states discretely; instead you have a representation in a higher-dimensional space that aims at preserving some sort of "semantics": proximity in that space means proximity in meaning. This allows capturing nuances much more finely than is possible with sequences of symbols from a set.
Take this example: you're given a sequence of n words and are to predict a good word to follow that sequence. That's the thing that LMs do. Now, if you're an n-gram model and have never seen that sequence in training, what are you going to predict? You have no data in your probability tables. So what you do is smoothing: you take away some of the probability mass that you assigned during training to the samples you encountered and give it to samples you have not seen. How? That's the secret sauce, but there are multiple approaches.
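One of the simplest of those approaches is add-k (Laplace) smoothing, sketched below with made-up counts. Real toolkits use cleverer schemes (Katz backoff, Kneser-Ney), but the basic move is the same: shave probability mass off seen events so unseen ones are never assigned zero.

    def addk_probability(counts, context, word, vocab, k=0.5):
        """P(word | context) with add-k smoothing: every vocabulary word gets
        at least a little probability mass, even if never seen after this
        context."""
        table = counts.get(tuple(context), {})
        total = sum(table.values())
        return (table.get(word, 0) + k) / (total + k * len(vocab))

    # Hypothetical counts: "big" was only ever followed by "animals".
    counts = {("big",): {"animals": 3}}
    vocab = ["animals", "whales", "reptiles", "big", "are"]
    print(addk_probability(counts, ("big",), "animals", vocab))  # ~0.64
    print(addk_probability(counts, ("big",), "whales", vocab))   # ~0.09, not zero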
With NN-based LLMs, you don't have that exact same issue: even if you have never seen that n-word sequence in training, it will get mapped into your high-dimensional space. And from there you'll get a distribution that tells you which words are good follow-ups. If you have seen sequences of similar meaning (even with different words) in training, these will probably be better predictions.
But for n-grams, just because you have seen sequences of similar meaning (but with different words) during training, that doesn't really help you all that much.
In theory, you could have a large enough Markov chain that mimics an LLM; it would just need to be exponentially larger in width.
After all, it's just matrix multiplies from start to finish.
A lot of the other data operations (like normalization) can be represented as matrix multiplies, just less efficiently. In the same way, a transformer can be represented, inefficiently, as a set of fully connected deep layers.
True. But the considerations re: practicability are not to be ignored.
>just because you have seen sequences of similar meaning (but with different words) during training, that doesn't really help you all that much.
Sounds solvable with synonyms? The same way keyword search is brittle but does much better when you add keyword expansion.
Probably the arbitrariness of grammar would nuke performance here. You'd want to normalize the sentence structure too. Hmm...
> So what you do is smoothing: you take away some of the probability mass that you have assigned during training to the samples you encountered and give it to samples you have not seen.
And then you can build a trillion dollar industry selling hallucinations.
yes, but on this n-gram vs. transformers point: if you consider the more general paradigm, the self-attention mechanism is basically a special form of a graph neural network [1].
[1] Bridging Graph Neural Networks and Large Language Models: A Survey and Unified Perspective https://infoscience.epfl.ch/server/api/core/bitstreams/7e6f8...
Other comments in this thread do a good job explaining the differences in the Markov algorithm vs the transformer algorithm that LLMs use.
I think it's worth mentioning that you have indeed identified a similarity, in that both LLMs and Markov chain generators have the same algorithm structure: autoregressive next-token generation.
Understanding Markov chain generators is actually a really, really good step towards understanding how LLMs work overall, and I think it's a really good pedagogical tool.
Once you understand Markov generation, doing a bit of handwaving to say "and LLMs are just like this except with a more sophisticated statistical approach" has the benefit of being true, demystifying LLMs, and also preserving a healthy respect for just how powerful that statistical model can be.
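A sketch of that shared structure. The loop below does not care whether `next_token_dist` is an exact-match lookup into n-gram counts or a transformer forward pass; only the sophistication of that one function differs (both the loop and the toy table here are illustrative, not anyone's production code):

    import random

    def generate(next_token_dist, prompt, max_tokens=20):
        """Autoregressive loop shared by Markov chains and LLMs:
        predict a distribution, sample, append, repeat."""
        tokens = list(prompt)
        for _ in range(max_tokens):
            dist = next_token_dist(tokens)       # the only model-specific part
            if not dist:
                break
            words, weights = zip(*dist.items())
            tokens.append(random.choices(words, weights=weights)[0])
        return " ".join(tokens)

    # Markov-chain flavour: exact-match lookup on the last two tokens.
    table = {("the", "cat"): {"sat": 0.5, "slept": 0.5},
             ("cat", "sat"): {"down": 1.0}}
    print(generate(lambda toks: table.get(tuple(toks[-2:]), {}), ["the", "cat"]))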