We find that a highly simplified transformer neural network is able to compute Conway’s Game of Life perfectly, just from being trained on examples of the game.
The simple nature of this model allows us to look at its structure and observe that it really is computing the Game of Life — it is not a statistical model that predicts the most likely next state based on all the examples it has been trained on.
We observe that it learns to use its attention mechanism to compute 3x3 convolutions; 3x3 convolutions are a common way to implement the Game of Life, since they can be used to count a cell's live neighbours, which determine whether the cell lives or dies.
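For illustration, here is a minimal sketch of this approach (the code is ours, not taken from the repo, and assumes numpy and scipy, with cells outside the grid boundary treated as dead): a single 3x3 convolution counts each cell's live neighbours, and the Life rules are then applied to the counts.

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 kernel that sums the 8 neighbours of a cell (centre excluded)
KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])

def life_step(grid):
    """Compute one Game of Life step on a 0/1 numpy array."""
    # Count live neighbours; cells beyond the boundary count as dead.
    neighbours = convolve2d(grid, KERNEL, mode="same", boundary="fill")
    # A cell is alive next step if it has 3 live neighbours,
    # or if it is alive and has 2 live neighbours.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(int)
```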
We refer to the model as SingleAttentionNet because it consists of a single attention block with single-head attention. The model represents a Life grid as a set of tokens, with one token per grid cell.
The following figure shows a Life game computed by a SingleAttentionNet model:
The following figure shows examples of the SingleAttentionNet model's attention matrix over the course of training:
This shows the model learning to compute a 3x3 average pool via its attention mechanism, with the middle cell excluded from the average.
Details
The full code is made available here.
The problem is modeled as:
model(life_grid) = next_life_grid
where gradient descent is used to minimize the loss:
loss = cross_entropy(true_next_life_grid, predicted_next_life_grid)
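As a rough sketch of what one training step with this objective might look like in PyTorch (assuming the model emits a pair of dead/alive logits per cell; the repo's actual code may differ), note that torch's cross_entropy takes the predicted logits first and the targets second:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, life_grid, next_life_grid):
    """One gradient descent step on a batch of (life_grid, next_life_grid) pairs."""
    logits = model(life_grid)  # assumed shape: (batch, n_cells, 2)
    loss = F.cross_entropy(
        logits.reshape(-1, 2),              # one (dead, alive) logit pair per cell
        next_life_grid.reshape(-1).long(),  # target class per cell: 0 or 1
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```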
Life grids are generated randomly, to provide a limitless source of training pairs (life_grid, next_life_grid). Some examples:
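A sketch of how such pairs could be generated, reusing the life_step function from the sketch above (the cell density p_alive is our assumption):

```python
import numpy as np

def random_pair(size, p_alive=0.5, seed=None):
    """Generate one (life_grid, next_life_grid) training pair."""
    rng = np.random.default_rng(seed)
    life_grid = (rng.random((size, size)) < p_alive).astype(int)
    return life_grid, life_step(life_grid)  # life_step from the sketch above
```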
Model diagram
The model in the diagram processes 2x2 Life grids, which means 4 tokens in total per grid. Blue text indicates parameters that are learned via gradient descent. The arrays are labelled with their shape (with the batch dimension omitted).
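As a rough illustration of the architecture described, a minimal PyTorch sketch might look like the following (the embedding size, MLP head, and other details are our assumptions, not necessarily the repo's exact architecture):

```python
import torch
from torch import nn

class SingleAttentionNet(nn.Module):
    """Sketch: one single-head attention block over per-cell tokens."""

    def __init__(self, grid_size: int, d_model: int = 32):
        super().__init__()
        n_cells = grid_size * grid_size
        self.cell_embed = nn.Embedding(2, d_model)       # dead/alive -> vector
        self.pos_embed = nn.Embedding(n_cells, d_model)  # one embedding per cell position
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 2),  # (dead, alive) logits per cell
        )

    def forward(self, grid):
        # grid: (batch, n_cells) of 0/1 integers, the flattened Life grid
        pos = torch.arange(grid.shape[1], device=grid.device)
        x = self.cell_embed(grid) + self.pos_embed(pos)
        attn_out, _ = self.attn(x, x, x)
        return self.head(attn_out)  # (batch, n_cells, 2) logits
```

Since the grid is flattened into a sequence of tokens, the positional embeddings are what allow the attention layer to work out which tokens are spatial neighbours.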
Training
On a GPU, training the model takes anywhere from a couple of minutes to ten minutes, or fails to converge, depending on the seed and other training hyperparameters. The largest grid size we successfully trained on was 16x16.
Notes
We tried replacing the attention layer of the model with a manually computed Neighbour Attention matrix, and found that the model learned its task far more quickly and generalised to arbitrary grid sizes. The same was true when we replaced the layer with a 3x3 average pool.
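As a sketch of how such a fixed Neighbour Attention matrix could be constructed (assuming a bounded, non-wrapping grid; the original's boundary handling is not stated here):

```python
import numpy as np

def neighbour_attention_matrix(h, w):
    """(h*w, h*w) matrix; row i places uniform weight on cell i's grid neighbours."""
    A = np.zeros((h * w, h * w))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # exclude the centre cell, as the model learned to
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        A[i, rr * w + cc] = 1.0
    return A / A.sum(axis=1, keepdims=True)  # normalise rows, as softmax would
```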
We detected that the model had converged by looking for 1024 training batches with perfect predictions, and by checking that the model could perfectly run 100 Life games for 100 steps.
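A sketch of that second check, reusing life_step and random_pair from the earlier sketches (model_step is assumed to be a wrapper that maps a grid to the model's predicted next grid):

```python
import numpy as np

def runs_perfectly(model_step, n_games=100, n_steps=100, size=16):
    """Return True if the model reproduces n_steps of Life on n_games random grids."""
    for seed in range(n_games):
        grid, _ = random_pair(size, seed=seed)
        for _ in range(n_steps):
            true_next = life_step(grid)
            if not np.array_equal(model_step(grid), true_next):
                return False
            grid = true_next  # continue the game from the true state
    return True
```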
We found that it was enough to train the model on the first and second iterations of the random Life games, but it wasn’t enough to just train on the first iterations.
The rules of Life
Life takes place on a 2D grid with cells that are either dead or alive (represented by 0 or 1). A cell has 8 neighbours, which are the cells immediately next to it on the grid.
To progress to the next Life step, the following rules are used:
- If a cell has 3 live neighbours, it will be alive in the next step, regardless of its current state (alive or dead).
- If a cell is alive and has 2 live neighbours, it will stay alive in the next step.
- Otherwise, a cell will be dead in the next step.
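In code, these three rules reduce to a single boolean update per cell (as in the life_step sketch earlier), where neighbours is the live-neighbour count and alive is the cell's current state:

```python
alive_next = (neighbours == 3) | (alive & (neighbours == 2))
```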
These rules are shown in the following plot.
Citation:
@misc{radcliffe_life_transformer_2024,
title={Training a Simple Transformer Neural Net on Conway's Game of Life},
url={https://sidsite.com/posts/life-transformer/},
howpublished={Main page: \url{https://sidsite.com/posts/life-transformer/}, GitHub repository: \url{https://github.com/sradc/life-transformer}},
author={Radcliffe, Sidney},
year={2024},
month={July}
}