AlphaGenome: AI for better understanding the genome

2025-06-26 14:16 508170 deepmind.google

Introducing a new, unifying DNA sequence model that advances regulatory variant-effect prediction and promises to shed new light on genome function — now available via API.

Žiga Avsec and Natasha Latysheva

The genome is our cellular instruction manual. It’s the complete set of DNA which guides nearly every part of a living organism, from appearance and function to growth and reproduction. Small variations in a genome’s DNA sequence can alter an organism’s response to its environment or its susceptibility to disease. But deciphering how the genome’s instructions are read at the molecular level — and what happens when a small DNA variation occurs — is still one of biology’s greatest mysteries.

Today, we introduce AlphaGenome, a new artificial intelligence (AI) tool that more comprehensively and accurately predicts how single variants or mutations in human DNA sequences impact a wide range of biological processes regulating genes. This was enabled, among other factors, by technical advances allowing the model to process long DNA sequences and output high-resolution predictions.

To advance scientific research, we’re making AlphaGenome available in preview via our AlphaGenome API for non-commercial research, and planning to release the model in the future.

We believe AlphaGenome can be a valuable resource for the scientific community, helping scientists better understand genome function, disease biology, and ultimately, drive new biological discoveries and the development of new treatments.

How AlphaGenome works

Our AlphaGenome model takes a long DNA sequence as input — up to 1 million letters, also known as base-pairs — and predicts thousands of molecular properties characterising its regulatory activity. It can also score the effects of genetic variants or mutations by comparing predictions of mutated sequences with unmutated ones.

Predicted properties include where genes start and where they end in different cell types and tissues, where they get spliced, the amount of RNA being produced, and also which DNA bases are accessible, close to one another, or bound by certain proteins. Training data was sourced from large public consortia including ENCODE, GTEx, 4D Nucleome and FANTOM5, which experimentally measured these properties covering important modalities of gene regulation across hundreds of human and mouse cell types and tissues.

Animation showing AlphaGenome taking one million DNA letters as input and predicting diverse molecular properties across different tissues and cell types.

The AlphaGenome architecture uses convolutional layers to initially detect short patterns in the genome sequence, transformers to communicate information across all positions in the sequence, and a final series of layers to turn the detected patterns into predictions for different modalities. During training, this computation is distributed across multiple interconnected Tensor Processing Units (TPUs) for a single sequence.
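
As a rough illustration of the three stages described above, here is a toy Python sketch. It is not the AlphaGenome implementation: the hand-set `TAT` filter, the scalar single-head attention, and the head names `rna_expression` and `accessibility` are all assumptions made for illustration, and nothing here is trained.

```python
import math

BASES = "ACGT"


def one_hot(seq: str) -> list[list[float]]:
    """Encode each DNA letter as a 4-channel indicator vector."""
    return [[1.0 if b == base else 0.0 for b in BASES] for base in seq]


def conv_layer(x: list[list[float]], kernel: list[list[float]]) -> list[float]:
    """Stage 1: slide one short filter along the sequence (zero padded, ReLU)."""
    width, channels = len(kernel), len(x[0])
    pad = width // 2
    zero = [0.0] * channels
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for i in range(len(x)):
        total = sum(
            padded[i + j][c] * kernel[j][c]
            for j in range(width)
            for c in range(channels)
        )
        out.append(max(0.0, total))
    return out


def self_attention(h: list[float]) -> list[float]:
    """Stage 2: every position attends to every other position (softmax weights)."""
    mixed = []
    for query in h:
        weights = [math.exp(query * key) for key in h]
        norm = sum(weights)
        mixed.append(sum(w * v for w, v in zip(weights, h)) / norm)
    return mixed


def output_heads(h: list[float], heads: dict[str, tuple[float, float]]) -> dict[str, list[float]]:
    """Stage 3: one linear (scale, bias) head per modality, applied per position."""
    return {name: [scale * v + bias for v in h] for name, (scale, bias) in heads.items()}


# Hypothetical hand-set weights; a real model would learn these from data.
kernel = one_hot("TAT")
heads = {"rna_expression": (2.0, 0.0), "accessibility": (1.0, 0.5)}

local = conv_layer(one_hot("GGTATGG"), kernel)
mixed = self_attention(local)
tracks = output_heads(mixed, heads)
print(local)
print(sorted(tracks))
```

The filter responds most strongly where the full `TAT` pattern is centred, the attention step lets that signal influence every other position, and each head turns the shared features into its own per-position track.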

This model builds on our previous genomics model, Enformer, and is complementary to AlphaMissense, which specializes in categorizing the effects of variants within protein-coding regions. These regions cover 2% of the genome. The remaining 98%, called non-coding regions, are crucial for orchestrating gene activity and contain many variants linked to diseases. AlphaGenome offers a new perspective for interpreting these expansive sequences and the variants within them.

AlphaGenome offers several distinctive features compared to existing DNA sequence models:

Long sequence-context at high resolution

Our model analyzes up to 1 million DNA letters and makes predictions at the resolution of individual letters. Long sequence context is important for covering regions regulating genes from far away and base-resolution is important for capturing fine-grained biological details.

Previous models had to trade off sequence length and resolution, which limited the range of modalities they could jointly model and accurately predict. Our technical advances address this limitation without significantly increasing the training resources — training a single AlphaGenome model (without distillation) took four hours and required half of the compute budget used to train our original Enformer model.
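
To make the length and resolution trade-off concrete, the sketch below encodes DNA letters numerically and counts how many output positions a 1-million-letter input yields at different resolutions. One-hot encoding is a common convention for sequence models and is assumed here rather than confirmed by this article, and the 128-letter bin size is a hypothetical value chosen only for contrast.

```python
BASES = "ACGT"


def one_hot(sequence: str) -> list[list[int]]:
    """Encode each DNA letter as a 4-element indicator vector (N becomes all zeros)."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def num_output_positions(sequence_length: int, bin_size: int) -> int:
    """Number of predictions when each output covers `bin_size` letters."""
    return -(-sequence_length // bin_size)  # ceiling division


encoded = one_hot("ACGTN")
base_resolution = num_output_positions(1_000_000, 1)
binned = num_output_positions(1_000_000, 128)  # hypothetical coarse bins

print(encoded)
print(base_resolution, binned)
```

At base resolution every input letter gets its own prediction, so the output is as long as the input; coarser bins shrink the output but blur details such as exact splice sites.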

Comprehensive multimodal prediction

By unlocking high resolution prediction for long input sequences, AlphaGenome can predict the most diverse range of modalities. In doing so, AlphaGenome provides scientists with more comprehensive information about the complex steps of gene regulation.

Efficient variant scoring

In addition to predicting a diverse range of molecular properties, AlphaGenome can efficiently score the impact of a genetic variant on all of these properties in a second. It does this by contrasting predictions of mutated sequences with unmutated ones, and efficiently summarising that contrast using different approaches for different modalities.
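
The contrast between mutated and unmutated sequences can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AlphaGenome API: `toy_predict` is a stand-in model that simply flags `TATA` motifs, and the summed difference is just one possible way to summarise the contrast.

```python
from typing import Callable


def apply_variant(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with `ref` at `position` (0-based) replaced by `alt`."""
    if sequence[position:position + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + len(ref):]


def score_variant(
    predict: Callable[[str], list[float]],
    sequence: str,
    position: int,
    ref: str,
    alt: str,
) -> float:
    """Contrast predictions for the mutated and unmutated sequence."""
    ref_track = predict(sequence)
    alt_track = predict(apply_variant(sequence, position, ref, alt))
    # Hypothetical summary: difference in total predicted signal.
    return sum(alt_track) - sum(ref_track)


def toy_predict(sequence: str) -> list[float]:
    """Stand-in model: 1.0 wherever a 'TATA' motif starts, else 0.0."""
    return [1.0 if sequence[i:i + 4] == "TATA" else 0.0 for i in range(len(sequence))]


reference = "GGCTAGAGGC"
mutated = apply_variant(reference, 5, "G", "T")
score = score_variant(toy_predict, reference, position=5, ref="G", alt="T")
print(mutated, score)
```

Here the G-to-T change creates a `TATA` motif, so the toy score is positive; a real model would return one such contrast per predicted modality.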

Novel splice-junction modeling

Many rare genetic diseases, such as spinal muscular atrophy and some forms of cystic fibrosis, can be caused by errors in RNA splicing — a process where parts of the RNA molecule are removed, or “spliced out”, and the remaining ends rejoined. For the first time, AlphaGenome can explicitly model the location and expression level of these junctions directly from sequence, offering deeper insights about the consequences of genetic variants on RNA splicing.

State-of-the-art performance across benchmarks

AlphaGenome achieves state-of-the-art performance across a wide range of genomic prediction benchmarks, such as predicting which parts of the DNA molecule will be in close proximity, whether a genetic variant will increase or decrease expression of a gene, or whether it will change the gene’s splicing pattern.

Bar graph showing AlphaGenome’s relative improvements on selected DNA sequence and variant effect tasks, compared against results for the current best methods in each category.

When producing predictions for single DNA sequences, AlphaGenome outperformed the best external models on 22 out of 24 evaluations. And when predicting the regulatory effect of a variant, it matched or exceeded the top-performing external models on 24 out of 26 evaluations.

This comparison included models specialized for individual tasks. AlphaGenome was the only model that could jointly predict all of the assessed modalities, highlighting its generality. Read more in our preprint.

The benefits of a unifying model

AlphaGenome’s generality allows scientists to simultaneously explore a variant's impact on a number of modalities with a single API call. This means that scientists can generate and test hypotheses more rapidly, without having to use multiple models to investigate different modalities.

Moreover, AlphaGenome’s strong performance indicates it has learned a relatively general representation of DNA sequence in the context of gene regulation. This makes it a strong foundation for the wider community to build upon. Once the model is fully released, scientists will be able to adapt and fine-tune it on their own datasets to better tackle their unique research questions.

Finally, this approach provides a flexible and scalable architecture for the future. By extending the training data, AlphaGenome’s capabilities could be extended to yield better performance, cover more species, or include additional modalities to make the model even more comprehensive.

“It’s a milestone for the field. For the first time, we have a single model that unifies long-range context, base-level precision and state-of-the-art performance across a whole spectrum of genomic tasks.”

Dr. Caleb Lareau, Memorial Sloan Kettering Cancer Center

AlphaGenome's predictive capabilities could help several research avenues:

  1. Disease understanding: By more accurately predicting genetic disruptions, AlphaGenome could help researchers pinpoint the potential causes of disease more precisely, and better interpret the functional impact of variants linked to certain traits, potentially uncovering new therapeutic targets. We think the model is especially suitable for studying rare variants with potentially large effects, such as those causing rare Mendelian disorders.
  2. Synthetic biology: Its predictions could be used to guide the design of synthetic DNA with specific regulatory function — for example, only activating a gene in nerve cells but not muscle cells.
  3. Fundamental research: It could accelerate our understanding of the genome by assisting in mapping its crucial functional elements and defining their roles, identifying the most essential DNA instructions for regulating a specific cell type's function.

For example, we used AlphaGenome to investigate the potential mechanism of a cancer-associated mutation. In an existing study of patients with T-cell acute lymphoblastic leukemia (T-ALL), researchers observed mutations at particular locations in the genome. Using AlphaGenome, we predicted that the mutations would activate a nearby gene called TAL1 by introducing a MYB DNA binding motif, which replicated the known disease mechanism and highlighted AlphaGenome’s ability to link specific non-coding variants to disease genes.

“AlphaGenome will be a powerful tool for the field. Determining the relevance of different non-coding variants can be extremely challenging, particularly to do at scale. This tool will provide a crucial piece of the puzzle, allowing us to make better connections to understand diseases like cancer.”

Professor Marc Mansour, University College London

AlphaGenome marks a significant step forward, but it's important to acknowledge its current limitations.

As with other sequence-based models, accurately capturing the influence of very distant regulatory elements, such as those more than 100,000 DNA letters away, remains an ongoing challenge. Another priority for future work is further increasing the model’s ability to capture cell- and tissue-specific patterns.

We haven't designed or validated AlphaGenome for personal genome prediction, a known challenge for AI models. Instead, we focused more on characterising the performance on individual genetic variants. And while AlphaGenome can predict molecular outcomes, it doesn't give the full picture of how genetic variations lead to complex traits or diseases. These often involve broader biological processes, like developmental and environmental factors, that are beyond the direct scope of our model.

We’re continuing to improve our models and gathering feedback to help us address these gaps.

Enabling the community to unlock AlphaGenome's potential

AlphaGenome is now available for non-commercial use via our AlphaGenome API. Please note that our model’s predictions are intended only for research use and haven’t been designed or validated for direct clinical purposes.

Researchers worldwide are invited to get in touch with potential use-cases for AlphaGenome and to ask questions or share feedback through the community forum.

We hope AlphaGenome will be an important tool for better understanding the genome and we’re committed to working alongside external experts across academia, industry, and government organizations to ensure AlphaGenome benefits as many people as possible.

Together with the collective efforts of the wider scientific community, we hope it will deepen our understanding of the complex cellular processes encoded in the DNA sequence and the effects of variants, and drive exciting new discoveries in genomics and healthcare.

Acknowledgements

We would like to thank Juanita Bawagan, Arielle Bier, Stephanie Booth, Irina Andronic, Armin Senoner, Dhavanthi Hariharan, Rob Ashley, Agata Laydon and Kathryn Tunyasuvunakool for their help with the text and figures.

This work was done thanks to the contributions of the AlphaGenome co-authors: Žiga Avsec, Natasha Latysheva, Jun Cheng, Guido Novati, Kyle R. Taylor, Tom Ward, Clare Bycroft, Lauren Nicolaisen, Eirini Arvaniti, Joshua Pan, Raina Thomas, Vincent Dutordoir, Matteo Perino, Soham De, Alexander Karollus, Adam Gayoso, Toby Sargeant, Anne Mottram, Lai Hong Wong, Pavol Drotár, Adam Kosiorek, Andrew Senior, Richard Tanburn, Taylor Applebaum, Souradeep Basu, Demis Hassabis and Pushmeet Kohli.

We would also like to thank Dhavanthi Hariharan, Charlie Taylor, Ottavia Bertolli, Yannis Assael, Alex Botev, Anna Trostanetski, Lucas Tenório, Victoria Johnston, Richard Green, Kathryn Tunyasuvunakool, Molly Beck, Uchechi Okereke, Rachael Tremlett, Sarah Chakera, Ibrahim I. Taskiran, Andreea-Alexandra Muşat, Raiyan Khan, Ren Yi and the greater Google DeepMind team for their support, help and feedback.


Comments

  • By Kalanos 2025-06-27 13:52, 2 replies

    The functional predictions related to "non-coding" variants are big here. Non-coding regions, referred to as the dark genome, produce regulatory non-coding RNAs that determine the level of gene expression in a given cell type. There are more regulatory RNAs than there are genes. Something like 75% of expression by volume is ncRNA.

    • By dekhn 2025-06-27 19:31

      There is a big long-running argument about what "functional" means in "non-coding" parts of the genome. The deeper I pushed into learning about the debate the less confident I became of my own understanding of genomics and evolution. See https://www.sciencedirect.com/science/article/pii/S096098221... for one perspective.

    • By wespiser_2018 2025-06-27 14:30

      It's possible that the "functional" aspect of non-coding RNA exists on a time scale much larger than what we can assay in a lab. The sort of "junk DNA/RNA" hypothesis: the ncRNA part of the genome is material that increases fitness during relatively rare events where it's repurposed into something else.

      On a millions or billions of year time frame, the organisms with the flexibility of ncRNA would have an advantage, but this is extremely hard to figure out with a "single point in time" view point.

      Anyway, that was the basic lesson I took from studying non-coding RNA 10 years ago. Projects like ENCODE definitely helped, but they really just exposed transcription of elements that are noisy, without providing the evidence that any of it is actually "functional". Therefore, I'm skeptical that more of the same approach will be helpful, but I'd be pleasantly surprised if wrong.

  • By b0a04gl 2025-06-27 19:31

    1mbp context makes so much sense here wow. genome's flat yeah but reg stuff's like.. all over : loops, timing, chromatin state. model needs that whole view just to even line it up right. giving it enough space to rewire what the cell's already doing. and the transformer memory just clicks here and actually fits.

  • By RivieraKid 2025-06-26 20:56, 11 replies

    I wish there were some breakthrough in cell simulation that would allow us to create simulations that are similarly useful to molecular dynamics but feasible on modern supercomputers. Not being able to see what's happening inside cells seems like the main blocker to biological research.

    • By bglazer 2025-06-27 1:52, 1 reply

      Molecular dynamics describes very short, very small dynamics, like on the scale of nanoseconds and angstroms (0.1 nm).

      What you’re describing is more like whole cell simulation. Whole cells are thousands of times larger than a protein and cellular processes can take days to finish. Cells contain millions of individual proteins.

      So that means that we just can’t simulate all the individual proteins, it’s way too costly and might permanently remain that way.

      The problem is that biology is insanely tightly coupled across scales. Cancer is the prototypical example. A single mutated letter in DNA in a single cell can cause a tumor that kills a blue whale. And it works the other way too. Big changes like changing your diet get funneled down to epigenetic molecular changes to your DNA.

      Basically, we have to at least consider molecular detail when simulating things as large as a whole cell. With machine learning tools and enough data we can learn some common patterns, but I think both physical and machine learned models are always going to smooth over interesting emergent behavior.

      Also you’re absolutely correct about not being able to “see” inside cells. But, the models can only really see as far as the data lets them. So better microscopes and sequencing methods are going to drive better models as much as (or more than) better algorithms or more GPUs.

    • By mbeavitt 2025-06-27 10:16, 1 reply

      Simulating the real world at increasingly accurate scales is not that useful, because in biology - more than any other field - our assumptions are incorrect/flawed most of the time. The most useful thing simulations allow us to do is directly test those assumptions and in these cases, the simpler the model the better. Jeremy Gunawardena wrote a great piece on this: https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007...

      • By kylehotchkiss 2025-06-27 16:33

        And the extremely difficult, expensive, and often resultless process of confirming/denying these assumptions is one of the greatest uses of tax dollars and university degrees I can think of, yet, the current admin has taken the perspective that it's all Miasma but also cut the EPA, which by their logic, would stop the Miasma

    • By andrewchoi 2025-06-26 22:12, 1 reply

      The folks at Arc are trying to build this! https://arcinstitute.org/news/virtual-cell-model-state

      • By dekhn 2025-06-26 22:58

        STATE is not a simulation. It's a trained graphical model that does property prediction as a result of a perturbation. There is no physical model of a cell.

        Personally, I think Arc's approach is more likely to produce usable scientific results in a reasonable amount of time. You would have to make a very coarse model of the cell to get any reasonable amount of sampling and you would probably spend huge amounts of time computing things which are not relevant to the properties you care about. An embedding and graphical model seems well-suited to problems like this, as long as the underlying data is representative and comprehensive.

    • By kylehotchkiss 2025-06-27 16:32

      How can you simulate what is not yet reliably known? Ugh, it's so frustrating to hear AI 'thought leaders' going on and on about this being a panacea, especially when a majority of funding for the research even needed to train models has been substantially cut so Elon could have more rocket dollars

    • By ahns 2025-06-27 2:43, 1 reply

      You may enjoy this, from a top-down experimental perspective (https://www.nikonsmallworld.com/galleries/small-world-in-mot...). Only a few entries so far show intracellular dynamics (like this one: https://www.nikonsmallworld.com/galleries/2024-small-world-i...), but I always enjoy the wide variety of dynamics some groups have been able to capture, like nervous system development (https://www.nikonsmallworld.com/galleries/2018-small-world-i...); absolutely incredible.

    • By tim333 2025-06-27 10:22, 1 reply

      It's a main aim at DeepMind. I hope they succeed as it could be very useful.

      • By RivieraKid 2025-06-27 14:43, 1 reply

        Do they specifically state that it's their main aim anywhere?

        Edit: Never mind, I've googled the answer.

        • By RivieraKid 2025-06-27 17:43

          It seems that this would be a very coarse-grained simulation of a cell, nowhere close to the usefulness of a proper molecular dynamics simulation, if I understand correctly.

    • By t_serpico 2025-06-27 0:50

      'Seeing' inside cells/tissues/organs/organisms is pretty much most modern biological research.

    • By eleveriven 2025-06-27 7:06

      What's missing feels like the equivalent of a "fast-forward" button for cell-scale dynamics

    • By j7ake 2025-06-27 3:08, 2 replies

      Why simulate? We can already do it experimentally

      • By mnw21cam 2025-06-27 13:28

        In my field, we're always wanting to see what will happen when DNA is changed in a human pancreatic beta cell. We kind of have a protocol for producing things that look like human pancreatic beta cells from human stem cells, but we're not really sure that they are really going to behave like real human pancreatic beta cells for any particular DNA change, and we have examples of cases where they definitely do not behave the same.

      • By tim333 2025-06-27 10:21

        You can't see what's going on in most cases.

    • By m3kw9 2025-06-26 21:23

      I believe this is where quantum computing comes in but could be a decade out, but AI acceleration is hard to predict

    • By noduerme 2025-06-26 22:33

      I wish there were more interest in general in building true deterministic simulations than black boxes that hallucinate and can't show their work.
