SimpleFold: Folding proteins is simpler than you think

2025-09-26 18:01 · github.com


We introduce SimpleFold, the first flow-matching-based protein folding model that uses only general-purpose transformer layers. SimpleFold does not rely on expensive modules like triangle attention or pair-representation biases, and is trained via a generative flow-matching objective. We scale SimpleFold to 3B parameters and train it on more than 8.6M distilled protein structures together with experimental PDB data. To the best of our knowledge, SimpleFold is the largest-scale folding model developed to date. On standard folding benchmarks, the SimpleFold-3B model achieves competitive performance compared to state-of-the-art baselines. Due to its generative training objective, SimpleFold also demonstrates strong performance in ensemble prediction. SimpleFold challenges the reliance on complex, domain-specific architectural designs in folding, highlighting an alternative yet important avenue of progress in protein structure prediction.

To install the simplefold package from the GitHub repository, run:

git clone https://github.com/apple/ml-simplefold.git
cd ml-simplefold
python -m pip install -U pip build; pip install -e .
pip install git+https://github.com/facebookresearch/esm.git # Optional for MLX backend

We provide a Jupyter notebook, sample.ipynb, which predicts protein structures from example protein sequences.

Once you have the simplefold package installed, you can predict protein structures from target FASTA file(s) via the following command line. Both PyTorch and MLX (recommended on Apple hardware) backends are supported for inference.

simplefold \
    --simplefold_model simplefold_100M \  # specify folding model in simplefold_100M/360M/700M/1.1B/1.6B/3B
    --num_steps 500 --tau 0.01 \        # specify inference setting
    --nsample_per_protein 1 \           # number of generated conformers per target
    --plddt \                           # output pLDDT
    --fasta_path [FASTA_PATH] \         # path to the target fasta directory or file
    --output_dir [OUTPUT_DIR] \         # path to the output directory
    --backend [mlx, torch]              # choose from MLX and PyTorch for inference backend 
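For scripted use, the same invocation can be assembled programmatically. This is a minimal sketch: the helper function and the toy FASTA record are hypothetical, and only the flag names are taken from the example above.

```python
from pathlib import Path

def build_simplefold_cmd(fasta_path, output_dir,
                         model="simplefold_100M", backend="torch",
                         num_steps=500, tau=0.01, nsample=1):
    """Assemble the simplefold CLI call shown above as an argv list.

    Hypothetical helper; flag names mirror the command-line example.
    """
    return [
        "simplefold",
        "--simplefold_model", model,
        "--num_steps", str(num_steps),
        "--tau", str(tau),
        "--nsample_per_protein", str(nsample),
        "--plddt",
        "--fasta_path", str(fasta_path),
        "--output_dir", str(output_dir),
        "--backend", backend,
    ]

# Write a toy single-record FASTA and print the command that would run on it.
fasta = Path("target.fasta")
fasta.write_text(">toy_target\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n")
print(" ".join(build_simplefold_cmd(fasta, "predictions")))
```

Once the package is installed, the resulting list can be passed directly to subprocess.run.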

We provide predicted structures from SimpleFold of different model sizes:

https://ml-site.cdn-apple.com/models/simplefold/cameo22_predictions.zip # predicted structures of CAMEO22
https://ml-site.cdn-apple.com/models/simplefold/casp14_predictions.zip  # predicted structures of CASP14
https://ml-site.cdn-apple.com/models/simplefold/apo_predictions.zip     # predicted structures of Apo
https://ml-site.cdn-apple.com/models/simplefold/codnas_predictions.zip  # predicted structures of Fold-switch (CoDNaS)

We use the Docker image of OpenStructure 2.9.1 to evaluate generated structures on the folding benchmarks (i.e., CASP14/CAMEO22). With the Docker image available, you can run evaluation via:

python src/simplefold/evaluation/analyze_folding.py \
    --data_dir [PATH_TO_TARGET_MMCIF] \
    --sample_dir [PATH_TO_PREDICTED_MMCIF] \
    --out_dir [PATH_TO_OUTPUT] \
    --max-workers [NUMBER_OF_WORKERS]

To evaluate results of two-state prediction (i.e., Apo/CoDNaS), one needs to compile TMscore and then run evaluation via:

python src/simplefold/evaluation/analyze_two_state.py \ 
    --data_dir [PATH_TO_TARGET_DATA_DIRECTORY] \
    --sample_dir [PATH_TO_PREDICTED_PDB] \
    --tm_bin [PATH_TO_TMscore_BINARY] \
    --task apo \ # choose from apo and codnas
    --nsample 5

You can also train or fine-tune SimpleFold on your end. The instructions below cover SimpleFold training.

SimpleFold is trained on joint datasets including experimental structures from PDB, as well as distilled predictions from AFDB SwissProt and AFESM. Target lists of the filtered SwissProt and AFESM targets used in our training can be found at:

https://ml-site.cdn-apple.com/models/simplefold/swissprot_list.csv # list of filtered SwissProt targets (~270K targets)
https://ml-site.cdn-apple.com/models/simplefold/afesm_list.csv # list of filtered AFESM targets (~1.9M targets)
https://ml-site.cdn-apple.com/models/simplefold/afesme_dict.json # list of filtered extended AFESM (AFESM-E) targets (~8.6M targets)

In afesme_dict.json, the data is stored in the following structure:

{
    "<cluster 1 ID>": {"members": ["<protein 1 ID>", "<protein 2 ID>", ...]},
    "<cluster 2 ID>": {"members": ["<protein 1 ID>", "<protein 2 ID>", ...]},
    ...
}
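As a sketch of consuming this file: load the dict and pick one member per cluster. Sampling a single representative per cluster is a common way to de-duplicate training targets, not necessarily the exact procedure used in SimpleFold training, and the function name is ours.

```python
import json
import random

def sample_cluster_representatives(dict_path, seed=0):
    """Pick one member ID per cluster from a JSON file shaped like
    {cluster_id: {"members": [...]}} (the structure shown above)."""
    with open(dict_path) as f:
        clusters = json.load(f)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    return {cid: rng.choice(entry["members"])
            for cid, entry in clusters.items()}
```

The resulting {cluster ID: protein ID} mapping can then be joined against the per-target structure files.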

Of course, you can use your own custom datasets to train or fine-tune SimpleFold models. The instructions below describe how to process a dataset for SimpleFold training.

To process downloaded mmCIF files, you need Redis installed. Download the CCD database and launch the Redis server:

wget https://boltz1.s3.us-east-2.amazonaws.com/ccd.rdb
redis-server --dbfilename ccd.rdb --port 7777

You can then process mmCIF files into the input format for SimpleFold:

python src/simplefold/process_mmcif.py \
    --data_dir [MMCIF_DIR] \ # directory of mmcif files
    --out_dir [OUTPUT_DIR] \ # directory of processed targets
    --use-assembly

Model configuration is based on Hydra. An example training configuration can be found in configs/experiment/train. To change dataset and model settings, refer to the config files in configs/data and configs/model. To launch SimpleFold training:

python train.py experiment=train

To train SimpleFold with FSDP strategy:

python train_fsdp.py experiment=train_fsdp

If you found this code useful, please cite the following paper:

@article{simplefold,
  title={SimpleFold: Folding Proteins is Simpler than You Think},
  author={Wang, Yuyang and Lu, Jiarui and Jaitly, Navdeep and Susskind, Josh and Bautista, Miguel Angel},
  journal={arXiv preprint arXiv:2509.18480},
  year={2025}
}

Our codebase builds on multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.

Please check out the repository LICENSE before using the provided code and LICENSE_MODEL for the released models.



Comments

  • By hashta 2025-09-26 19:32 (3 replies)

    One caveat that’s easy to miss: the "simple" model here didn’t just learn folding from raw experimental structures. Most of its training data comes from AlphaFold-style predictions: millions of protein structures that were themselves generated by big MSA-based, highly engineered models.

    It’s not like we can throw away all the inductive biases and MSA machinery; someone upstream still had to build and run those models to create the training corpus.

    • By aDyslecticCrow 2025-09-26 19:56 (1 reply)

      What I take away is the simplicity and scaling behavior. The ML field often sees an increase in module complexity to reach higher scores, and then a breakthrough where a simple model performs on par with the most complex. That such a "simple" architecture works this well on its own means we can potentially add back the complexity again to reach further. Can we add back MSA now? Where will that take us?

      My rough understanding of the field is that a "rough" generative model makes a bunch of decent guesses, and more formal "verifiers" ensure they abide by the laws of physics and geometry. The AI reduces the unfathomably large search space so the expensive simulation doesn't waste so much work on dead ends. If the guessing network improves, the whole process speeds up.

      - I'm recalling the increasingly complex transfer functions in recurrent networks,

      - The deep pre-processing chains before skip forward layers.

      - The complex normalization objectives before ReLU.

      - The convoluted multi-objective GAN networks before diffusion.

      - The complex multi-pass models before full-convolution networks.

      So basically, i'm very excited by this. Not because this itself is an optimal architecture, but precisely because it isn't!

      • By nextos 2025-09-26 20:26

        > Can we add back MSA now?

        Using MSAs might be a local optimum. ESM showed good performance on some protein problems without MSAs. MSAs offer a nice inductive bias and better average performance. However, the cost is doing poorly on proteins where MSAs are not accurate. These include B and T cell receptors, which are clinically very relevant.

        Isomorphic Labs, Oxford, MRC, and others have started the OpenBind Consortium (https://openbind.uk) to generate large-scale structure and affinity data. I believe that once more data is available, MSAs will be less relevant as model inputs. They are "too linear".

    • By mapmeld 2025-09-26 19:43

      And AlphaFold was validated with experimental observation of folded proteins using X-rays.

    • By godelski 2025-09-26 19:51 (1 reply)

      Is this so unusual? Almost everything that is simple was once considered complex. That's the thing about emergence, you have to go through all the complexities first to find the generalized and simpler formulations. It should be obvious that things in nature run off of relatively simple rulesets, but it's like looking at a Game of Life and trying to reverse engineer those rules AND the starting parameters. Anyone telling you such a task is easy is full of themselves. But then again, who seriously believes that P=NP?

      • By hashta 2025-09-26 20:41 (1 reply)

        To people outside the field, the title/abstract can make it sound like folding is just inherently simple now, but this model wouldn’t exist without the large synthetic dataset produced by the more complex AF. The "simple" architecture is still using the complex model indirectly through distillation. We didn’t really extract new tricks to design a simpler model from scratch, we shifted the complexity from the model space into the data space (think GPT-5 => GPT-5-mini, there’s no GPT-5-mini without GPT-5)

        • By stavros 2025-09-26 21:15

          But this is just a detail, right? If we went and painstakingly catalogued millions of proteins, we'd be able to use the simple model without needing a complex model to generate data, no?

  • By stephenpontes 2025-09-26 18:32 (4 replies)

    I remember first hearing about protein folding with the Folding@home project (https://foldingathome.org) back when I had a spare media server and energy was cheap (free) in my college dorm. I'm not knowledgeable on this, but have we come a long way in terms of making protein folding simpler on today's hardware, or is this only applicable to certain types of problems?

    It seems like the Folding@home project is still around!

    • By roughly 2025-09-26 19:46

      As I understand it, Folding@home was a physics-based simulation solver, whereas AlphaFold and its progeny (including this) are statistical methods. The statistical methods are much, much cheaper computationally, but rely on existing protein folds and can’t generate strong predictions for proteins that don’t have some similarities to proteins in their training set.

      In other words, it’s a different approach that trades off versatility for speed, but that trade-off is significant enough to make it viable to generate folds for nearly any protein you’re interested in. It moves folding from something that’s almost computationally infeasible for most projects to something you can just do for any protein as part of a normal workflow.

    • By _joel 2025-09-26 19:03 (2 replies)

      Yep, that and SETI@Home. I loved the eye candy, even if I didn't know what it fully meant.

    • By ge96 2025-09-26 20:38

      I contributed a lot on there too; used my 3080Ti-FE as a small heater in the winter.

    • By nkjoep 2025-09-26 18:50

      Team F@H forever!

  • By vbarrielle 2025-09-26 19:19

    A paper that says: "our approach is simpler than the state of the art". But also does not loudly say "our approach is significantly behind the state of the art on all metrics". Not easy to get published, but I guess putting it as a preprint with a big company's name will help...

HackerNews