Autoresearch: Agents researching on single-GPU nanochat training automatically

2026-03-07 — github.com



One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026.

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet.
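The edit–train–evaluate loop described above can be sketched in a few lines. This is an illustrative sketch, not code from the repo: `run_experiment` stands in for "the agent edits train.py, trains for 5 minutes, and reports val_bpb", abstracted into a callable so the greedy keep-or-discard control flow can be shown in isolation.

```python
from typing import Callable, List

def research_loop(run_experiment: Callable[[], float], n_experiments: int) -> List[float]:
    """Greedy hill climb: keep an edit only if val_bpb (lower is better) improves."""
    best = run_experiment()          # baseline run with the unmodified train.py
    history = [best]
    for _ in range(n_experiments):
        score = run_experiment()     # one candidate edit of train.py, trained 5 min
        if score < best:
            best = score             # improvement: keep the edit
        # else: the edit is reverted (discarded) and `best` is unchanged
        history.append(best)
    return history
```

At roughly 12 experiments per hour, `n_experiments` of around 100 corresponds to one night of unattended runs.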

The repo is deliberately kept small and really only has three files that matter:

  • prepare.py — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
  • train.py — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. This file is edited and iterated on by the agent.
  • program.md — baseline instructions for one agent. Point your agent here and let it go. This file is edited and iterated on by the human.

By design, training runs for a fixed 5-minute time budget (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is val_bpb (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
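The vocab-size independence of val_bpb comes from normalizing by raw bytes rather than tokens: cross-entropy in nats per token is converted to total bits and divided by the byte count of the validation text. A minimal sketch of that conversion (the function name and signature are illustrative, not the repo's API):

```python
import math

def bits_per_byte(mean_ce_loss_nats: float, tokens: int, bytes_seen: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte.

    total nats = loss * tokens; divide by ln(2) to convert nats to bits,
    then normalize by the raw byte count so models with different
    tokenizers (and hence different token counts) are compared fairly.
    """
    total_bits = mean_ce_loss_nats * tokens / math.log(2)
    return total_bits / bytes_seen
```

For intuition: a loss of ln(2) nats/token with exactly one token per byte gives 1.0 bpb; a tokenizer that packs more bytes per token lowers bpb at the same per-token loss.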

Requirements: A single NVIDIA GPU (tested on H100), Python 3.10+, uv.

# 1. Install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Manually run a single training experiment (~5 min)
uv run train.py

If the above commands all run without errors, your setup is working and you can go into autonomous research mode.

Platform support. This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS, and other platforms, but that would also bloat the code, and I'm not 100% sure I want to take it on personally right now. The code is just a demonstration and I don't know how much I'll support it going forward. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernel fallback implementation, generic device support, autodetection, etc.). Feel free to create forks or discussions for other platforms; I'm happy to link to them here in the README in a new notable-forks section.

Simply spin up your Claude/Codex or whatever agent you want in this repo (with permission prompts disabled), then prompt something like:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

The program.md file is essentially a super lightweight "skill".

prepare.py      — constants, data prep + runtime utilities (do not modify)
train.py        — model, optimizer, training loop (agent modifies this)
program.md      — agent instructions
pyproject.toml  — dependencies
  • Single file to modify. The agent only touches train.py. This keeps the scope manageable and diffs reviewable.
  • Fixed time budget. Training always runs for exactly 5 minutes, regardless of your specific platform. This means you can expect approx 12 experiments/hour and approx 100 experiments while you sleep. There are two upsides to this design decision. First, it makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc). Second, it means that autoresearch will find the best model for your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people running on other compute platforms.
  • Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
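The "fixed time budget, excluding startup/compilation" rule above can be sketched as a wall-clock-bounded training loop. This is a hypothetical illustration, not the repo's actual loop: `step_fn` stands in for one optimizer step, and the warmup steps (compilation, cache warming) run before the clock starts.

```python
import time

def train_with_budget(step_fn, budget_s: float, warmup_steps: int = 1) -> int:
    """Run step_fn repeatedly until budget_s of wall-clock time has elapsed.

    The clock starts only after warmup_steps, so startup/compilation cost
    is excluded from the budget (as in the README's 5-minute rule).
    Returns the number of on-the-clock steps completed.
    """
    for _ in range(warmup_steps):
        step_fn()                    # compile / warm caches, not on the clock
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()
        steps += 1
    return steps
```

Budgeting wall-clock time rather than step count is what makes agent edits to model size or batch size directly comparable: a bigger model simply gets fewer steps in the same 5 minutes.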

MIT



Comments

  • By mikert89 2026-03-08 1:10 | 5 replies

    As AI improves, most tasks will become something like this. Environments set up where the model learns through trial and error.

    Any human endeavor that can be objectively verified in some environment like this can be completely automated

    • By NitpickLawyer 2026-03-08 5:49 | 1 reply

      What's really interesting is that the LLMs become better and better at setting up the environments / tasks themselves. I got this surreal experience the other day where I was writing a prompt0n.md file (I try to log all my prompts in a .folder to keep track of what I prompt and the results I get), and the autocomplete in antigravity kinda sorta wrote the entire prompt by itself... Granted it had all the previous prompts in the same folder (don't know exactly what it grabs in context by itself) and I was working on the next logical step, but it kept getting the "good bits" out of them, and following the pattern quite nicely. I only edited minor things, and refused one line completion in the entire prompt.

      • By cubefox 2026-03-08 7:01 | 1 reply

        It's probably not long till frontier AI companies automate AI research. Then we get recursive self-improvement and eventually superintelligence. The singularity is near. Only a few years perhaps.

        • By aaa_aaa 2026-03-08 7:22 | 3 replies

          Forgot the /s

          • By vidarh 2026-03-08 19:07

            I'm currently working on a project that is self-improving most of the time. Most of the plans for next steps are written by the agent itself, and executed by the agent itself, and the result feeds into choosing which plans to pursue next. It's not 100% autonomous yet, but self-improvement loops are real, and essential to getting the most out of AI.

          • By 10xDev 2026-03-08 13:46 | 1 reply

            AI currently lacks agency but if it can achieve greater goal setting and agency I can't see why self-improvement could not be achieved.

            I think the most disappointing thing will be that even if we do achieve ASI, everything will carry on as business as usual for a while before it starts making an economic impact, because of how resistant to change we have made society.

            • By Lerc 2026-03-08 14:06

              This is something that I have been wondering about. SuperIntelligence or not, it's clear that significant change is going to happen.

              There are a lot of people working on the cause of the change. There are a lot of people criticising the nature of the change. There are a lot of people rejecting the change.

              How many are there preparing the world for the change?

              Some form of change is coming, how are we preparing society to deal with what is happening?

              Job losses due to technology have happened over and over again, rendering particular forms of employment redundant (typing pools, clearing horse manure, video rental store workers, and of course, the loom). Most agree that the world is better off without those being jobs that need to be done. It's the livelihood of the workers that is the concern.

              Instead of fighting the change, we need to address the inevitability of change and the responsibility to those whom it will affect.

          • By cubefox 2026-03-08 7:30

            Short for /superintelligence.

    • By jononor 2026-03-12 20:57

      Many "subjective" tasks can also be done in an "objective" manner - as long as there is a large enough dataset to estimate how humans would evaluate the outputs, and the evaluators are reasonably consistent. Many human preferences are relatively homogeneous, or sometimes clustered into groups. And there are whole fields of study/practice of such phenomena, such as sensory science - with applications in food, audio, images etc.

    • By miki123211 2026-03-08 9:01 | 1 reply

      So much this.

      People make fun of prompt engineering, but I think "AI ops" will eventually become a real role at most if not all software companies. Harness Engineers and Agent Reliability Engineers will be just as important as something like DevOps is now.

      • By 10xDev 2026-03-08 13:42 | 1 reply

        Prompt engineering is already dying. AI has become great at inferring what you mean even without being incredibly explicit and creates its own detailed plan to follow. Harnesses will also be developed by AI.

    • By vrighter 2026-03-10 18:20

      it's called reinforcement learning

    • By wiz21c 2026-03-08 9:00 | 1 reply

      don't forget the size of the search space...

      • By mikert89 2026-03-08 13:57 | 1 reply

        this is why big tech is spending 500B on GPUs

        • By vrighter 2026-03-10 18:22

          And they don't even have the datacenters to plug them into, nor the power generation needed to run them if they did.

  • By thesz 2026-03-08 17:31

    This looks very much like a whirlpool. An LLM researcher makes LLMs that research LLMs. The quote from an old post by Karpathy [1] looks very appropriate here:

    [1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

      "In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say:
        “is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same”
      looks like we’ve reached an infinite loop about startups."
    
    As if Karpathy made an artificial Karpathy-researcher-blogger and set temperature close to zero.

  • By daxfohl 2026-03-08 21:59 | 1 reply

    Once this can run on stock hardware, set the goal to be replicating to other machines. You get a nice, massively parallel, intelligent guided evolution algorithm for malware. It could even "learn" how to evade detection, how to combine approaches of existing viruses, how to research attack methods, how to identify and exploit vulnerabilities in open source libraries, how to phish, how to blackmail, etc. Maybe even learns how to coordinate attacks with other instances of itself or "publish" new attacks on some encrypted feed it creates. Who knows, maybe it becomes so rampant that instances have to start fighting each other for compute resources. Or maybe eventually one branch becomes symbiotic with humans to fight off their enemies, etc.

    • By jononor 2026-03-12 20:46

      Number of machines under control is a measurable target. Quite suited for this concept, at least in theory.

HackerNews