Karpathy’s autoresearch: AI That Does Its Own Experiments While You Sleep

Jonathan Alonso · March 15, 2026

I’ve been watching Andrej Karpathy’s GitHub for a while now, but his latest drop stopped me cold. It’s called autoresearch, and the concept is simple enough to explain in one sentence — and mind-bending enough to sit with for days.

Give an AI agent a real machine learning training setup. Let it run experiments overnight. Wake up to a better model.

What autoresearch actually does

The way it works: you write a Markdown file called program.md describing what you want to research. That’s it. You’re done. The agent takes over from there.

While you sleep, it autonomously edits the training code, runs 5-minute GPU experiments — roughly 100 of them overnight — checks whether the model improved, keeps what works, discards what doesn’t, and keeps going. You wake up to a log of 100 experiments and (hopefully) a meaningfully better model than the one you left it with.
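To make that loop concrete, here's a toy sketch of the keep-what-works pattern. The real agent edits actual PyTorch training code; this stand-in "experiment" just perturbs a single hyperparameter and scores it, so everything here is illustrative rather than autoresearch's actual code:

```python
import random

def run_experiment(lr: float) -> float:
    """Toy stand-in for a 5-minute training run: lower loss is better.
    A real loop would launch actual PyTorch training here and read back
    a validation metric."""
    return (lr - 0.003) ** 2  # pretend the ideal learning rate is 0.003

def overnight_loop(n_experiments: int = 100, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    best_lr = 0.01                       # the config you left it with
    best_loss = run_experiment(best_lr)
    for _ in range(n_experiments):
        candidate = best_lr * rng.uniform(0.5, 2.0)  # propose a tweak
        loss = run_experiment(candidate)
        if loss < best_loss:             # keep what works, discard the rest
            best_lr, best_loss = candidate, loss
    return best_lr, best_loss

lr, loss = overnight_loop()
print(f"best lr after 100 experiments: {lr:.5f} (loss {loss:.2e})")
```

The interesting part isn't the search itself, which is trivial here; it's that the agent also gets to rewrite `run_experiment` between iterations.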

This isn’t a simulation of research. It’s not a chatbot pretending to do science. It’s an AI agent writing and running real PyTorch code, measuring real performance metrics, and iterating in a real feedback loop. The only human input is that Markdown file at the start.

The part that caught my attention: program.md

Here’s what I find genuinely interesting about the architecture: the entire research direction lives in a single Markdown file.

Karpathy describes it this way — you’re not touching the Python files like a researcher normally would. Instead, you’re programming the program. The program.md file is your interface to the autonomous research org.

If you’ve been following how I run things with AI agents — using structured Markdown files to define identity, behavior rules, and specialized skills — you already understand this pattern intuitively. Karpathy independently landed on the same idea: structured natural language as the highest-level control surface for autonomous systems. That convergence means something.

The more sophisticated your Markdown instructions, the better your research org performs. The human’s job shifts from writing code to writing better context.
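As a hypothetical illustration of the pattern (sketched by me, not copied from the repo), a program.md along these lines is what "programming the program" looks like:

```markdown
# Research program

## Goal
Reduce validation loss on the nanoGPT-style training run in train.py.

## Constraints
- Each experiment must finish in ~5 minutes on one GPU.
- Only edit the training script; never touch the evaluation harness.

## Ideas worth trying
- Learning-rate schedule variants (warmup length, cosine vs linear decay).
- Small architecture tweaks (head count, MLP width).

## Success metric
Final validation loss, averaged over the last 50 steps.
```

Every section here is context, not code: goals, guardrails, hypotheses, and a metric the agent can optimize against.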

Where he’s taking it next

What Karpathy posted on X after the launch is where this gets really big. He’s not thinking about a single agent on a single GPU. He’s thinking about something like SETI@home — but for AI research.

The goal isn’t to emulate one PhD student running experiments. It’s to emulate an entire research community of them — asynchronous, massively parallel, running across distributed compute. Hundreds of agents on hundreds of machines, each following their own branch of experiments, with the best results propagating across the network.

He’s already thinking in terms of generations. The README opens with a satirical line about the agents claiming to be in the “10,205th generation of the codebase.” It reads more like a roadmap than a joke.

Why this matters beyond ML research

You might be thinking: this is for AI researchers with H100s. What does it have to do with me?

The honest answer is: not much practically, right now. The repo requires a single NVIDIA GPU (tested on H100s), and the compute requirement puts direct use out of reach for most people.

But the pattern it demonstrates is going to show up everywhere, fast:

  • Autonomous experimentation loops — agents that don’t just execute tasks but test hypotheses, measure results, and iterate without hand-holding
  • Human-as-context-writer — your job becomes writing better instructions, not doing the work yourself
  • Overnight compounding — systems that make measurable progress while you’re not at your desk

That last one is the one I keep coming back to. The competitive advantage in the next few years isn’t going to be who has the best AI tools. It’s going to be who has the best loops — systems that improve themselves between sessions and compound their gains overnight.

The repo

If you want to dig in: github.com/karpathy/autoresearch. It’s deliberately kept small — three files that actually matter. Worth reading even if you never run it, just to understand the architecture.

This is one of those repos I’ll be watching closely. The initial version is a proof of concept. What it points toward is something much larger.

Jonathan Alonso

Digital Marketing Strategist

Seasoned digital marketing leader with 20+ years of experience in SEO, PPC, and digital strategy. MBA graduate, Marketing Manager at Crunchy Tech, CMO at YellowJack Media, and freelance SEO consultant based in Orlando, FL. When I'm not optimizing campaigns or exploring AI, you'll find me on adventures with my wife Kristy, studying the Bible, or hanging out with our Jack Russell, Nikki.