Pi-Autoresearch

2026-03-138:4840github.com

Autonomous experiment loop extension for pi. Contribute to davebcn87/pi-autoresearch development by creating an account on GitHub.

Install Β· Usage Β· How it works

Try an idea, measure it, keep what works, discard what doesn't, repeat forever.

Inspired by karpathy/autoresearch. Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores.

pi-autoresearch dashboard

Extension Tools + live widget + /autoresearch dashboard
Skill Gathers what to optimize, writes session files, starts the loop
Tool Description
init_experiment One-time session config β€” name, metric, unit, direction
run_experiment Runs any command, times wall-clock duration, captures output
log_experiment Records result, auto-commits, updates widget and dashboard
  • Status widget β€” always visible above the editor: πŸ”¬ autoresearch 12 runs 8 kept β”‚ best: 42.3s
  • /autoresearch β€” full results dashboard (Ctrl+X to toggle, Escape to close)

autoresearch-create asks a few questions (or infers from context) about your goal, command, metric, and files in scope β€” then writes two files and starts the loop immediately:

File Purpose
autoresearch.md Session document β€” objective, metrics, files in scope, what's been tried. A fresh agent can resume from this alone.
autoresearch.sh Benchmark script β€” pre-checks, runs the workload, outputs METRIC name=number lines.
autoresearch.checks.sh (optional) Backpressure checks β€” tests, types, lint. Runs after each passing benchmark. Failures block keep.
pi install https://github.com/davebcn87/pi-autoresearch
Manual install
cp -r extensions/pi-autoresearch ~/.pi/agent/extensions/
cp -r skills/autoresearch-create ~/.pi/agent/skills/

Then /reload in pi.

/skill:autoresearch-create

The agent asks about your goal, command, metric, and files in scope β€” or infers them from context. It then creates a branch, writes autoresearch.md and autoresearch.sh, runs the baseline, and starts looping immediately.

The agent runs autonomously: edit β†’ commit β†’ run_experiment β†’ log_experiment β†’ keep or revert β†’ repeat. It never stops unless interrupted.

Every result is appended to autoresearch.jsonl in your project β€” one line per run. This means:

  • Survives restarts β€” the agent can resume a session by reading the file
  • Survives context resets β€” autoresearch.md captures what's been tried so a fresh agent has full context
  • Human readable β€” open it anytime to see the full history
  • Branch-aware β€” each branch has its own session
  • Widget β€” always visible above the editor
  • /autoresearch β€” full dashboard with results table and best run
  • Escape β€” interrupt anytime and ask for a summary
Domain Metric Command
Test speed seconds ↓ pnpm test
Bundle size KB ↓ pnpm build && du -sb dist
LLM training val_bpb ↓ uv run train.py
Build speed seconds ↓ pnpm build
Lighthouse perf score ↑ lighthouse http://localhost:3000 --output=json

The extension is domain-agnostic infrastructure. The skill encodes domain knowledge. This separation means one extension serves unlimited domains.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Extension (global)  β”‚     β”‚  Skill (per-domain)       β”‚
β”‚                      β”‚     β”‚                           β”‚
β”‚  run_experiment      │◄────│  command: pnpm test       β”‚
β”‚  log_experiment      β”‚     β”‚  metric: seconds (lower)  β”‚
β”‚  widget + dashboard  β”‚     β”‚  scope: vitest configs    β”‚
β”‚                      β”‚     β”‚  ideas: pool, parallel…   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two files keep the session alive across restarts and context resets:

autoresearch.jsonl   β€” append-only log of every run (metric, status, commit, description)
autoresearch.md      β€” living document: objective, what's been tried, dead ends, key wins

A fresh agent with no memory can read these two files and continue exactly where the previous session left off.

Create autoresearch.checks.sh to run correctness checks (tests, types, lint) after every passing benchmark. This ensures optimizations don't break things.

#!/bin/bash
set -euo pipefail
pnpm test --run
pnpm typecheck

How it works:

  • If the file doesn't exist, everything behaves exactly as before β€” no changes to the loop.
  • If it exists, it runs automatically after every benchmark that exits 0.
  • Checks execution time does not affect the primary metric.
  • If checks fail, the experiment is logged as checks_failed (same behavior as a crash β€” no commit, revert changes).
  • The checks_failed status is shown separately in the dashboard so you can distinguish correctness failures from benchmark crashes.
  • Checks have a separate timeout (default 300s, configurable via checks_timeout_seconds in run_experiment).

MIT


Read the original article

Comments

HackerNews