Project 02 · Case study

Parameter Golf Autoresearch

OpenAI Model Craft Challenge: built an autonomous research agent that runs hyperparameter sweeps, scores candidates against the frontier, and decides what to test next. RTX 4060 for iteration, 8xH100 via RunPod for finals.

Python · PyTorch · CUDA · RunPod · H100 · RTX 4060 · Hyperparameter Optimization · Small LM Training

Case study

The problem

OpenAI's Model Craft Challenge: fit the best language model in 16MB, train it in 10 minutes on 8xH100s, scored by bits-per-byte on FineWeb. The frontier moves constantly as participants submit improvements. Manual experimentation can't keep up.
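For reference, bits-per-byte can be derived from an ordinary cross-entropy loss. The sketch below shows the standard conversion from a mean per-token loss in nats; the function name and arguments are illustrative, and the challenge's own evaluation script may differ in details such as tokenization and byte counting.

```python
import math


def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean per-token cross-entropy (in nats) to bits per byte.

    Hypothetical helper: n_bytes is assumed to be the UTF-8 byte length of
    the same text that produced the n_tokens tokens.
    """
    if n_tokens <= 0 or n_bytes <= 0:
        raise ValueError("n_tokens and n_bytes must be positive")
    # Total nats over the text, converted to bits by dividing by ln(2).
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

Because the score is normalized by bytes rather than tokens, models with different tokenizers can be compared on the same scale.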

The approach

Built an autonomous research agent. Runs hyperparameter sweeps locally on an RTX 4060 for cheap iteration. Scores each candidate technique against the running PR-level frontier (#338, #398, #413). Maintains a battle plan with confirmed BPB deltas per technique. Promotes winning configs to 8xH100 via RunPod for full training. Techniques explored: Value Residual, Gated Attention, Partial RoPE, EMA, aggressive TTT, BigramHash, SmearGate, Int6 QAT.
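A minimal sketch of what a battle plan with confirmed deltas could look like. The class names, the "all runs agree on sign" confirmation rule, and the assumption that deltas add up across techniques are all illustrative choices, not the agent's actual logic.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class TechniqueRecord:
    """Measured BPB changes for one technique (negative means better)."""
    name: str
    deltas: List[float] = field(default_factory=list)

    def confirmed_delta(self, min_runs: int = 2) -> Optional[float]:
        # Assumed rule: confirmed once min_runs runs exist and all agree on sign.
        if len(self.deltas) < min_runs:
            return None
        same_sign = all(d < 0 for d in self.deltas) or all(d > 0 for d in self.deltas)
        return mean(self.deltas) if same_sign else None


@dataclass
class BattlePlan:
    """Hypothetical container tracking the frontier and per-technique deltas."""
    frontier_bpb: float
    baseline_bpb: float
    records: Dict[str, TechniqueRecord] = field(default_factory=dict)

    def log_run(self, technique: str, delta: float) -> None:
        rec = self.records.setdefault(technique, TechniqueRecord(technique))
        rec.deltas.append(delta)

    def should_promote(self, techniques: List[str]) -> bool:
        # Simplifying assumption: confirmed deltas are additive across techniques.
        total = 0.0
        for name in techniques:
            rec = self.records.get(name)
            delta = rec.confirmed_delta() if rec else None
            if delta is None:
                return False  # never promote on unconfirmed evidence
            total += delta
        return self.baseline_bpb + total < self.frontier_bpb
```

In practice technique interactions are rarely additive, so a stack that passes this check would still need a combined local run before spending 8xH100 time.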

What worked

The autoresearch loop itself is the output. The agent watches the frontier shift and decides what to test next, avoiding wasted compute on techniques that the latest leaderboard entries have already made obsolete.
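One way the "what to test next" decision could be sketched, assuming "obsolete" means a frontier entry already includes or supersedes the technique. The function name, the expected-gain priors, and the selection rule are hypothetical.

```python
from typing import Dict, Optional, Set


def pick_next(
    expected_gain: Dict[str, float],
    frontier_techniques: Set[str],
    already_tested: Set[str],
) -> Optional[str]:
    """Return the untested technique with the largest expected BPB gain.

    expected_gain maps technique name to a prior guess of BPB improvement
    (larger is better). Anything the frontier already uses is skipped.
    """
    open_items = {
        name: gain
        for name, gain in expected_gain.items()
        if name not in frontier_techniques and name not in already_tested
    }
    if not open_items:
        return None  # nothing left worth testing
    return max(open_items, key=open_items.get)
```

Re-running this selection each time the frontier set changes is what keeps the queue from filling with experiments a newer submission has already settled.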

What I'd do differently

Local proxy scoring on an RTX 4060 doesn't perfectly predict 8xH100 training dynamics. The gap between local sweep results and full-scale runs was the hardest calibration problem.

More detail

OpenAI's challenge: train the best language model that fits in 16MB and finishes in 10 minutes on 8xH100s, scored by bits-per-byte on FineWeb. I built an autonomous research agent that runs hyperparameter sweeps locally on an RTX 4060 (cheap iteration), scores each candidate technique against the running PR-level frontier, and maintains a battle plan with confirmed BPB deltas per technique: Value Residual, Gated Attention, Partial RoPE, EMA, aggressive TTT, BigramHash, SmearGate, Int6 QAT, and others. Winning configs are promoted to 8xH100 via RunPod for full training. The value here is the autoresearch loop itself: an agent that watches the frontier shift and decides what to test next.