PARAMETER GOLF

OpenAI's official efficiency challenge: train the best language model that fits in 16 MB and trains in under 10 minutes on 8xH100. My run hit 1.1194 bits/byte, 0.105 bits/byte below the 1.2244 baseline, by stacking the quantization and optimizer tricks that actually move the needle.

“When the budget is fixed at 16 megabytes, every bit you spend has to earn its place.”

The Problem

Most language-model work assumes scale is free: more parameters, more compute, more data. The Parameter Golf challenge inverts that. You get a hard 16 MB artifact cap and a 10-minute training window on 8xH100, then you race to the lowest bits/byte (bpb) on held-out FineWeb. There's nowhere to hide: a single wasted layer, a sloppy quantization scheme, or a slow-converging optimizer blows the entire budget. The baseline sat at 1.2244 bpb, and closing that gap meant questioning every default the field takes for granted.
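To make the two constraints concrete, here is a minimal sketch of the arithmetic behind them. It assumes the metric is the summed cross-entropy (in nats) converted to bits and divided by the byte count of the evaluated text, and that 16 MB means 16,000,000 bytes spent entirely on weights. The function names are illustrative, and the challenge's official scoring and size accounting may differ.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    return total_loss_nats / math.log(2) / total_bytes


def max_params(budget_bytes: int, bits_per_weight: float) -> int:
    """Upper bound on the weight count if the whole budget stores weights.

    Assumption: ignores quantization scales, code size, and compression.
    """
    return int(budget_bytes * 8 // bits_per_weight)


if __name__ == "__main__":
    # Sanity check: a model that is uniform over all 256 byte values,
    # predicting one byte per token, scores 8 bits per byte.
    print(bits_per_byte(1000 * math.log(256), 1000))
    # How many weights fit at each precision under the assumed 16,000,000 bytes.
    for bits in (8, 6, 5):
        print(bits, max_params(16_000_000, bits))
```

Under these assumptions, dropping from 8 bits to 6 bits per weight buys roughly a third more parameters inside the same cap, which is the trade the rest of the design leans on.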

The Approach

I treated the model as a compression problem under constraint. Quantization-Aware Training in mixed int5/int6/int8 precision let me trade precision for parameters where the loss curve allowed it, dropping to 1-bit and ternary weights in the layers that tolerated it. BigramHash embeddings collapsed the vocabulary footprint; partial RoPE and cross-layer attention recovered representational capacity the quantization gave up. A LeakyReLU-squared activation plus LoRA test-time training squeezed out the final gains, and Parallel Muon kept the optimizer converging fast enough to land inside the 10-minute wall. EMA/SWA weight averaging stabilized the endgame.

Key Insight

The leaderboard rewards engineering judgment, not raw scale. The winning configuration (LeakyReLU squared, test-time training, and Parallel Muon) landed at 1.1194 bpb because each technique targeted a specific bottleneck the scaling laws predicted: the Muon optimizer for fast convergence inside the time wall, QAT for the size wall, and test-time adaptation for the distribution shift at eval. Efficiency isn't a smaller version of scale; it's a different optimization problem with its own frontier.

Artifact Budget: 16 MB
Bits / Byte: 1.1194
Training Hardware: 8xH100
Train Time Limit: < 10 min

How it works

The architecture behind the system.

Quantization-Aware Training

Mixed int5/int6/int8 precision trained end-to-end so the model learns to live inside its quantized weights, with 1-bit and ternary quantization in the layers that tolerate it. The core lever for staying under 16 MB.
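The forward half of that idea can be sketched in a few lines. This is a framework-free illustration, not the project's code: it assumes symmetric per-tensor scaling for the integer formats and a mean-absolute-value scale for the ternary case, and both function names are hypothetical. In actual QAT the rounding step is usually made trainable with a straight-through estimator, which treats it as the identity in the backward pass.

```python
def fake_quantize(weights, bits):
    """Symmetric per-tensor quantize-then-dequantize onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 31 for int6, 15 for int5
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Integer codes are what would be stored in the artifact.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    # Dequantized values are what the forward pass sees during training.
    dequantized = [c * scale for c in codes]
    return dequantized, codes, scale


def ternary_quantize(weights):
    """Map weights to {-1, 0, +1} times a shared scale (assumed: mean |w|)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return [c * scale for c in codes], codes, scale


if __name__ == "__main__":
    w = [0.42, -0.17, 0.03, -0.88, 0.51]
    for bits in (8, 6, 5):
        deq, codes, scale = fake_quantize(w, bits)
        worst = max(abs(a - b) for a, b in zip(w, deq))
        print(f"int{bits}: codes={codes} worst_error={worst:.4f}")
    print("ternary:", ternary_quantize(w)[1])
```

Training against the dequantized values is what lets the network settle into weights that survive rounding, instead of being rounded after the fact.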

Parallel Muon Optimizer

The Muon optimizer, parallelized to converge fast enough to land inside the 10-minute training wall on 8xH100. Faster descent per step means more effective training inside a fixed compute budget.
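Muon's defining step is replacing the momentum matrix with an approximately orthogonal one before applying it. The sketch below shows that step for a single weight matrix in plain Python. It uses the classic cubic Newton-Schulz iteration rather than the tuned polynomial that Muon implementations typically use, it says nothing about how the work is parallelized across GPUs, and the helper names, learning rate, and momentum value are all illustrative assumptions.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the nearest (semi-)orthogonal matrix to `g`."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X pushes each singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rb)] for rx, rb in zip(x, xxt_x)]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One heavy-ball momentum update followed by an orthogonalized step."""
    momentum = [[beta * m + g for m, g in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    update = newton_schulz_orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return weight, momentum
```

Because the orthogonalized update has all singular values near 1, every direction in the matrix moves at a comparable rate, which is one common explanation for Muon's fast early progress.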

BigramHash Embeddings

Hash-based bigram embeddings collapse the vocabulary footprint without an explicit per-bigram embedding table, reclaiming megabytes that would otherwise be spent on parameters that barely move the loss.
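One way to picture the idea, as a sketch under stated assumptions: each (previous token, current token) pair is hashed into a fixed number of buckets, and the bucket's vector is added to an ordinary token embedding. The class name, the hash constants, the additive combination, and the use of token id 0 as the left pad are all hypothetical choices for illustration, not details taken from the project.

```python
import random


class BigramHashEmbedding:
    """Token embedding plus a hashed bigram embedding (illustrative sketch)."""

    def __init__(self, vocab_size, dim, num_buckets, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.dim = dim
        self.num_buckets = num_buckets
        self.token_table = [[rng.gauss(0, 0.02) for _ in range(dim)]
                            for _ in range(vocab_size)]
        self.bigram_table = [[rng.gauss(0, 0.02) for _ in range(dim)]
                             for _ in range(num_buckets)]

    def bucket(self, prev_token, token):
        # Hypothetical hash: mix both ids with arbitrary constants, then
        # fold into the bucket range. Collisions are expected and tolerated.
        return ((prev_token * 1_000_003) ^ (token * 998_244_353)) % self.num_buckets

    def embed(self, tokens):
        out, prev = [], 0  # assumption: the first position pairs with id 0
        for tok in tokens:
            unigram = self.token_table[tok]
            bigram = self.bigram_table[self.bucket(prev, tok)]
            out.append([u + b for u, b in zip(unigram, bigram)])
            prev = tok
        return out

    def num_parameters(self):
        return (self.vocab_size + self.num_buckets) * self.dim

    def full_bigram_parameters(self):
        # Size of an explicit table with one row per possible bigram.
        return self.vocab_size * self.vocab_size * self.dim
```

With a 1,024-token vocabulary, 8 dimensions, and 4,096 buckets, this sketch holds 40,960 parameters where an explicit bigram table would need 8,388,608.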

Partial RoPE + Cross-Layer Attention

Partial rotary position encoding and cross-layer attention recover representational capacity surrendered to aggressive quantization, keeping accuracy high while the parameter count stays tiny.
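The "partial" in partial RoPE means only some of each head's dimensions are rotated by position, while the rest pass through unchanged. A minimal sketch for a single query or key vector follows. It assumes interleaved dimension pairs and the conventional base of 10000, and the function name is illustrative. Cross-layer attention is not shown.

```python
import math


def partial_rope(vec, position, rotary_dims, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec` by position; keep the rest."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Each pair (i, i + 1) gets its own frequency, as in standard RoPE.
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    # Rotate 4 of the 6 dimensions; the last two are position-independent.
    print(partial_rope(q, position=7, rotary_dims=4))
```

The rotated dimensions give attention scores that depend only on the relative offset between query and key, while the untouched dimensions stay free to carry content that should not vary with position.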

Test-Time Training

LoRA-based test-time training adapts the model at evaluation, closing distribution shift on held-out FineWeb. Paired with a LeakyReLU-squared activation in the network itself, it is part of the 1.1194 bpb winning configuration.
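The LoRA half of this can be sketched as a frozen linear layer with a small trainable low-rank correction. The version below is a plain-Python illustration with hypothetical names. It assumes the usual LoRA conventions (a zero-initialized up-projection and an alpha / rank scale) and leaves out the test-time update loop, whose details are not described here.

```python
import random


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.out_dim, self.in_dim = len(weight), len(weight[0])
        self.rank = rank
        self.weight = weight  # frozen: never updated at test time
        # A (rank x in) starts small and random; B (out x rank) starts at zero,
        # so the adapted layer initially matches the base layer exactly.
        self.a = [[rng.gauss(0, 0.01) for _ in range(self.in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(self.out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_parameters(self):
        return self.rank * (self.in_dim + self.out_dim)
```

Only A and B would receive gradient updates during test-time training, so the adaptation touches rank * (in + out) values per layer instead of the full in * out weight matrix.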

EMA / SWA Averaging

Exponential moving average and stochastic weight averaging stabilize the final weights so the endgame converges smoothly, turning a noisy 10-minute run into a reproducible leaderboard score.
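Both averaging schemes reduce to a one-line update over the weights. The sketch below treats the weights as a flat list of floats; the decay value and the names are illustrative assumptions, and how the two averages are combined or scheduled in the actual run is not specified here.

```python
def ema_update(ema, weights, decay=0.999):
    """Exponential moving average: recent weights count more than old ones."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWA:
    """Stochastic weight averaging: an equal-weight running mean of snapshots."""

    def __init__(self):
        self.mean = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean avoids storing every snapshot.
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]
        return self.mean


if __name__ == "__main__":
    swa = SWA()
    ema = [0.0]
    for snapshot in ([1.0], [2.0], [3.0]):
        ema = ema_update(ema, snapshot, decay=0.9)
        print(swa.update(snapshot), ema)
```

EMA is typically updated every step, while SWA averages a handful of late checkpoints; either way the evaluated weights are the average rather than the last noisy iterate.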

Built with

PyTorch, MLX, Quantization-Aware Training, Muon Optimizer, Neural Scaling Laws, LoRA Test-Time Training, FineWeb, Partial RoPE, Cross-Layer Attention, EMA / SWA

See the code

Full source code available. See exactly how it's built.

View on GitHub