✦ Blog

Notes from the workshop.

Benchmarks, releases, and what we learn building a fullstack scaffolder — written up with the data attached.

ScaffBench 2.2: one number, three trials, no excuses

One quality-gated pass metric, three cold trials per spec, permission to self-verify. First cohort: GPT-5.6 — and the smallest model ties the flagship.

benchmark, agents


All posts

July 15, 2026 · 20 min read

ScaffBench 2.1: GPT-5.6 picks the right libraries — and builds fewer of them than GPT-5.5

OpenAI's newest models land below GPT-5.5 on prompt-only scaffolding — until max effort. The gap is dependency hallucination, not design. Twenty configs, one board.

benchmark, claude-code, agents

June 26, 2026 · 12 min read

ScaffBench 2: nine configs, five ecosystems, and whether a model can scaffold from a prompt

ScaffBench 2 runs Opus, GPT-5.5, and free models through five hard fullstack specs to test whether prompt-only scaffolds actually build.

benchmark, claude-code, agents

June 12, 2026 · 24 min read

ScaffBench: measuring coding agents on real fullstack scaffolding

102 runs across Claude Code, Codex CLI, Gemini CLI, Kilo, and opencode, measuring speed, cost, tokens, and whether generated projects install and build.

benchmark, mcp, claude-code, agents