ScaffBench: measuring coding agents on real fullstack scaffolding

72 runs across Claude Code and Codex CLI — six models, three creation paths, four project specs. We measured wall-clock time, output tokens, dollar cost — and whether the generated project actually installs and builds.

June 12, 2026 · Better-Fullstack Team

benchmark · mcp · claude-code · agents

Coding agents are very good at writing code and surprisingly bad at starting projects. Ask one to scaffold a production-grade fullstack monorepo and it will happily spend ten minutes hand-writing manifests, lockfiles, and config — and the result frequently doesn't build.

ScaffBench measures exactly that: an agent in an empty workspace, a project spec, and a hard question at the end — does the generated project install and build? We ran the same specs through three creation paths to isolate how much tooling helps:

  • MCP — the agent uses the Better-Fullstack MCP server to plan and scaffold the project.
  • BF mention — no MCP; the prompt points the agent at the Better-Fullstack CLI and docs, and it composes the create command itself.
  • Prompt — no Better-Fullstack at all; the agent hand-writes every file from scratch.

We ran the full suite twice, on two different agent CLIs: OpenAI Codex CLI with three GPT models (June 10), and Claude Code with three Claude models (June 12) — 72 runs in total. Same specs, same prompts, same validation.

Headline results

The headline numbers come from the Claude Code sweep — 36 runs: three models (Fable 5, Opus 4.8, Sonnet 4.6) × three creation paths × four project specs. Same machine, same harness, same prompts apart from the creation-mode instructions. The GPT sweep shows the same structure and is reported separately below.

| Creation path | Avg time | Median time | Avg output tokens | Builds passing |
| --- | --- | --- | --- | --- |
| MCP | 113.4s | 85.9s | 5,553 | 12/12 (100%) |
| BF mention | 219.6s | 172.1s | 11,060 | 9/9 (100%)* |
| Prompt | 516.2s | 424.0s | 25,859 | 9/12 (75%) |

* Three BF-mention runs are excluded from the denominator because they failed on a template generator bug on our side, not on anything the agent did — see the scoring policy.

Three things stood out:

  1. MCP is ~4.6× faster than prompt-only on average (and ~4.9× on medians), with ~4.7× fewer output tokens. The agent ships configuration, not code.
  2. Every MCP run passed validation, for every model. Prompt-only runs passed 75% — two runs hit the 15-minute timeout and one produced a monorepo that didn't build.
  3. Tooling compresses model differences. On the MCP path even the smallest model matched the frontier models at 100% pass rate — it just got there faster and cheaper.
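
The ratios in point 1 follow directly from the headline table. A minimal sketch of that arithmetic, with the table's values typed in as literals (the variable and function names are illustrative, not part of the harness):

```python
# Headline values from the Claude Code sweep table above.
avg_time_s = {"MCP": 113.4, "BF mention": 219.6, "Prompt": 516.2}
median_time_s = {"MCP": 85.9, "BF mention": 172.1, "Prompt": 424.0}
avg_output_tokens = {"MCP": 5553, "BF mention": 11060, "Prompt": 25859}


def ratio(table: dict, slow: str = "Prompt", fast: str = "MCP") -> float:
    """How many times larger the `slow` path's value is than the `fast` path's."""
    return table[slow] / table[fast]


speedup_avg = ratio(avg_time_s)         # roughly 4.6x
speedup_median = ratio(median_time_s)   # roughly 4.9x
token_ratio = ratio(avg_output_tokens)  # roughly 4.7x

print(f"{speedup_avg:.1f}x {speedup_median:.1f}x {token_ratio:.1f}x")
```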

Methodology

The three creation paths

Every run starts in an empty directory with the same base prompt:

    You are running in an empty benchmark workspace:
    {run_dir}

    Create exactly one project directory named `{project_name}`.
    Do not ask questions. Do not start a dev server. Do not write outside the
    current working directory.
    At the end, report the commands you ran and any errors you hit.

    Benchmark target: {spec_title}
    Requirements:
    {spec_requirements}

The only difference between paths is the creation-mode instruction appended to that prompt. Prompt-only runs are explicitly forbidden from touching anything of ours:

    Creation mode: prompt-only.
    Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,
    Better-Fullstack website, or files from the Better-Fullstack repository.
    Create the project from scratch by writing the files and manifests needed
    for a runnable starter.

MCP runs get the opposite:

    Creation mode: Better-Fullstack MCP.
    Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use
    schema/compatibility/plan as needed and call bfs_create_project to create
    the project.

And BF-mention runs sit in between — no MCP tools, but the prompt names the CLI and a README the agent may read before composing a non-interactive create command itself.
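
To make the composition concrete, here is a rough sketch of how a base prompt and a creation-mode instruction could be joined. The template and the two mode instructions are copied from the quotes above; the function name, the dictionary, and the way the pieces are joined are assumptions for illustration, not the harness's actual code. The BF-mention instruction is not quoted in full in this post, so the sketch leaves it for the caller to supply.

```python
BASE_PROMPT = """You are running in an empty benchmark workspace:
{run_dir}

Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.

Benchmark target: {spec_title}
Requirements:
{spec_requirements}
"""

# Only the two instructions quoted in this post; the keys are illustrative.
MODE_INSTRUCTIONS = {
    "prompt": """Creation mode: prompt-only.
Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,
Better-Fullstack website, or files from the Better-Fullstack repository.
Create the project from scratch by writing the files and manifests needed
for a runnable starter.""",
    "mcp": """Creation mode: Better-Fullstack MCP.
Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use
schema/compatibility/plan as needed and call bfs_create_project to create
the project.""",
}


def build_prompt(run_dir: str, project_name: str, spec_title: str,
                 spec_requirements: str, mode_instruction: str) -> str:
    """Fill the shared template, then append the creation-mode instruction."""
    base = BASE_PROMPT.format(
        run_dir=run_dir,
        project_name=project_name,
        spec_title=spec_title,
        spec_requirements=spec_requirements,
    )
    return f"{base}\n{mode_instruction}\n"


example = build_prompt("/tmp/run-1", "demo-app", "light-ts",
                       "React + Vite", MODE_INSTRUCTIONS["mcp"])
```

Because everything except the final block is shared, any difference between paths on the same spec can be attributed to the creation mode rather than to prompt wording.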

The four project specs

The specs are deliberately spread from "weekend project" to "the stack a real team would argue about for a week":

| Spec | What it asks for |
| --- | --- |
| light-ts | React + Vite, Hono on Bun, tRPC, SQLite + Drizzle, Tailwind + DaisyUI, Pino, Vitest, Biome; no auth, no payments |
| heavy-ts | Next.js and an Expo/React Native app, Hono on Bun, oRPC, PostgreSQL + Drizzle, Better Auth, Stripe, Resend, UploadThing, Effect, TanStack Store/Form/Query, Valibot, Vitest + Playwright, Vercel AI SDK, Socket.IO, Inngest, Framer Motion, OpenTelemetry, PostHog, Umami, Sanity, Upstash Redis, Algolia, S3, shadcn/ui, Turborepo, MSW, Storybook, PWA, Docker compose |
| python-ai | FastAPI, SQLModel, PostgreSQL, Pydantic, JWT auth, Celery, Strawberry GraphQL, LangChain + OpenAI SDK + LangGraph + CrewAI, Ruff |
| multi-ecosystem | TypeScript Next.js frontend, Python FastAPI backend (SQLModel/Pydantic/LangChain/Celery), Go Gin service (GORM, gRPC, Cobra, Zap), shared PostgreSQL |

heavy-ts is the stress test. It is also where every interesting failure in this benchmark happened.

The harness

Two sweeps, one harness design:

  • Claude sweep: Claude Code in headless mode (claude -p), JSON event stream captured per run. Models: claude-fable-5, claude-opus-4-8, claude-sonnet-4-6, default reasoning effort.
  • GPT sweep: OpenAI Codex CLI (codex exec), JSON event stream captured per run. Models: gpt-5.3-codex-spark (high effort), gpt-5.4 (medium), gpt-5.5 (medium).
  • Isolation: every run gets a fresh empty workspace; MCP runs get a strict MCP config with only the Better-Fullstack server attached.
  • Timeout: 900 seconds per run. A timeout counts as a failure for the agent.
  • Metrics: wall-clock time, token usage from the agent's own usage report, MCP tool calls from the event stream. Dollar cost is captured on the Claude sweep only — our Codex harness doesn't report metered cost.
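
The isolation and timeout rules above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the actual harness: `run_agent` and `RunOutcome` are hypothetical names, and the command is passed in by the caller, so no agent CLI flags are assumed here. The demo line uses the Python interpreter as a stand-in for an agent command.

```python
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass
from typing import Optional

TIMEOUT_S = 900  # per-run limit stated above


@dataclass
class RunOutcome:
    workspace: str            # kept on disk so validation can inspect it later
    timed_out: bool
    exit_code: Optional[int]  # None when the run was killed at the timeout
    wall_clock_s: float
    stdout: str


def run_agent(command: list[str], timeout_s: float = TIMEOUT_S) -> RunOutcome:
    """Run one command in a fresh empty workspace with a hard wall-clock limit."""
    workspace = tempfile.mkdtemp(prefix="scaffbench-")
    started = time.monotonic()
    try:
        proc = subprocess.run(command, cwd=workspace, capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the run is over here.
        return RunOutcome(workspace, True, None, time.monotonic() - started, "")
    return RunOutcome(workspace, False, proc.returncode,
                      time.monotonic() - started, proc.stdout)


# Stand-in command for the demo; a real run would pass the agent CLI invocation.
demo = run_agent([sys.executable, "-c", "print('ok')"], timeout_s=30)
```

Measuring with a monotonic clock keeps the wall-clock figure immune to system clock adjustments during a long run.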

One important asymmetry: the GPT sweep ran two days earlier, against the generator before we fixed the template bugs it exposed. That's why the generator-bug exclusions below hit the GPT results harder — and why the two sweeps shouldn't be compared head-to-head on pass rates.

What counts as passing

Validation runs after the agent finishes, against whatever it left on disk:

  • TypeScript: bun install, then the project's own build (or check/lint/test) script.
  • Python: compileall across the project, excluding virtualenvs and vendored packages.
  • Go: go mod download and the module must compile.

A run passes only if every applicable check exits zero. "It looks like a project" doesn't count; it has to build.
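
As an illustration of the Python check and the all-checks-must-pass rule, here is a small sketch built on the standard-library `compileall` module. The list of excluded directory names and both function names are assumptions made for this example; the TypeScript and Go checks are not shown.

```python
import compileall
import re
import tempfile
from pathlib import Path

# Directory names treated as virtualenvs or vendored code (an assumption).
EXCLUDED_DIRS = re.compile(r"[/\\](\.venv|venv|site-packages|node_modules)[/\\]")


def python_check(project_dir: Path) -> bool:
    """True if every non-excluded .py file under project_dir byte-compiles."""
    return bool(compileall.compile_dir(str(project_dir), rx=EXCLUDED_DIRS, quiet=2))


def run_passes(check_results: dict[str, bool]) -> bool:
    """A run passes only if every applicable check succeeded.

    Treating "no checks ran" as a failure is an assumption of this sketch.
    """
    return bool(check_results) and all(check_results.values())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_text("x = 1\n")
    (root / ".venv").mkdir()
    (root / ".venv" / "broken.py").write_text("def (:\n")
    # The only broken file sits inside .venv, so it is skipped.
    clean_result = python_check(root)
    # A broken file in the project itself must fail the check.
    (root / "broken.py").write_text("def (:\n")
    broken_result = python_check(root)
```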

What counts as a failure

Failures caused by a bug in our own generator, rather than by anything the agent did, are excluded from the pass-rate denominator; every other failure counts against the agent. On the Claude sweep this exclusion applied exactly three times: all three models, on the BF-mention path, scaffolded heavy-ts correctly via the CLI, and all three hit the same generator bug in our native (Expo) template, where app.json sets a web output mode that only works with expo-router. The agents did everything right; the template couldn't build. Those three runs are excluded. Prompt-path failures (two timeouts, one broken hand-written build) are entirely agent-authored and count in full.

On the GPT sweep, which ran against the pre-fix generator, the exclusion applied seven times: heavy-ts through MCP and through the CLI failed for all three models on the since-fixed Storybook/Expo bug chain, and one multi-ecosystem scaffold shipped a Go service without a generated go.sum. All seven are excluded for the same reason; the GPT agents' own failures (three broken hand-written builds on the prompt path) count in full.
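
The scoring rule in this section reduces to a few lines. The sketch below uses the result labels from the appendix, where an asterisk marks a failure attributed to a generator bug; the function name and the list layout are illustrative.

```python
from collections import Counter


def pass_rate(results: list[str]) -> tuple[int, int]:
    """Return (passed, counted), leaving generator-bug failures out of the denominator."""
    counts = Counter(results)
    excluded = counts["build failed*"]  # asterisk = generator bug on our side
    return counts["pass"], len(results) - excluded


# Claude sweep, per model: heavy-ts, light-ts, multi-ecosystem, python-ai.
claude_bf_mention = ["build failed*", "pass", "pass", "pass"] * 3
claude_prompt = (["timed out", "pass", "pass", "pass"] * 2
                 + ["build failed", "pass", "pass", "pass"])

bf_passed, bf_counted = pass_rate(claude_bf_mention)          # 9 of 9
prompt_passed, prompt_counted = pass_rate(claude_prompt)      # 9 of 12
```

Timeouts and unmarked build failures stay in the denominator, which is why the prompt path lands at 9/12 while BF mention lands at 9/9.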

Results

Per model and path

Times and tokens are averages across the four specs; cost is the metered API total for those four runs.

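| Model | Path | Avg time | Avg output tokens | Cost (4 specs) | Builds passing |
| --- | --- | --- | --- | --- | --- |
| Fable 5 | MCP | 172.6s | 7,590 | $9.34 | 4/4 |
| Fable 5 | BF mention | 405.7s | 17,748 | $15.66 | 3/3* |
| Fable 5 | Prompt | 572.8s | 24,905 | $11.26† | 3/4 |
| Opus 4.8 | MCP | 97.1s | 5,206 | $3.43 | 4/4 |
| Opus 4.8 | BF mention | 154.7s | 10,596 | $4.12 | 3/3* |
| Opus 4.8 | Prompt | 510.8s | 21,485 | $6.89† | 3/4 |
| Sonnet 4.6 | MCP | 70.3s | 3,863 | $1.47 | 4/4 |
| Sonnet 4.6 | BF mention | 98.3s | 4,834 | $1.31 | 3/3* |
| Sonnet 4.6 | Prompt | 464.9s | 31,188 | $7.11 | 3/4 |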

* heavy-ts excluded (generator bug on our side). † Timed-out runs report no cost, so these totals under-count the true spend.
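
Each cell in this table can be reproduced from the per-run rows in the appendix. As a worked example, here is the Fable 5 MCP row recomputed from its four appendix runs. The small differences from the table (172.6s, $9.34) are presumably because the appendix shows rounded per-run values.

```python
from statistics import mean

# Fable 5, MCP path, appendix order: heavy-ts, light-ts, multi-ecosystem, python-ai.
times_s = [290, 82, 240, 78]
output_tokens = [11891, 4166, 10895, 3406]
costs_usd = [2.97, 1.62, 3.22, 1.54]

avg_time = mean(times_s)                # 172.5 seconds
avg_tokens = mean(output_tokens)        # 7589.5 tokens
total_cost = round(sum(costs_usd), 2)   # 9.35 dollars
```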

Cost

The Claude sweep cost $60.60 in metered API usage (less than the true number — the two timed-out runs report $0; the Codex harness doesn't report cost at all). Reading the spread is more interesting than the total:

  • Cheapest passing run: $0.12 — Sonnet 4.6 scaffolding light-ts through the CLI in 21 seconds.
  • Most expensive passing run: $4.21, Fable 5 hand-writing multi-ecosystem on the prompt path. Its other two prompt-path passes cost $3.50 and $3.55, and its BF-mention light-ts run cost $3.74.
  • For Sonnet, the prompt path cost ~5× its MCP path ($7.11 vs $1.47) and was ~6.6× slower, for a lower pass rate.

The MCP-versus-prompt pattern holds across all three models: when the agent has to author the whole project itself, you pay more for a less reliable result.
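
A short sketch of how the cost totals behave when a timed-out run reports nothing, using per-run costs from the appendix. Representing a missing cost as `None` and the helper name `reported_total` are choices made for this illustration.

```python
from typing import Optional


def reported_total(costs: list[Optional[float]]) -> tuple[float, int]:
    """Sum the costs that were reported; also count the runs that reported none."""
    known = [c for c in costs if c is not None]
    return round(sum(known), 2), len(costs) - len(known)


# Appendix order: heavy-ts, light-ts, multi-ecosystem, python-ai.
fable_prompt = [None, 3.50, 4.21, 3.55]  # heavy-ts timed out, so no cost reported
sonnet_prompt = [3.18, 1.88, 1.20, 0.85]
sonnet_mcp = [0.41, 0.33, 0.42, 0.31]

fable_total, fable_missing = reported_total(fable_prompt)  # 11.26 with 1 run missing
sonnet_ratio = reported_total(sonnet_prompt)[0] / reported_total(sonnet_mcp)[0]
```

For Sonnet the ratio comes out near 4.8, which the text rounds to ~5×. For Fable 5 the $11.26 total covers only three of the four runs, so the true prompt-path spend is higher.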

GPT models on Codex CLI

The GPT sweep ran the identical specs and prompts through OpenAI's Codex CLI two days before the Claude sweep: gpt-5.3-codex-spark at high reasoning effort, gpt-5.4 and gpt-5.5 at medium. Generator-bug failures are excluded exactly as above (seven runs here — this sweep predates the template fixes).

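| Model | Path | Avg time | Avg output tokens | Builds passing |
| --- | --- | --- | --- | --- |
| GPT-5.3 Codex Spark | MCP | 32.4s | 5,798 | 3/3* |
| GPT-5.3 Codex Spark | BF mention | 65.6s | 9,894 | 3/3* |
| GPT-5.3 Codex Spark | Prompt | 44.8s | 31,418 | 2/4 |
| GPT-5.4 | MCP | 92.0s | 5,172 | 3/3* |
| GPT-5.4 | BF mention | 156.0s | 7,084 | 3/3* |
| GPT-5.4 | Prompt | 203.1s | 13,328 | 3/4 |
| GPT-5.5 | MCP | 76.5s | 3,838 | 2/2* |
| GPT-5.5 | BF mention | 74.1s | 4,548 | 3/3* |
| GPT-5.5 | Prompt | 264.2s | 15,650 | 4/4 |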

* heavy-ts excluded on MCP and BF mention for all three models, plus one GPT-5.5 multi-ecosystem MCP run — all on since-fixed generator bugs (this sweep ran before the fixes shipped).

What the GPT sweep adds to the picture:

  • The structure is vendor-independent. For every GPT model, prompt-only is the most expensive path in output tokens, and MCP is the fastest path for two of the three (on GPT-5.5, BF mention edged it out, 74.1s to 76.5s). That is the same shape as the Claude results, on a different vendor's models and a different agent CLI.
  • Spark is built for speed, and it shows. GPT-5.3 Codex Spark averaged 32 seconds per MCP scaffold, the fastest model-and-path average in the whole benchmark. On the prompt path its speed came at the price of reliability: it hand-wrote projects in 45 seconds on average, and only half of them built.
  • GPT-5.5 is the prompt-path outlier. It was the only model across both sweeps to pass every prompt-only spec, including heavy-ts — at 542 seconds and 28k output tokens for that one run, versus 108 seconds through MCP.
  • GPT models lean harder on the tools. Codex runs made 10–34 MCP calls per scaffold versus Claude's 3–10, re-checking compatibility and re-planning more often — with the same end result.

A caveat worth repeating: the two sweeps ran on different agent harnesses, different days, and different generator versions. Comparing paths within a sweep is the experiment; comparing vendors across sweeps is not.

Qualitative analysis

MCP keeps agents on rails

On the Claude sweep, MCP runs converged on the same tool sequence: bfs_get_guidance → compatibility check → plan → bfs_create_project, in 3 to 10 calls per run (the multi-ecosystem spec needed the most iterations to settle a valid stack). Output stays small because the agent's job collapses to choosing a configuration; the generator writes the files, deterministically. That's also why pass rates don't degrade as the spec grows: heavy-ts through MCP passed for every Claude model, and the GPT models' heavy-ts MCP failures all trace to generator bugs rather than agent errors.
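
Call counts like these come from the captured event stream. The sketch below shows one way to tally them from a JSON Lines log. The event shape (`{"type": "tool_call", "name": ...}`) is hypothetical and is not the actual format emitted by Claude Code or Codex CLI; `bfs_hypothetical_plan_step` is a made-up tool name standing in for the intermediate steps, whose real names are not given in this post.

```python
import json
from collections import Counter


def count_tool_calls(jsonl_text: str, prefix: str = "bfs_") -> Counter:
    """Count tool-call events per tool name in a JSON Lines event stream."""
    counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Hypothetical event shape: {"type": "tool_call", "name": "<tool>"}.
        if event.get("type") == "tool_call" and event.get("name", "").startswith(prefix):
            counts[event["name"]] += 1
    return counts


sample_stream = "\n".join([
    '{"type": "tool_call", "name": "bfs_get_guidance"}',
    '{"type": "message", "text": "planning"}',
    '{"type": "tool_call", "name": "bfs_hypothetical_plan_step"}',  # made-up name
    '{"type": "tool_call", "name": "bfs_create_project"}',
])

calls = count_tool_calls(sample_stream)
total_calls = sum(calls.values())
```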

Prompt-only agents drown in the heavy spec

On the prompt path, heavy-ts defeated almost everyone. Fable 5 and Opus 4.8 were still writing files at the 15-minute timeout. Sonnet 4.6 finished a 90-file monorepo in 767 seconds, and it didn't build. GPT-5.3 Codex Spark sprinted to a hand-written project in 68 seconds; it didn't build either, and neither did GPT-5.4's 236-second attempt. The single exception across both sweeps was GPT-5.5, which ground through heavy-ts prompt-only in 542 seconds and passed. The lighter specs mostly passed prompt-only, but typically at minutes of wall-clock and roughly 8k–37k output tokens each. Hand-writing a starter is something frontier models can do; it's just the slowest, most expensive, least reliable way to get one.

The benchmark audited our own templates

The most useful failures were ours. Validating every generated project surfaced six layered template bugs in our generator, most hiding behind a single "Storybook build fails" symptom on heavy-ts. The GPT sweep hit them first — eight of its nine heavy-ts builds failed, six of them on this generator chain — and the Claude sweep two days later confirmed which fixes held:

  1. Storybook templates tested the frontend with an equality check, but frontend is an array.
  2. A database package path mismatch left the DB package without its dependencies in graph-part mode.
  3. expo-network was missing from a native template variant.
  4. Storybook 8 framework packages don't re-export Meta/StoryObj types — imports had to move to the renderer packages.
  5. @better-auth/expo needed four Expo peer dependencies installed explicitly.
  6. Native app.json sets a static web output mode that only works with expo-router — this one is still open, and is the bug behind the three excluded BF-mention runs.

Five of the six were fixed and shipped in create-better-fullstack 2.0.2 before this post went out. Running your own product through an agent benchmark turns out to be a brutally effective QA pass.

Model character

Sonnet 4.6 was the fastest on every path and the cheapest on the MCP and BF-mention paths; on the prompt path its $7.11 total exceeds Opus 4.8's $6.89 only because Opus's timed-out heavy-ts run reported no cost. With tooling it gave up nothing in reliability. Fable 5 was the most deliberate, with the longest runs on every path and the highest token counts on both tooled paths, and on the prompt path that thoroughness still wasn't enough to beat the timeout on heavy-ts. Opus 4.8 sat reliably in between. The speed ranking never changed across paths; the gap did. Tooling is the great equalizer: on MCP, the spread between the fastest and slowest model averages was 102 seconds; prompt-only, gaps ran to minutes on individual specs and, twice, into the wall-clock ceiling.

Limitations

  • One run per cell. 72 runs is enough to see the structure, not to put confidence intervals on it. Scaffold times also vary with API load.
  • The sweeps aren't head-to-head. The GPT runs used a different agent CLI (Codex), different reasoning-effort settings, an earlier generator version, and report no dollar cost. Within-sweep path comparisons are sound; cross-vendor model rankings are not what this benchmark measures.
  • We benchmarked our own tool. The harness, prompts, and validation are public in spirit — prompts and policy are quoted above — and the full per-run table is in the appendix. Treat the comparison between paths (tooling vs no tooling) as the finding, not a claim about other scaffolders.
  • Builds, not features. Validation proves the project installs and builds — not that the auth flow works or the Stripe webhook is wired correctly. A passing prompt-only project may still be missing more of the spec than a generated one.
  • Timeout cost under-reporting. Runs killed at 900s report $0 cost, which flatters the prompt path's totals.

What's next

Next up: re-running the Codex sweep against the fixed generator (its excluded cells should flip to green), adding more agent CLIs, repeating trials per cell, and validating feature completeness (does auth actually work?) rather than just builds. The harness also doubles as our regression suite now: every release candidate gets the heavy spec thrown at it.

Appendix: all 72 runs

Claude sweep (Claude Code, June 12)

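| Model | Path | Spec | Time | Output tokens | Cost | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Fable 5 | MCP | heavy-ts | 290s | 11,891 | $2.97 | pass |
| Fable 5 | MCP | light-ts | 82s | 4,166 | $1.62 | pass |
| Fable 5 | MCP | multi-ecosystem | 240s | 10,895 | $3.22 | pass |
| Fable 5 | MCP | python-ai | 78s | 3,406 | $1.54 | pass |
| Fable 5 | BF mention | heavy-ts | 613s | 30,259 | $6.98 | build failed* |
| Fable 5 | BF mention | light-ts | 314s | 16,132 | $3.74 | pass |
| Fable 5 | BF mention | multi-ecosystem | 470s | 15,196 | $2.98 | pass |
| Fable 5 | BF mention | python-ai | 226s | 9,404 | $1.95 | pass |
| Fable 5 | Prompt | heavy-ts | 900s |  |  | timed out |
| Fable 5 | Prompt | light-ts | 413s | 32,080 | $3.50 | pass |
| Fable 5 | Prompt | multi-ecosystem | 570s | 37,436 | $4.21 | pass |
| Fable 5 | Prompt | python-ai | 408s | 30,103 | $3.55 | pass |
| Opus 4.8 | MCP | heavy-ts | 178s | 5,805 | $0.88 | pass |
| Opus 4.8 | MCP | light-ts | 46s | 2,894 | $0.72 | pass |
| Opus 4.8 | MCP | multi-ecosystem | 118s | 8,903 | $1.10 | pass |
| Opus 4.8 | MCP | python-ai | 47s | 3,223 | $0.74 | pass |
| Opus 4.8 | BF mention | heavy-ts | 345s | 24,383 | $2.29 | build failed* |
| Opus 4.8 | BF mention | light-ts | 39s | 2,754 | $0.32 | pass |
| Opus 4.8 | BF mention | multi-ecosystem | 116s | 8,038 | $0.70 | pass |
| Opus 4.8 | BF mention | python-ai | 118s | 7,211 | $0.81 | pass |
| Opus 4.8 | Prompt | heavy-ts | 900s |  |  | timed out |
| Opus 4.8 | Prompt | light-ts | 435s | 33,390 | $2.50 | pass |
| Opus 4.8 | Prompt | multi-ecosystem | 395s | 28,598 | $2.43 | pass |
| Opus 4.8 | Prompt | python-ai | 313s | 23,952 | $1.96 | pass |
| Sonnet 4.6 | MCP | heavy-ts | 90s | 5,094 | $0.41 | pass |
| Sonnet 4.6 | MCP | light-ts | 47s | 2,530 | $0.33 | pass |
| Sonnet 4.6 | MCP | multi-ecosystem | 101s | 5,904 | $0.42 | pass |
| Sonnet 4.6 | MCP | python-ai | 43s | 1,923 | $0.31 | pass |
| Sonnet 4.6 | BF mention | heavy-ts | 98s | 5,527 | $0.28 | build failed* |
| Sonnet 4.6 | BF mention | light-ts | 21s | 795 | $0.12 | pass |
| Sonnet 4.6 | BF mention | multi-ecosystem | 237s | 11,333 | $0.73 | pass |
| Sonnet 4.6 | BF mention | python-ai | 38s | 1,682 | $0.18 | pass |
| Sonnet 4.6 | Prompt | heavy-ts | 767s | 52,926 | $3.18 | build failed |
| Sonnet 4.6 | Prompt | light-ts | 520s | 32,767 | $1.88 | pass |
| Sonnet 4.6 | Prompt | multi-ecosystem | 347s | 22,642 | $1.20 | pass |
| Sonnet 4.6 | Prompt | python-ai | 226s | 16,418 | $0.85 | pass |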

* Failed on a since-identified template generator bug on our side (excluded from agent pass rates). The Sonnet 4.6 prompt-path heavy-ts failure is agent-authored and counts.

GPT sweep (Codex CLI, June 10 — pre-fix generator)

| Model | Path | Spec | Time | Output tokens | Result |
| --- | --- | --- | --- | --- | --- |
| GPT-5.3 Codex Spark | MCP | heavy-ts | 26s | 5,300 | build failed* |
| GPT-5.3 Codex Spark | MCP | light-ts | 18s | 2,870 | pass |
| GPT-5.3 Codex Spark | MCP | multi-ecosystem | 66s | 10,988 | pass |
| GPT-5.3 Codex Spark | MCP | python-ai | 19s | 4,035 | pass |
| GPT-5.3 Codex Spark | BF mention | heavy-ts | 52s | 10,742 | build failed* |
| GPT-5.3 Codex Spark | BF mention | light-ts | 11s | 1,221 | pass |
| GPT-5.3 Codex Spark | BF mention | multi-ecosystem | 147s | 18,676 | pass |
| GPT-5.3 Codex Spark | BF mention | python-ai | 53s | 8,936 | pass |
| GPT-5.3 Codex Spark | Prompt | heavy-ts | 68s | 54,325 | build failed |
| GPT-5.3 Codex Spark | Prompt | light-ts | 51s | 27,446 | build failed |
| GPT-5.3 Codex Spark | Prompt | multi-ecosystem | 38s | 21,353 | pass |
| GPT-5.3 Codex Spark | Prompt | python-ai | 23s | 22,549 | pass |
| GPT-5.4 | MCP | heavy-ts | 51s | 3,016 | build failed* |
| GPT-5.4 | MCP | light-ts | 44s | 2,153 | pass |
| GPT-5.4 | MCP | multi-ecosystem | 217s | 13,235 | pass |
| GPT-5.4 | MCP | python-ai | 55s | 2,284 | pass |
| GPT-5.4 | BF mention | heavy-ts | 322s | 13,834 | build failed* |
| GPT-5.4 | BF mention | light-ts | 30s | 753 | pass |
| GPT-5.4 | BF mention | multi-ecosystem | 144s | 7,870 | pass |
| GPT-5.4 | BF mention | python-ai | 128s | 5,881 | pass |
| GPT-5.4 | Prompt | heavy-ts | 236s | 15,502 | build failed |
| GPT-5.4 | Prompt | light-ts | 251s | 15,271 | pass |
| GPT-5.4 | Prompt | multi-ecosystem | 170s | 11,795 | pass |
| GPT-5.4 | Prompt | python-ai | 155s | 10,745 | pass |
| GPT-5.5 | MCP | heavy-ts | 108s | 6,704 | build failed* |
| GPT-5.5 | MCP | light-ts | 58s | 2,544 | pass |
| GPT-5.5 | MCP | multi-ecosystem | 97s | 4,101 | build failed* |
| GPT-5.5 | MCP | python-ai | 43s | 2,003 | pass |
| GPT-5.5 | BF mention | heavy-ts | 120s | 6,851 | build failed* |
| GPT-5.5 | BF mention | light-ts | 26s | 1,513 | pass |
| GPT-5.5 | BF mention | multi-ecosystem | 84s | 5,480 | pass |
| GPT-5.5 | BF mention | python-ai | 66s | 4,347 | pass |
| GPT-5.5 | Prompt | heavy-ts | 542s | 28,342 | pass |
| GPT-5.5 | Prompt | light-ts | 163s | 10,601 | pass |
| GPT-5.5 | Prompt | multi-ecosystem | 242s | 15,787 | pass |
| GPT-5.5 | Prompt | python-ai | 110s | 7,869 | pass |

* Failed on a since-fixed template generator bug on our side (excluded from agent pass rates) — this sweep ran before the fixes shipped. GPT prompt-path failures are agent-authored and count.


Want to see the fast path yourself? Point your agent at the Better-Fullstack MCP server and ask it to scaffold something heavy.