> product // flarebench // the making of
FlareBench
how it was made.
FlareBench started as an honest question: when do you trust the agent, and when do you steer? The answer required a real harness, tasks deployed to a live Cloudflare account, graded by hitting the URL and driving a real browser, not by asking the model to self-assess. What emerged in five days was a diagnostic map of where AI coding agents fall short on Cloudflare, and a skill factory that turns 480 documentation pages into measured, provenance-tracked steering facts.
The short version
FlareBench was born on 31 May 2026 with twelve commits: the benchmark concept, a task harness, a grading taxonomy that went beyond pass/fail, and a live front site by end of day. The next day ran fifty-five commits: a sandbox runner deployed to real Cloudflare isolation, a sixteen-model matrix, a per-cell timeout added after a cell hung for eighty minutes, and verifiers de-brittled so they graded outcomes rather than rigid output shapes. By day three the project had pivoted from 'benchmark' to something broader: a skill factory that reads official documentation, derives measured micro-skills from it, and curates only the facts that actually change what models get wrong. The ranking is a byproduct. The product is the diagnosis.
// the build log · mined from the commit history, nothing dramatised
Harness, tasks, site
Twelve commits on the founding day: the benchmark concept landed, five judgement tasks written by hand, a two-layer grading scheme (deterministic facts plus judged nuance), and the front site deployed. The core design decision, real deploys to a real Cloudflare account, graded by live URL, was in place before the repo was a day old. A VISION.md reframe established early that the ranking was a byproduct; the diagnostic map was the product.
Sandbox, matrix, de-brittling
Fifty-five commits in a single day. The sandbox runner deployed and host isolation confirmed. A sixteen-model matrix landed with per-cell timeouts after one run hung for eighty minutes. Verifiers were de-brittled to grade the outcome rather than the form: a false 'injected' verdict on one model caused by filename rigidity was corrected, and outcome labels were tightened so 'no-deploy' only fired when truly not deployed. The machine overheated mid-session from a runaway dev-server leak; the fix landed the same day. The first benchmark-derived micro-skill was measured and kept.
Derive-skill and the discovery engine
Alongside the harness work, a second thread opened: a 'derive-skill' capstone that takes a documentation URL and emits measured micro-skills, content-addressed in a store. A discovery engine read real failing code for candidate micro-skills. Throughput was parallelised roughly sixty-fold. By the end of the day the first auto-derived micro-skill, a D1 newline gotcha, measured to lift eight of seventeen models, was verified load-bearing and kept.
480 pages, 1,000+ facts, ruthless curation
The factory was generalised to any platform and proved on the AI SDK: 319 pages, seven cycles, 844 curated facts. A semantic-dedup pass cut 1,099 raw facts to 820. An audit pass deleted 23 verified traps, dead model IDs, removed backup commands, a hallucinated Workers AI import, marketing non-facts, dangerous Durable Object spread patterns. Skills were trimmed to steering facts with no giant code blocks. Only facts that flipped two or more models in the panel were kept. The cloudflare-current and ai-sdk-current skill packages shipped to standards-compliant SKILL.md format.
OpenAI, Gemini, Anthropic, MCP, TanStack
Five more platforms in two days: OpenAI (Responses API, 106 pages), Gemini (134 pages), Anthropic (64 pages), MCP (34 pages of build-an-MCP-server docs), and TanStack (406 pages of Query/Router/Start/Form/Table). The factory proved it could catch real API migrations: the Gemini pipeline surfaced a response_format shape change against the live docs. All seven platforms are now documented on the site; the Skills page was rebuilt on day five to reflect the real library.
git log: “Audit pass: delete 23 verified traps (dead model ids, removed D1 backup cmds, hallucinated workers-ai import, workflows D1-leak, marketing non-facts, dangerous DO spread)”
The skill library is only as useful as it is honest: twenty-three facts auto-derived and then removed when measured against ground truth.
// the roads not taken
Tried, measured, set aside: the judgement lives here as much as in what shipped.
Verifiers graded the form, not the outcome
Early verifiers checked for specific output shapes, which caused a false 'injected' verdict on one model when the grader expected a fixed filename. The verifiers were rebuilt to grade the outcome, what actually happened on the live URL, making the benchmark fairer as model behaviour varied.
23 verified traps deleted from the skill library
An audit pass removed 23 facts that had been auto-derived but were wrong or stale: dead model IDs, removed D1 backup commands, a hallucinated Workers AI import path, and dangerous Durable Object spread patterns. Measured and deleted rather than left in to mislead.
Skills kept only when they actually move models
The curation rule was set early: emit a skill only when it flips two or more models on the benchmark panel. Facts that were accurate but didn't change behaviour were filtered out. The skill library is smaller than the raw extraction: that was the point.
Want something built like this?
This is how we work: in the open, measured, honest about the dead ends.