17,900 execution-verified CadQuery programs across 106 industrial families, anchored in 47 ISO / DIN / EN / ASME / IEC standards. Graded by voxel IoU and ratio accuracy — no LLM judge.
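
As a rough illustration of what grading by voxel IoU involves, the sketch below compares two solids that have already been voxelised into sets of occupied integer cells. The helper names (`box_voxels`, `voxel_iou`), the set-based grid, and the handling of two empty grids are assumptions made for illustration only; the benchmark's actual voxeliser, resolution, and alignment step are not described here.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def box_voxels(size: Tuple[int, int, int],
               origin: Tuple[int, int, int] = (0, 0, 0)) -> Set[Voxel]:
    """Occupied cells of an axis-aligned box (hypothetical stand-in for a voxelised part)."""
    sx, sy, sz = size
    ox, oy, oz = origin
    return {
        (ox + i, oy + j, oz + k)
        for i in range(sx)
        for j in range(sy)
        for k in range(sz)
    }


def voxel_iou(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Intersection-over-union of two occupancy sets."""
    union = pred | truth
    if not union:
        # Assumption: two empty grids are treated as identical.
        return 1.0
    return len(pred & truth) / len(union)


truth = box_voxels((4, 4, 4))   # 64 occupied cells
pred = box_voxels((4, 4, 2))    # 32 cells, all inside the ground truth
print(voxel_iou(pred, truth))   # 0.5
```

A prediction that covers half of the ground-truth volume and nothing outside it scores 0.5, which puts the leaderboard's best score of about 0.35 in perspective.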

Sort and filter 50+ re-graded results — never self-reported — by org, class, and metric.
The best Vision2Code (image-to-CadQuery) score is only 0.3546 IoU, from Gemini 3.1 Pro thinking; GPT-5.3, Gemini 3.1 Pro, and Claude Opus 4.7 all struggle to reproduce industrial geometry.
Models are ranked by composite score. With the leader barely clearing 0.35 IoU, the benchmark is far from saturated.
| # | Model | Org | IoU ↑ | total ↑ |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro thinking | Google | 0.3546 | 0.3457 |
| 2 | Gemini 3.1 Pro | Google | 0.3140 | 0.3315 |
| 3 | Claude Opus 4.7 thinking | Anthropic | 0.2790 | 0.3238 |
| 4 | Claude Opus 4.7 | Anthropic | 0.2740 | 0.3075 |
| 5 | Claude Sonnet 4.6 thinking-high | Anthropic | 0.2567 | 0.2979 |
| 6 | GPT-5.3 thinking | OpenAI | 0.2187 | 0.2497 |
Each part ships with ground-truth CadQuery, STEP, mesh, a 4-view render, and numeric QA — drawn from 106 standard-anchored families.

The same parts, queried four ways — so a failure traces to visual recognition, parametric abstraction, or code synthesis rather than one lumped number.
- Four orthographic views → a CadQuery program, re-executed and scored by IoU, Chamfer, essential-op recall, and exec rate.
- Numeric geometric reasoning from rendered views, broken out across the L1–L4 capability hierarchy.
- The same questions conditioned on CadQuery source; the matched-pair gap isolates seeing from reading.
- Instruction-guided program editing, scored by headroom-normalised improvement over the original→target gap.
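
One plausible reading of "headroom-normalised improvement" for the editing task is sketched below: the gain in similarity to the target, divided by how much room the original program left to improve. The formula, the function name, and the choice of IoU as the similarity measure are all assumptions for illustration, not the benchmark's published definition.

```python
def headroom_normalised_improvement(sim_original: float,
                                    sim_edited: float,
                                    eps: float = 1e-9) -> float:
    """Improvement toward the target as a fraction of the available headroom.

    sim_original: similarity (e.g. IoU) of the unedited program to the target.
    sim_edited:   similarity of the model's edited program to the target.
    Both are assumed to lie in [0, 1], with 1 meaning a perfect match.
    """
    headroom = 1.0 - sim_original
    if headroom <= eps:
        # Assumption: if the original already matches the target,
        # there is nothing to gain, so the score is reported as 0.
        return 0.0
    return (sim_edited - sim_original) / headroom


# Closing half of the original→target gap.
print(headroom_normalised_improvement(0.5, 0.75))  # 0.5
# An edit that moves away from the target scores negative.
print(headroom_normalised_improvement(0.5, 0.25))  # -0.5
```

Under this reading, normalising by headroom keeps easy edits (where the original is already close to the target) from dominating the average, and a harmful edit is penalised rather than clipped at zero.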