programmatic CAD benchmark

BenchCAD: a benchmark for programmatic CAD

17,900 execution-verified CadQuery programs across 106 industrial families, anchored in 47 ISO / DIN / EN / ASME / IEC standards. Graded by voxel IoU and ratio accuracy — no LLM judge.
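As a rough illustration of the voxel IoU metric named above, here is a minimal sketch in plain Python. It assumes both shapes have already been voxelized onto the same integer grid. The page does not state BenchCAD's actual grid resolution, alignment step, or grading code, so `voxelize_box` and `voxel_iou` are hypothetical names used only for this example.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def voxelize_box(origin: Voxel, size: Voxel) -> Set[Voxel]:
    """Hypothetical helper: occupied cells of an axis-aligned box on a unit grid."""
    ox, oy, oz = origin
    sx, sy, sz = size
    return {
        (x, y, z)
        for x in range(ox, ox + sx)
        for y in range(oy, oy + sy)
        for z in range(oz, oz + sz)
    }


def voxel_iou(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Intersection over union of two occupied-voxel sets."""
    union = pred | truth
    if not union:
        # Assumption: two empty shapes count as a perfect match.
        return 1.0
    return len(pred & truth) / len(union)


# Ground truth: a 4x4x4 cube. Prediction: the same cube shifted one cell in x.
truth = voxelize_box((0, 0, 0), (4, 4, 4))
pred = voxelize_box((1, 0, 0), (4, 4, 4))

# Overlap is 3*4*4 = 48 cells, union is 64 + 64 - 48 = 80 cells.
print(voxel_iou(pred, truth))  # 0.6
```

In this toy case a one-cell shift on a four-cell cube already drops the score to 0.6, which suggests why modest geometric errors can cost a lot under an IoU metric.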

4 matched tasks · 46 CadQuery operations
[Figure: dimensioned drawing of BenchCAD part #014, bevel_gear. Dimensions Ø 86.0 and 120.0; scale 1:1; rev 0.1.]
trusted by Anthropic
live leaderboard
frontier models across four matched tasks

Sort and filter 50+ re-graded results — never self-reported — by org, class, and metric.

result
the frontier is far from solving CAD

The best Vision2Code score is just 0.35 IoU (Gemini 3.1 Pro thinking, 0.3546). GPT-5.3, Gemini 3.1 Pro, and Claude Opus 4.7 all struggle to reproduce industrial geometry.

leaderboard · Vision2Code

how the frontier scores

Image-to-CadQuery, ranked by composite score. Even the best frontier model barely clears 0.35 IoU — the benchmark is far from saturated.

#  Model                             Org        IoU     total
1  Gemini 3.1 Pro thinking           Google     0.3546  0.3457
2  Gemini 3.1 Pro                    Google     0.3140  0.3315
3  Claude Opus 4.7 thinking          Anthropic  0.2790  0.3238
4  Claude Opus 4.7                   Anthropic  0.2740  0.3075
5  Claude Sonnet 4.6 thinking-high   Anthropic  0.2567  0.2979
6  GPT-5.3 thinking                  OpenAI     0.2187  0.2497

examples

what models are asked to build

Each part ships with ground-truth CadQuery, STEP, mesh, a 4-view render, and numeric QA — drawn from 106 standard-anchored families.

capability-decomposed

four matched tasks

The same parts, queried four ways — so a failure traces to visual recognition, parametric abstraction, or code synthesis rather than one lumped number.

benchmark: BenchCAD
verified parts: 17,900
families: 106
standards: 47
CadQuery ops: 46
LLM judge: none