programmatic CAD benchmark

BenchCAD: a benchmark for programmatic CAD

17,900 execution-verified CadQuery programs across 106 industrial families, anchored in 47 ISO / DIN / EN / ASME / IEC standards. Graded by voxel IoU and ratio accuracy — no LLM judge.
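As a rough illustration of the voxel IoU metric, the sketch below treats each shape as a set of occupied voxel indices and divides the shared voxels by the combined voxels. The helper names (`voxelize`, `voxel_iou`, `box`), the grid pitch, and the empty-grid convention are assumptions made for this example; the benchmark's actual voxelisation resolution and alignment steps are not described here.

```python
def voxelize(points, pitch):
    """Map 3D points to the set of integer voxel indices they fall into."""
    return {tuple(int(c // pitch) for c in p) for p in points}


def voxel_iou(a, b):
    """Intersection over union of two sets of occupied voxel indices."""
    if not a and not b:
        return 1.0  # assumed convention for two empty grids
    return len(a & b) / len(a | b)


def box(nx, ny, nz, offset=(0, 0, 0)):
    """Hypothetical helper: a solid block of voxels, used only for this demo."""
    ox, oy, oz = offset
    return {(x + ox, y + oy, z + oz)
            for x in range(nx) for y in range(ny) for z in range(nz)}


ground_truth = box(4, 4, 4)                   # 64 voxels
prediction = box(4, 4, 4, offset=(1, 0, 0))   # same block shifted one voxel in x
iou = voxel_iou(ground_truth, prediction)     # 48 shared voxels / 80 in the union
```

A one-voxel shift on a 4-voxel-wide block already drops the score to 0.6, which gives a feel for how strict a volumetric overlap metric is on small parts.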

4 matched tasks · 46 CadQuery operations

[Figure: BenchCAD part #014, bevel_gear, a dimensioned BenchCAD part. Dimensions shown: Ø 86.0 and 120.0. Title block: part bevel_gear, scale 1:1, rev 0.1.]

trusted by Anthropic

leaderboard · Vision2Code

how the frontier scores

Image-to-CadQuery, ranked by composite score. Even the best frontier model reaches only 0.3546 IoU, so the benchmark is far from saturated.

#	Model	Org	IoU ↑	total ↑
1	Gemini 3.1 Pro thinking	Google	0.3546	0.3457
2	Gemini 3.1 Pro	Google	0.3140	0.3315
3	Claude Opus 4.7 thinking	Anthropic	0.2790	0.3238
4	Claude Opus 4.7	Anthropic	0.2740	0.3075
5	Claude Sonnet 4.6 thinking-high	Anthropic	0.2567	0.2979
6	GPT-5.3 thinking	OpenAI	0.2187	0.2497

examples

what models are asked to build

Each part ships with ground-truth CadQuery, STEP, mesh, a 4-view render, and numeric QA — drawn from 106 standard-anchored families.

twisted_drill

impeller

bevel_gear

coil_spring

heat_sink

t_pipe_fitting

spline_hub

double_simplex_sprocket

capability-decomposed

four matched tasks

The same parts, queried four ways — so a failure traces to visual recognition, parametric abstraction, or code synthesis rather than one lumped number.

img2cq

Vision2Code

Four orthographic views → a CadQuery program, re-executed and scored by IoU, Chamfer, essential-op recall, and exec rate.
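Three of the Vision2Code metrics named above can be sketched in a few lines. Everything here is an assumption made for illustration: the Chamfer variant (unsquared, mean nearest-neighbour distance in both directions), the regex-based detection of operations, and all function names. The benchmark's own definitions may differ.

```python
import math
import re


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def called_ops(source):
    """Names of methods invoked as `.name(` in a program's source text."""
    return set(re.findall(r"\.([A-Za-z_]\w*)\s*\(", source))


def essential_op_recall(pred_source, essential_ops):
    """Fraction of the essential operations that appear in the prediction."""
    essential = set(essential_ops)
    if not essential:
        return 1.0  # assumed convention when nothing is required
    return len(essential & called_ops(pred_source)) / len(essential)


def exec_rate(outcomes):
    """Share of generated programs that executed without error."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# The program text is only inspected as a string here, never executed.
pred_source = 'cq.Workplane("XY").circle(10).extrude(5)'
recall = essential_op_recall(pred_source, {"circle", "extrude", "hole"})

pts_a = [(0, 0, 0), (1, 0, 0)]
pts_b = [(0, 0, 0), (1, 0, 1)]
cd = chamfer_distance(pts_a, pts_b)

rate = exec_rate([True, True, False, True])
```

In this toy case the prediction uses two of the three required operations, so recall is 2/3. Reporting recall alongside IoU separates a program that chose the wrong operations from one that chose the right operations with the wrong dimensions.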

qa_img

Vision QA

Numeric geometric reasoning from rendered views, broken out across the L1–L4 capability hierarchy.
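One plausible way to score numeric answers and break them out by level is sketched below. The 5% relative tolerance, the record layout, and the function names are hypothetical choices for this example. The benchmark's actual grading rule and the definitions of L1–L4 are not given here.

```python
from collections import defaultdict


def within_tolerance(pred, truth, rel_tol=0.05):
    """True if pred is within rel_tol of truth (absolute check when truth is 0)."""
    if truth == 0:
        return abs(pred) <= rel_tol
    return abs(pred - truth) / abs(truth) <= rel_tol


def accuracy_by_level(records, rel_tol=0.05):
    """records: iterable of (level, predicted_value, true_value) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, pred, truth in records:
        totals[level] += 1
        hits[level] += within_tolerance(pred, truth, rel_tol)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Made-up answers, purely to exercise the functions.
records = [
    ("L1", 86.0, 86.0),   # exact
    ("L1", 80.0, 86.0),   # about 7% off, counted as a miss
    ("L2", 2.05, 2.0),    # 2.5% off, counted as a hit
    ("L4", 0.7, 1.4),     # 50% off, counted as a miss
]
per_level = accuracy_by_level(records)
```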

qa_code

Code QA

The same questions conditioned on CadQuery source — the matched-pair gap isolates seeing from reading.
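The matched-pair idea can be made concrete with a small sketch: because each question is asked once from images and once from code, the two outcomes can be compared question by question. The data layout and the function name `matched_pair_gap` are assumptions for illustration, and the results are made up.

```python
def matched_pair_gap(results):
    """results maps question id -> (correct_from_image, correct_from_code)."""
    n = len(results)
    img_acc = sum(img for img, _ in results.values()) / n
    code_acc = sum(code for _, code in results.values()) / n
    # Questions answered correctly from code but not from images point at
    # a perception failure rather than a reasoning failure.
    code_only = sorted(q for q, (img, code) in results.items() if code and not img)
    return {
        "img_acc": img_acc,
        "code_acc": code_acc,
        "gap": code_acc - img_acc,
        "code_only": code_only,
    }


# Made-up outcomes for four paired questions.
results = {
    "q1": (True, True),
    "q2": (False, True),
    "q3": (False, True),
    "q4": (False, False),
}
summary = matched_pair_gap(results)
```

Here the model answers q2 and q3 correctly from source but not from renders, so the 0.5 gap is attributable to seeing. q4 fails in both conditions and suggests a reasoning problem instead.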

edit_code

Code Edit

Instruction-guided program editing, scored by headroom-normalised improvement over the original→target gap.
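One reading of "headroom-normalised improvement" is sketched below: measure how far the original program already is from the target, then report what fraction of that remaining gap the edit closes. The formula, the use of IoU as the similarity score, and the function name are assumptions for this example, not the benchmark's confirmed definition.

```python
def headroom_normalised_improvement(iou_original, iou_edited):
    """Fraction of the original->target gap closed by the edit.

    Both arguments are similarity scores against the target shape, in [0, 1].
    Returns 1.0 for a perfect edit, 0.0 for no change, negative for regressions.
    """
    headroom = 1.0 - iou_original
    if headroom <= 0.0:
        return 0.0  # assumed convention: nothing left to improve
    return (iou_edited - iou_original) / headroom


# Original program overlaps the target at 0.6 IoU; the edited one reaches 0.9.
good_edit = headroom_normalised_improvement(0.6, 0.9)   # closes 3/4 of the gap
no_op = headroom_normalised_improvement(0.6, 0.6)       # returns the input unchanged
regression = headroom_normalised_improvement(0.6, 0.4)  # makes things worse
```

Normalising by headroom keeps a model from scoring well by returning the original program unchanged, which would look strong under raw IoU whenever the original and target shapes are already similar.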