cpa-copilot

TaxCalcBench results

Benchmark

TaxCalcBench (Column Tax): compute a complete 2024 Form 1040 from taxpayer data and score it against the IRS-format ground-truth return. 51 returns, 19 scored Form 1040 lines per return (969 line-level checks). Strict = exact match; lenient = within $5.
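The strict and lenient checks described above can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual harness: the function name `score_return`, the line labels, and the dollar amounts are hypothetical. It also assumes that "within $5" is inclusive and that a return counts as passing only when every scored line passes.

```python
from decimal import Decimal

LENIENT_TOLERANCE = Decimal("5")  # "within $5"; treating the bound as inclusive is an assumption


def score_return(predicted, expected, tolerance=LENIENT_TOLERANCE):
    """Compare predicted line values against ground truth for one return.

    Both arguments map a (hypothetical) line label to a dollar amount.
    Assumption: a return passes only if every scored line passes.
    """
    strict_hits = 0
    lenient_hits = 0
    for line, truth in expected.items():
        value = predicted.get(line)
        if value is None:
            continue  # a missing line fails both checks
        diff = abs(Decimal(str(value)) - Decimal(str(truth)))
        if diff == 0:
            strict_hits += 1
        if diff <= tolerance:
            lenient_hits += 1
    total = len(expected)
    return {
        "lines_total": total,
        "lines_strict": strict_hits,
        "lines_lenient": lenient_hits,
        "return_strict": strict_hits == total,
        "return_lenient": lenient_hits == total,
    }


# Made-up example: one line is off by $3, so the return fails strict but passes lenient.
expected = {"line_11": 52000, "line_15": 37400, "line_24": 4250}
predicted = {"line_11": 52000, "line_15": 37400, "line_24": 4253}
result = score_return(predicted, expected)
```

With 19 scored lines on each of 51 returns, a full run performs 51 × 19 = 969 such line comparisons.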

Configuration

Model: gpt-5.5, low reasoning
Method: deterministic 2024 Form 1040 calculator (src/tax1040), finalized through the agent path
Tools: code interpreter only; no retrieval, no web search
Override rule: agent line values accepted only within $1 of the calculated value
Run: full 51-case run, 2026-06-01
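One way to read the override rule is as a per-line guard around the calculator's output. The sketch below is an assumption-laden illustration, not the code in src/tax1040: the name `finalize_line` is hypothetical, the $1 bound is treated as inclusive, and falling back to the calculated value when the agent's value is rejected is assumed rather than stated.

```python
from decimal import Decimal

OVERRIDE_TOLERANCE = Decimal("1")  # dollars, taken from the override rule


def finalize_line(calculated, agent_value, tolerance=OVERRIDE_TOLERANCE):
    """Pick the final value for one Form 1040 line (hypothetical helper).

    The agent's value is kept only if it lies within `tolerance` dollars of
    the deterministic calculation; otherwise the calculated value is used
    (an assumed fallback).
    """
    calc = Decimal(str(calculated))
    if agent_value is None:
        return calc
    agent = Decimal(str(agent_value))
    if abs(agent - calc) <= tolerance:
        return agent
    return calc


# Made-up amounts: a $1 rounding difference is accepted, a $10 difference is not.
accepted = finalize_line(4250, 4251)
rejected = finalize_line(4250, 4260)
```

Under this reading the agent can adjust rounding on a line but cannot move it materially away from the deterministic result.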

Results

Returns exact: 62.7% (32 / 51)
Returns within $5: 78.4% (40 / 51)
Lines exact: 91.4%
Median time per return: 19 s
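The return-level percentages follow directly from the stated counts. A minimal arithmetic check, using only numbers given on this page (the helper name `pct` is hypothetical):

```python
TOTAL_RETURNS = 51
LINES_PER_RETURN = 19


def pct(hits, total, digits=1):
    """Percentage of hits out of total, rounded to `digits` decimals."""
    return round(100 * hits / total, digits)


line_checks = TOTAL_RETURNS * LINES_PER_RETURN  # 969 line-level checks
exact_rate = pct(32, TOTAL_RETURNS)             # returns exact
lenient_rate = pct(40, TOTAL_RETURNS)           # returns within $5
exact_rate_2dp = pct(32, TOTAL_RETURNS, 2)      # same ratio at two decimals
```

At two decimals, 32 / 51 rounds to 62.75%, so a 62.7% figure and a 62.75% figure can both describe 32 exact returns out of 51.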

Comparison

TaxCalcBench TY24, 51 returns. Published scores are each model's best across a thinking-budget sweep (Column Tax leaderboard). The Reasoning column shows the level that produced the best strict-return score, on the scale lobotomized → low → medium → high → ultrathink, where ultrathink is the model's maximum budget (OpenAI's “xhigh”). Ties are shown at the higher budget.
System                 Reasoning           Returns exact  Returns ±$5  Lines exact  Lines ±$5
cpa-copilot (gpt-5.5)  low                 62.7%          78.4%        91.4%        94.8%
GPT-5.4 Pro            ultrathink (xhigh)  62.75%         72.55%       89.99%       93.40%
GPT-5.4                ultrathink (xhigh)  62.75%         66.67%       89.78%       91.12%
Claude Opus 4.6        ultrathink          52.94%         64.71%       87.00%       89.16%
Gemini 3.1 Pro         ultrathink          49.02%         68.63%       88.54%       92.16%
GPT-5 w/ Web Search    high                41.67%         54.41%       83.90%       87.64%
Claude Sonnet 4.6      ultrathink          37.25%         56.86%       84.21%       88.65%
Gemini 3 Pro           high                36.27%         73.53%       85.42%       93.83%
Gemini 2.5 Pro         lobotomized         32.35%         51.96%       81.22%       86.12%
GPT-5                  high                31.86%         54.41%       81.45%       86.09%
Claude Sonnet 4.5      ultrathink          31.37%         51.47%       81.17%       85.81%
Claude Haiku 4.5       ultrathink          13.24%         39.22%       73.94%       80.93%
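The rule for filling the Reasoning column (best strict-return score across the sweep, ties going to the higher budget) can be sketched as below. The function name `best_level` and the sweep scores are hypothetical and are not leaderboard data; only the level ordering comes from the description above.

```python
# Budget levels in increasing order, as listed in the comparison notes.
LEVELS = ["lobotomized", "low", "medium", "high", "ultrathink"]


def best_level(strict_by_level):
    """Return the level with the highest strict-return score.

    Ties are broken in favor of the higher budget, i.e. the level that
    appears later in LEVELS.
    """
    return max(
        strict_by_level,
        key=lambda level: (strict_by_level[level], LEVELS.index(level)),
    )


# Made-up sweep: "high" and "ultrathink" tie, so the higher budget is reported.
sweep = {"low": 40.0, "medium": 45.1, "high": 47.1, "ultrathink": 47.1}
chosen = best_level(sweep)
```

Sorting on the pair (score, level index) keeps the tie-break explicit rather than relying on dictionary order.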