cpa-copilot

TaxCalcBench results

Benchmark

TaxCalcBench (Column Tax): compute a complete 2024 Form 1040 from taxpayer data and score it against the IRS-format ground-truth return. 51 returns, 19 scored Form 1040 lines per return (969 line-level checks). Strict = exact match; lenient = within $5.
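The strict and lenient rules above can be sketched as a small scorer. This is only an illustration, not the benchmark's own harness: the function name `score_return`, the line labels, and the dollar amounts are hypothetical, and two points are assumptions rather than facts from the benchmark description: that "within $5" is inclusive, and that a return counts as exact only when every scored line matches.

```python
from decimal import Decimal
from typing import Dict

# Lenient tolerance in dollars, taken from the benchmark description.
LENIENT_TOLERANCE = Decimal("5")


def score_return(expected: Dict[str, Decimal], actual: Dict[str, Decimal]) -> dict:
    """Score one return line by line (hypothetical helper, not the real harness)."""
    strict_hits = 0
    lenient_hits = 0
    for line, want in expected.items():
        got = actual.get(line)
        if got is None:
            continue  # assumption: a missing line is a miss under both rules
        diff = abs(got - want)
        if diff == 0:
            strict_hits += 1
        if diff <= LENIENT_TOLERANCE:  # assumption: the $5 bound is inclusive
            lenient_hits += 1
    total = len(expected)
    return {
        "lines": total,
        "strict_lines": strict_hits,
        "lenient_lines": lenient_hits,
        # assumption: a return passes only if all of its scored lines pass
        "strict_return": strict_hits == total,
        "lenient_return": lenient_hits == total,
    }


# Made-up values for three lines; a real return has 19 scored lines.
expected = {"line_11": Decimal("52000"), "line_16": Decimal("4321"), "line_24": Decimal("4321")}
actual = {"line_11": Decimal("52000"), "line_16": Decimal("4324"), "line_24": Decimal("4330")}
result = score_return(expected, actual)
print(result)
```

In this made-up case one line matches exactly, a second is off by $3 (lenient pass only), and a third is off by $9 (fails both), so the return fails under both rules. Across the full set, 51 returns × 19 lines gives the 969 line-level checks.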

Configuration

Model: gpt-5.5, low reasoning
Method: deterministic 2024 Form 1040 calculator (src/tax1040), finalized through the agent path
Tools: code interpreter only; no retrieval, no web search
Override rule: agent line values accepted only within $1 of the calculated value
Run: full 51-case run, 2026-06-01
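The override rule in the configuration above can be sketched as a guard around each line value. This is a minimal illustration under stated assumptions, not the code in src/tax1040: the name `finalize_line` is hypothetical, the $1 bound is assumed to be inclusive, and a rejected agent value is assumed to fall back to the calculated value.

```python
from decimal import Decimal
from typing import Optional

# Override tolerance in dollars, taken from the configuration.
OVERRIDE_TOLERANCE = Decimal("1")


def finalize_line(calculated: Decimal, agent_value: Optional[Decimal]) -> Decimal:
    """Pick the value to report for one line (hypothetical helper)."""
    if agent_value is None:
        return calculated  # the agent proposed nothing for this line
    if abs(agent_value - calculated) <= OVERRIDE_TOLERANCE:  # assumption: inclusive
        return agent_value  # close enough to the calculator to be accepted
    return calculated  # assumption: out-of-range overrides are discarded


# Made-up amounts for illustration only.
accepted = finalize_line(Decimal("4321.00"), Decimal("4321.50"))
rejected = finalize_line(Decimal("4321.00"), Decimal("4400.00"))
print(accepted, rejected)
```

Under these assumptions the agent can adjust rounding-level differences but cannot move a line away from the deterministic result.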

Results

Returns exact: 62.7% (32 / 51)
Returns within $5: 78.4% (40 / 51)
Lines exact: 91.4%
Median time per return: 19s

Comparison

TaxCalcBench TY24, 51 returns. Published scores are each model's best across a thinking-budget sweep (Column Tax leaderboard). The Reasoning column shows the level that produced the best strict-return score on the scale lobotomized→low→medium→high→ultrathink, where ultrathink is the model's maximum budget (OpenAI's “xhigh”). Ties are shown at the higher budget. The cpa-copilot row is the single run described under Configuration.
System	Reasoning	Returns exact	Returns ±$5	Lines exact	Lines ±$5
cpa-copilot (gpt-5.5)	low	62.7%	78.4%	91.4%	94.8%
GPT-5.4 Pro	ultrathink (xhigh)	62.75%	72.55%	89.99%	93.40%
GPT-5.4	ultrathink (xhigh)	62.75%	66.67%	89.78%	91.12%
Claude Opus 4.6	ultrathink	52.94%	64.71%	87.00%	89.16%
Gemini 3.1 Pro	ultrathink	49.02%	68.63%	88.54%	92.16%
GPT-5 w/ Web Search	high	41.67%	54.41%	83.90%	87.64%
Claude Sonnet 4.6	ultrathink	37.25%	56.86%	84.21%	88.65%
Gemini 3 Pro	high	36.27%	73.53%	85.42%	93.83%
Gemini 2.5 Pro	lobotomized	32.35%	51.96%	81.22%	86.12%
GPT-5	high	31.86%	54.41%	81.45%	86.09%
Claude Sonnet 4.5	ultrathink	31.37%	51.47%	81.17%	85.81%
Claude Haiku 4.5	ultrathink	13.24%	39.22%	73.94%	80.93%
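The selection rule described in the comparison note (best strict-return score across the sweep, ties shown at the higher budget) can be sketched as follows. The function name `best_run`, the record layout, and the scores in the example are hypothetical and are not leaderboard data.

```python
from typing import Dict, List

# Budget levels in increasing order, as named in the comparison note.
BUDGET_ORDER = ["lobotomized", "low", "medium", "high", "ultrathink"]


def best_run(runs: List[Dict]) -> Dict:
    """Return the run with the highest strict-return score.

    Ties are broken in favor of the higher reasoning budget.
    """
    return max(
        runs,
        key=lambda r: (r["returns_exact"], BUDGET_ORDER.index(r["reasoning"])),
    )


# Made-up sweep for one model; the scores are illustrative only.
sweep = [
    {"reasoning": "low", "returns_exact": 0.30},
    {"reasoning": "medium", "returns_exact": 0.35},
    {"reasoning": "high", "returns_exact": 0.35},
]
chosen = best_run(sweep)
print(chosen["reasoning"])
```

Here medium and high tie on the strict score, so the higher budget, high, is the one reported.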