TaxCalcBench (Column Tax): compute a complete 2024 Form 1040 from taxpayer data and score it against the IRS-format ground-truth return. 51 returns, 19 scored Form 1040 lines per return (969 line-level checks). Strict = exact match; lenient = within $5.
src/tax1040), finalized through the agent path

| System | Reasoning | Returns exact | Returns ±$5 | By line | By line ±$5 |
|---|---|---|---|---|---|
| cpa-copilot (gpt-5.5) | low | 62.7% | 78.4% | 91.4% | 94.8% |
| GPT-5.4 Pro | ultrathink (xhigh) | 62.75% | 72.55% | 89.99% | 93.40% |
| GPT-5.4 | ultrathink (xhigh) | 62.75% | 66.67% | 89.78% | 91.12% |
| Claude Opus 4.6 | ultrathink | 52.94% | 64.71% | 87.00% | 89.16% |
| Gemini 3.1 Pro | ultrathink | 49.02% | 68.63% | 88.54% | 92.16% |
| GPT-5 w/ Web Search | high | 41.67% | 54.41% | 83.90% | 87.64% |
| Claude Sonnet 4.6 | ultrathink | 37.25% | 56.86% | 84.21% | 88.65% |
| Gemini 3 Pro | high | 36.27% | 73.53% | 85.42% | 93.83% |
| Gemini 2.5 Pro | lobotomized | 32.35% | 51.96% | 81.22% | 86.12% |
| GPT-5 | high | 31.86% | 54.41% | 81.45% | 86.09% |
| Claude Sonnet 4.5 | ultrathink | 31.37% | 51.47% | 81.17% | 85.81% |
| Claude Haiku 4.5 | ultrathink | 13.24% | 39.22% | 73.94% | 80.93% |
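
The four metric columns can be illustrated with a minimal scoring sketch. This is not the benchmark's actual harness: the function names, the line keys, and the dollar amounts are hypothetical, and it assumes that a return counts as correct only when every scored line passes and that "within $5" is inclusive.

```python
def score_return(predicted, expected, tolerance=5):
    """Score one return; yields (exact_lines, lenient_lines, total_lines)."""
    exact = lenient = 0
    for line, truth in expected.items():
        value = predicted.get(line)
        if value is None:
            continue  # a missing line counts as wrong under both rules
        if value == truth:
            exact += 1  # strict: exact match
        if abs(value - truth) <= tolerance:
            lenient += 1  # lenient: within $5 (assumed inclusive)
    return exact, lenient, len(expected)


def summarize(results):
    """Aggregate per-return tuples into the four table metrics (as fractions)."""
    n_returns = len(results)
    total_lines = sum(total for _, _, total in results)
    return {
        # Assumption: a return is correct only if all of its scored lines pass.
        "returns_exact": sum(e == t for e, _, t in results) / n_returns,
        "returns_lenient": sum(l == t for _, l, t in results) / n_returns,
        "by_line": sum(e for e, _, _ in results) / total_lines,
        "by_line_lenient": sum(l for _, l, _ in results) / total_lines,
    }


# Toy data: three scored lines instead of 19, two returns instead of 51.
expected = {"line_11": 50000, "line_15": 35400, "line_24": 4016}
perfect = dict(expected)
near_miss = {"line_11": 50000, "line_15": 35400, "line_24": 4019}  # off by $3

results = [score_return(perfect, expected), score_return(near_miss, expected)]
metrics = summarize(results)
print(metrics)
```

In this toy case the near-miss return fails the strict return-level check but passes the lenient one, which is the same effect that separates the "Returns exact" and "Returns ±$5" columns in the table.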