TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

https://arxiv.org/abs/2507.16126
35•handfuloflight•4h ago

Comments

ofrzeta•3h ago
"Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. ... Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set."

Unsurprisingly. Sometimes I feel like I am in a madhouse. Or in an alchemist's laboratory.

anticensor•1h ago
Whereas almost every other country tries to make it easier to file taxes, even when the underlying tax schedule is complex.

Rudybega•1h ago
I wonder if you could dramatically improve these results with some relatively simple scaffolding and tool access.

If a ton of these mistakes are genuinely simple calculation errors, it seems like giving the models access to a calculator tool would help a fair bit.
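
A minimal sketch of what such a calculator tool might look like, assuming the scaffold hands the model a single function that takes an arithmetic expression as text. The name `calculate` and the set of allowed operators are hypothetical choices for illustration, not something from the paper or any particular model API. It walks the parsed expression rather than calling `eval()`, so anything that is not plain arithmetic is rejected.

```python
import ast
import operator

# Hypothetical tool: the name `calculate` and the allowed operators are
# assumptions for illustration, not part of TaxCalcBench or any model API.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept numeric literals only (bool is a subclass of int, so exclude it).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are all rejected.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))
```

A tool like this only removes arithmetic slips. It does nothing to help the model decide which amounts belong in the expression in the first place.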

Lionga•51m ago
The problem is that they do not understand what or how to calculate, not the actual act of adding or multiplying. I tried asking ChatGPT to calculate some taxes for three countries, in two of which I have already been filing taxes. For the two I know, ChatGPT gave wildly wrong numbers (not even in the right ballpark), so I knew I could not trust the numbers for the third, which was the one I was mostly interested in.

sails•48m ago
I feel like we are already there. I would imagine if you set Claude Code or Codex this task, running in the CLI, you would see a huge improvement, and that is before you start creating task specific guardrails.

I’m surprised they haven’t tried this; I’m running my own this way, in parallel against my accountant.

hodgehog11•52m ago
Am I missing something, or did they only assess this on Google and Anthropic models? If so, all I can ascertain from this is that the latest Gemini models outperformed Claude on this particular task, which should be surprising to no one. What about GPT-5? Open-weight models?

Steve Jobs and Cray-1 to be featured on 2026 American Innovations $1 coin

https://www.usmint.gov/news/press-releases/united-states-mint-releases-2026-american-innovation-o...
35•maguay•1h ago•16 comments

Journalists turn in access badges, exit Pentagon rather than agreeing new rules

https://apnews.com/article/pentagon-press-access-hegseth-trump-restrictions-5d9c2a63e4e03b91fc154...
255•pjmlp•1h ago•128 comments

Apple M5 chip

https://www.apple.com/newsroom/2025/10/apple-unleashes-m5-the-next-big-leap-in-ai-performance-for...
1113•mihau•19h ago•1165 comments

New coding models and integrations

https://ollama.com/blog/coding-models
74•meetpateltech•2h ago•30 comments

Claude Haiku 4.5

https://www.anthropic.com/news/claude-haiku-4-5
589•adocomplete•15h ago•218 comments

TurboTax’s 20-year fight to stop Americans from filing taxes for free (2019)

https://www.propublica.org/article/inside-turbotax-20-year-fight-to-stop-americans-from-filing-th...
167•lelandfe•3h ago•35 comments

Silver Snoopy Award

https://www.nasa.gov/space-flight-awareness/silver-snoopy-award/
44•LorenDB•3d ago•9 comments

Upcoming Rust language features for kernel development

https://lwn.net/Articles/1039073/
12•pykello•2h ago•2 comments

Zed is now available on Windows

https://zed.dev/blog/zed-for-windows-is-here
382•meetpateltech•16h ago•206 comments

Free applicatives, the handle pattern, and remote systems

https://exploring-better-ways.bellroy.com/free-applicatives-the-handle-pattern-and-remote-systems...
50•_jackdk_•5h ago•8 comments

Flies keep landing on North Sea oil rigs

https://theconversation.com/thousands-of-flies-keep-landing-on-north-sea-oil-rigs-then-taking-off...
47•speckx•5d ago•5 comments

Build a Superscalar 8-Bit CPU (YouTube Playlist) [video]

https://www.youtube.com/watch?v=bwjMLyBU4RU&list=PLyR4neQXqQo5nPdEiMbaEJxWiy_UuyNN4&index=1
66•lrsjng•5d ago•7 comments

Are hard drives getting better?

https://www.backblaze.com/blog/are-hard-drives-getting-better-lets-revisit-the-bathtub-curve/
202•HieronymusBosch•15h ago•95 comments

Leaving serverless led to performance improvement and a simplified architecture

https://www.unkey.com/blog/serverless-exit
378•vednig•21h ago•207 comments

What is going on with all this radioactive shrimp?

https://www.consumerreports.org/health/food-safety/radioactive-shrimp-explained-a5493175857/
58•riffraff•5d ago•13 comments

Looking at kmalloc() and the SLUB Memory Allocator (2019)

https://ruffell.nz/programming/writeups/2019/02/15/looking-at-kmalloc-and-the-slub-memory-allocat...
20•signa11•3d ago•0 comments

Who's Submitting AI-Tainted Filings in Court?

https://cyberlaw.stanford.edu/whos-submitting-ai-tainted-filings-in-court/
44•cratermoon•7h ago•21 comments

Writing an LLM from scratch, part 22 – training our LLM

https://www.gilesthomas.com/2025/10/llm-from-scratch-22-finally-training-our-llm
167•gpjt•8h ago•5 comments

Functions Are Asymmetric

https://www.elbeno.com/blog/?p=1804
18•ingve•4d ago•11 comments

A Gemma model helped discover a new potential cancer therapy pathway

https://blog.google/technology/ai/google-gemma-ai-cancer-therapy-discovery/
121•alexcos•13h ago•32 comments

Show HN: Halloy – Modern IRC client

https://github.com/squidowl/halloy
320•culinary-robot•20h ago•87 comments

IRS open sources its fact graph

https://github.com/IRS-Public/fact-graph
269•ronbenton•9h ago•64 comments

JustSketchMe – Digital Posing Tool

https://justsketch.me
6•surprisetalk•5d ago•1 comments

F5 says hackers stole undisclosed BIG-IP flaws, source code

https://www.bleepingcomputer.com/news/security/f5-says-hackers-stole-undisclosed-big-ip-flaws-sou...
175•WalterSobchak•19h ago•82 comments

Next Steps for the Caddy Project Maintainership

https://caddy.community/t/next-steps-for-the-caddy-project-maintainership/33076
173•francislavoie•11h ago•82 comments

ImapGoose

https://whynothugo.nl/journal/2025/10/15/introducing-imapgoose/
70•xarvatium•10h ago•10 comments

Pwning the Nix ecosystem

https://ptrpa.ws/nixpkgs-actions-abuse
266•SuperShibe•18h ago•53 comments

Retiring Windows 10 and Microsoft's move towards a surveillance state

https://www.scottrlarson.com/publications/publication-windows-move-towards-surveillance/
413•trinsic2•7h ago•270 comments

Recursive Language Models (RLMs)

https://alexzhang13.github.io/blog/2025/rlm/
105•talhof8•14h ago•28 comments