TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

https://arxiv.org/abs/2507.16126
70•handfuloflight•3mo ago

Comments

ofrzeta•3mo ago
"Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. ... Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set."

Unsurprisingly. Sometimes I feel like I am in a madhouse. Or in an alchemist's laboratory.

anticensor•3mo ago
Whereas almost every other country tries to make it easier to file taxes, even when the underlying tax schedule is complex.
Rudybega•3mo ago
I wonder if you could dramatically improve these results with some relatively simple scaffolding and tool access.

If a ton of these mistakes are genuinely simple calculation errors, it seems like giving the models access to a calculator tool would help a fair bit.
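
A minimal sketch of what such a calculator tool could look like, assuming the scaffolding lets the model pass an arithmetic expression as a string. The name `calculate` and the choice of supported operators are illustrative assumptions, not something taken from the paper or any particular agent framework.

```python
import ast
import operator
from decimal import Decimal

# Only plain arithmetic is allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str) -> Decimal:
    """Evaluate an arithmetic expression exactly, without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)


def _evaluate(node: ast.AST) -> Decimal:
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        # Go through str() so 0.1 becomes Decimal("0.1"), not its binary expansion.
        return Decimal(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")
```

Decimal arithmetic avoids binary floating-point drift on currency amounts, and walking the syntax tree means the tool never executes arbitrary model output.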

Lionga•3mo ago
The problem is that they do not understand what or how to calculate, not the actual act of adding or multiplying. I tried asking ChatGPT to calculate some taxes for three countries, in two of which I have already been filing taxes. For the two I know, ChatGPT gave wildly wrong numbers (not even in the right ballpark), so I knew I could not trust the numbers for the third, which was the one I was mostly interested in.
sails•3mo ago
I feel like we are already there. I would imagine if you set Claude Code or Codex this task, running in the CLI, you would see a huge improvement, and that is before you start creating task specific guardrails.

I’m surprised they haven’t tried this; I’m running my own in parallel against my accountant in this way.

michaelrbock•3mo ago
We agree, that's the thesis behind our tax development coding agent: https://www.columntax.com/blog/introducing-iris-our-ai-tax-d...
hodgehog11•3mo ago
Am I missing something, or did they only assess this on Google and Anthropic models? If so, all I can ascertain from this is that the latest Gemini models outperformed Claude on this particular task, which should be surprising to no one. What about GPT-5? Open-weight models?
topaz0•3mo ago
Somebody posted the up-to-date leaderboard up thread: https://news.ycombinator.com/item?id=45603113
stared•3mo ago
A bare model may lack a lot.

Yet a week ago I used Claude Code for my personal finances (not taxes) - I downloaded over a year’s worth of my bank account data. Since I pay for most things by card, if I buy lunch, it’s there.

With a single prompt (and about 10 minutes), it produced an analysis. It solved all the technical issues by itself (e.g., realizing it wasn’t CSV but TSV) and ran quite a few different explorations with Pandas. It was able to write an overview, find items that were likely misclassified, etc.

Everything I checked by hand was correct.

So, instead of pursuing a project to write an AI tool for personal finance, I ended up concluding: “just use Claude Code.” As a side note, I used 14 months of data due to my mistake - I wanted to analyze 2 months of data, since I didn’t believe it would handle a larger set, but I misclicked the year. The file was 350 KB.
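
For illustration, the CSV-versus-TSV step can be sketched in a few lines. This is not what Claude Code actually ran (the comment above mentions Pandas); it uses only the standard library, and the column names `category` and `amount` and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def guess_delimiter(header_line: str) -> str:
    """Pick tab or comma, whichever occurs more often in the header line."""
    return "\t" if header_line.count("\t") > header_line.count(",") else ","


def totals_by_category(text: str) -> dict[str, Decimal]:
    """Sum the 'amount' column per 'category' (both column names are assumed)."""
    delimiter = guess_delimiter(text.splitlines()[0])
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        totals[row["category"]] += Decimal(row["amount"])
    return dict(totals)


# Hypothetical bank export: saved as ".csv" but actually tab-separated.
sample = (
    "date\tcategory\tamount\n"
    "2025-01-03\tlunch\t12.50\n"
    "2025-01-04\tlunch\t9.00\n"
    "2025-01-04\trent\t900.00\n"
)
print(totals_by_category(sample))
```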

jasonjmcghee•3mo ago
I hear you, but I'd also rather someone else assume the liability if possible. (Assuming there's a company backing the model)

So until there's umbrella AI insurance...

stared•3mo ago
Exploratory data analysis is one thing - here the risk is low. If something does not work, it doesn't. Small omissions are not important.

As of now, I would not use automatic AI to make any financial decisions with direct consequences, unless the system is tested and benchmarked against accountants.

cjbarber•3mo ago
Leaderboard: https://github.com/column-tax/tax-calc-bench
throwaway13337•3mo ago
Useful.

I wonder what an average accountant would score.

I know LLMs have helped me identify many mistakes accountants have made on my behalf. Some mistakes that could have cost me a lot of money if not caught.

topaz0•3mo ago
Given that they're restricting to very simple situations I'd expect accountants to score 100%.
jgalt212•3mo ago
I'm surprised that no LLM has yet found any unresolved cycles in the US tax code.
anticensor•3mo ago
Oh you mean infinite/zero tax glitches.
jgalt212•3mo ago
Yes
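
A rough sketch of how one might look for such cycles, assuming the rules have already been reduced to a "this line depends on those lines" graph. The graph below is invented for illustration and does not correspond to any real form.

```python
from typing import Optional


def find_cycle(depends_on: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state = {node: WHITE for node in depends_on}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so the path loops back to it.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in depends_on:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical line dependencies, not taken from any actual tax form.
rules = {"line_a": ["line_b"], "line_b": ["line_c"], "line_c": ["line_a"]}
print(find_cycle(rules))
```

This is a plain depth-first search, so it finds cycles in the dependency structure only; whether a given cycle yields an exploitable result would still need a human reading of the rules.
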
i_dont_know_•3mo ago
I'm actually quite surprised.

From another article today, I discovered the IRS has a github repo with (what seems to be) XML versions of tax questions... surely some combination of LLM and structured data querying could solve this? https://github.com/IRS-Public/direct-file/tree/main

daft_pink•3mo ago
I think AI has problems with law-related tasks like taxes because there are so many words of art. Taxes are essentially just laws, and because laws, regulators and courts eventually define words in very specific, narrow ways, sometimes differently from one code section to another, AI has a lot of trouble using these very narrow definitions.

Honestly, I think humans have trouble with this as well.

michaelrbock•3mo ago
Hi, author of this paper + repo here. This dataset is particularly hard to come by, so we’re really proud to be open sourcing it.

Let me know if you have any questions, happy to discuss!

antiloper•3mo ago
> For example, in the prompt for this experiment, the model is bootstrapped with the correct Form 1040 lines and short instructions as part of its context.

Given that only short instructions are in context, I would not have expected even a frontier model to score well on this benchmark. For better results, I'd think that giving the model access to the entire tax code is required (which likely requires RAG due to its sheer size).

michaelrbock•3mo ago
We tested models with knowledge cutoffs in 2025, so we expect them to have knowledge of Tax Year 2024 forms in their weights. We also tested models with the ability to do web search to get any other forms they think necessary: https://github.com/column-tax/tax-calc-bench

That all being said, we agree, which is what we've built with our internal tax coding agent, Iris: https://www.columntax.com/blog/introducing-iris-our-ai-tax-d... (ability to get just the right Tax form context on a per-line basis to turn the tax law into code).
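
To make "tax law into code" concrete, here is a minimal sketch of a single computed line: a marginal-rate tax calculation. It is not how Iris works, which the thread does not describe in detail, and the brackets are hypothetical round numbers rather than real IRS figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# HYPOTHETICAL brackets for illustration only: (upper bound, marginal rate).
# These are not real IRS figures.
BRACKETS = [
    (Decimal("10000"), Decimal("0.10")),
    (Decimal("40000"), Decimal("0.20")),
    (None, Decimal("0.30")),  # no upper bound on the top bracket
]


def tax_on(taxable_income: Decimal) -> Decimal:
    """Apply each marginal rate only to the slice of income inside its bracket."""
    tax = Decimal("0")
    lower = Decimal("0")
    for upper, rate in BRACKETS:
        if taxable_income <= lower:
            break
        top = taxable_income if upper is None else min(taxable_income, upper)
        tax += (top - lower) * rate
        if upper is None:
            break
        lower = upper
    # Round to whole currency units, half up.
    return tax.quantize(Decimal("1"), rounding=ROUND_HALF_UP)


print(tax_on(Decimal("50000")))
```

With these made-up brackets, an income of 50000 is taxed as 10000 at 10%, 30000 at 20% and 10000 at 30%.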

anticensor•3mo ago
This topic is so American. In any other country, you wouldn't have had to consult a tax expert to prepare personal tax statements.
mrfelipppe•3mo ago
This is awesome!