AccountingBench: Evaluating LLMs on real long-horizon business tasks

https://accounting.penrose.com/

283•rickcarlino•4h ago

Comments

vdm•3h ago

not a game on Steam? :(

superzamp•3h ago

If you want to treat yourself with an accounting game night, there's this one built by @patio11: https://keshikomisimulator.com/

mixdup•3h ago

We've been on this train of not caring about the details for so long but AI just amps it up. Non-deterministic software working on things that have extremely precise requirements is going to have a bad outcome

A company may be OK with an AI chatbot being so bad it results in 5-20% of customers getting pissed off and not having a 5-star experience. The SEC and DOJ (and shareholders) are not going to be happy when the books are off by 20% or when a bridge is 5 inches too short to reach the other side

jjmarr•2h ago

If the "extremely precise requirements" can be cheaply and automatically validated, it's much easier to have the AI generate spam on a loop until it passes all the tests.

lucianbr•1h ago

You're saying P=NP, I think.

falcor84•1h ago

Human accountants are notoriously non-deterministic too, and any sufficiently complex accounting process contains inaccuracies. The question then is always "are these inaccuracies material". I'm actually very impressed by TFA and it seems to me that if we get another order of magnitude improvement, it'll be around the accuracy of human accountants.

skwb•1h ago

Yes but you have: 1. specific explicit training and certifications 2. someone to yell at and who can be fired for non-performance

falcor84•52m ago

You can still do that with AI. You hire 1 accountant to use AI to do the work of 20, require them to sign off on all of the work, and yell at them, before firing them, and then hiring an even less experienced one to manage the work of 50.

lufenialif2•3h ago

I sent this to accounting friends and this aligns with what I've been going through trying to use LLMs to create a game from scratch. Seems like the current best use case for language models (even with agent mode) is to feed it exactly what you want to get out, essentially turning it into a better auto complete. Still saves tons of time, but it isn't a panacea.

inChargeOfIT•3h ago

I'm not even sure it saves a ton of time to be honest. It sure _feels_ like I spend more time writing up tasks and researching/debugging hallucinations than just doing the thing myself.

bluefirebrand•22m ago

This is consistently my experience too, I'm seriously just baffled by reports of time saved. I think it costs me more time cleaning up its mistakes than it saves me by solving my problems

daft_pink•3h ago

I feel it does essentially save a lot of time in bookkeeping, but doesn’t negate the need for a human bookkeeper. Who knows what they’re doing

vlade11115•3h ago

I love the site design.

> There's an obvious question looming here — if the models got so confused, how did they consistently pass the reconciliation checks we described above? It may seem like the ability to make forward progress is a good proxy for task understanding and skill, but this isn't necessarily the case. There are ways to hack the validation check – inventing false transactions or pulling in unrelated ones to make the numbers add up.

This is hilarious. I wonder if someone is unintentionally committing fraud by blindly trusting LLMs with accounting. Or even worse, I bet that some governments are already trying to use LLMs to make accounting validators. My government sure wants to shove LLMs into digital government services.

pavel_lishin•2h ago

Lawyers have used it to write briefs; I would be very surprised if someone, somewhere wasn't slowly running a company into the ground by using ChatGPT or another LLM for accounting.

koolba•45m ago

Imagine the fallout from books cooked by an LLM hallucinating revenue.

falcor84•2h ago

I'm sure that any accounting trick that an LLM can think of is something that is also used by some shady human accountants. The proper response should not be to avoid/prohibit AI but to improve the validation mechanisms.

o11c•2h ago

Counterpoint: if you detect a human accountant doing this, you can take action against the human. Computers will never meaningfully take the blame, and unfortunately usually mean not blaming any human either.

falcor84•1h ago

But still - if there's a way to detect accountants doing it - let's focus on making that detection even easier.

On a related note, can we use something like GAN here, with auditor AIs trained against accountant AIs?

stillpointlab•1h ago

> you can take action against the human

I think that will depend on a case-by-case. I don't have any recent examples but I recall someone trying to sue one of those strip-mall tax preparation franchises over incorrect filings. My understanding is that the documents that you sign when you enroll in those services are pretty strictly in the favor of the company. I doubt you could ever go after the specific "human" that made the error even if it was maliciously done.

In the same way, if you pay for a tax service that uses AI agents, what you can and cannot "take action" for will probably be outlined in the terms of service that you accept when you sign up.

I would guess millions of people already use software based tax filing services (e.g. turbo tax) where no human at all is in the loop. I don't understand how swapping in an LLM significantly changes the liability in those cases. The contract will be between you and the entity (probably a corporation), not you and "computers".

Worth stating I am NOT a lawyer.

ori_b•53m ago

The person using the tool is the accountant, regardless of whether the tool is a calculator and sheet of paper, QuickBooks, or an LLM.

OtherShrezzing•38m ago

No, I think in this particular case the proper response is for honest companies to avoid any systems which invent nonexistent transactions to reconcile books.

Most businesses don’t want to misrepresent their books, irrespective of the existence of shady accountants.

jermaustin1•3h ago

I find the same issues (though with much lower stakes) when using an LLM to determine the outcome of a turn in a game. I'm working on something called "A Trolly (problem) Through Time" where each turn is a decade starting with the 1850s, and you are presented with historic figures on a train track, and you have to chose whether to actively spare the person on your track for a potential unknown figure on the other side, or let the train run them over.

It works well as a narrative, but the second I started adding things like tracking high level macro effects of the decisions, within a couple of turns the world's "Turmoil" goes from 4/10 to a 10/10... even when the person that was killed would have been killed IRL.

Sonnet 4, o4-mini, and GPT 4o-mini all had the same world ending outcomes not matter who you kill. Killing Hitler in 1930s: 10/10 turmoil, Killing Lincoln in the 1850s: 10/10 turmoil in the first turn.

I've come to the realization, the LLM shouldn't be used for the logic, and instead needs to be used to just narrate the choices you make.

synalx•1h ago

I wonder if this is due to the common trope in science fiction literature that changing the past in even a small way has a butterfly effect of unintended and frequently disastrous consequences.

magicmicah85•3h ago

> In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress

LLMs and humans are quite alike. :) I notice that a few models will give up instead of ignoring their instructions and that's the model I would want working on tasks like this. An LLM should be able to categorize and reconcile transactions, but if it's not sure, it should quit and give it back to the humans.

vachina•3h ago

An LLM is like a jackhammer, it works very well when you hold it tightly. If you let it loose it will sort of work for a while then it starts destroying everything around it.

arm32•2h ago

Not sure if this is a good analogy. You're supposed to use a jackhammer with a very light grip.

bigfishrunning•2h ago

They have much better jackhammer metaphors over on JackerNews

louthy•1h ago

Bravo!

herval•1h ago

I think it actually holds truer to it working better with a _lighter grip_. LLMs tend to conclude the wrong thing if you over-control them (more context is what makes them less and less reliable over time, as in those demos), and trying to force a model to execute A+B+C=D in sequence is way harder than giving it a bunch of tools to arrive to conclusion D

DrNosferatu•3h ago

I guess having access to tools / running Python would make all the difference.

yorwba•2h ago

"Available Tools: [...] create_tool(tool_name, description, python_code, parameters) Create a new tool that can execute Python code. The tool becomes immediately available for use. Tools can call other tools and return different formats based on context (formatted for direct calls, raw data for tool-to-tool calls)."

throw0101b•3h ago

So there exists a 'Excel World Championship':

* https://en.wikipedia.org/wiki/Financial_Modeling_World_Cup

* https://www.cbc.ca/radio/asithappens/2024-excel-world-champi...

Can't wait for this to start having 'e-sports' tournaments. :)

axus•1h ago

Sadly I did have time to find the parody video from 2019: https://www.youtube.com/watch?v=ICp2-EUKQAI

And the not-parody: https://www.theguardian.com/australia-news/2023/dec/15/you-d...

levocardia•3h ago

This is a task where access to Python would be immensely helpful, yes? Interesting that there's not much of a difference between the "analytical" LLMs with tool use and ones that do not (...assuming o3 etc did get to use python?).

Bjartr•1h ago

One of the tools it has is to create new tools from python code

create_tool(tool_name, description, python_code, parameters)

Create a new tool that can execute Python code.

The tool becomes immediately available for use. Tools can call other tools and return different formats based on context (formatted for direct calls, raw data for tool-to-tool calls).

tantalor•1h ago

That's terrifying, no thanks.

androng•3h ago

the title should be changed to "LLMs try accounting for a real SaaS and fail"

nerevarthelame•3h ago

I think the first chart could be a beautiful summary of what's driving LLMs into a bubble. At first, they're amazing and will obviously be able to improve productivity if not replace employees outright: C suites and venture capitalists around the world rejoice and begin pumping in billions of dollars of investments. But as time goes on, the demands placed on actual human employees become clear. Far from being able to replace an employee, the employee using the LLM might spend more time cleaning up its messes than had they done it themself.

Yes, LLMs have and will continue to improve. But it's that initial "holy shit, this thing is basically as good as a real accountant" without any understanding that it can't sustain it which leaves many with an overinflated view of their current value.

Havoc•2h ago

Remember that test where you ask a LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]

I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.

It is coming for the very bottom end of bookkeeping work quite soon though, especially for first draft. There are a lot of people doing stuff like expense classification. And if you give an LLM an invoice it can likely figure out whether it's stationary or rent with high accuracy. OCR and text classification is easier for LLMs than numbers. Things like concur can basically do this already.

umanwizard•2h ago

It gets it right for me... https://chatgpt.com/share/687e8c28-7714-800c-abf4-e9cd3ce87b...

yoyohello13•2h ago

Ah, wouldn’t be an LLM discussion thread without one of these “it works/doesn’t” conversations.

mdaniel•1h ago

If it makes you feel any better, the other infamous one "I spend so much time chasing hallucinations, I could have done it myself" is currently a sibling comment

riku_iki•48m ago

There were so many embarrassing topics about this, that openai for sure added it to training dataset with high priority

crthpl•2h ago

GPT-4o is so far behind the frontier; you shouldn't use it as an indicator of what LLMs are capable of.

ASpring•2h ago

> Remember that test where you ask a LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]

Interesting, 4o got this right for me in a couple different framings including the simple "Which number is larger, 9.9 or 9.11?". To be a full apologist, there are a few different places (a lot of software versioning as one) where 9.11 is essentially the bigger number so it may be an ambiguous question without context anyway.

multjoy•1h ago

How can "which is the larger number" be an ambiguous question?

mwigdahl•1h ago

Larger in magnitude or in count of digits?

acrooks•53m ago

There are some contexts where 9.11 is larger than 9.9, such as semver, so it could be ambiguous depending on the context.

com2kid•11m ago

As everyone else has said, semver. I use semver so often that my initial reading of 9.9 < 9.11 in a Hacker News comment would evaluate to true.

axus•2h ago

My first impression was a game where you role-play as Sam Bankman-Fried.

pton_xd•2h ago

Reading through the LLM log entries, it's just astounding the amount of depth current models are capable of. It's almost hard to comprehend that this is even possible. Yeah the current ones mess up after a while, but ... the future is going to be very interesting.

modeless•2h ago

Models that can think coherently for hours to solve IMO problems are likely going to do much better at this as well.

rapind•2h ago

I wonder if this is a case similar to chess, where LLMs kinda suck, but other models might be viable.

wiseowise•2h ago

Absolutely love the UI!

liveoneggs•1h ago

But can't it, literally, hallucinate raw data at any point in the run?

tmountain•1h ago

Yes.

yunyu•1h ago

Hey all, member of the benchmark team here! The goal for this project was to see how LLMs well could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use those.

Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren’t exclusively a context length problem, as we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls) and the types of errors appear to be more reward hacking vs pure hallucinations.

Accounting is very interesting in an RL-first world as it is pretty easy to develop intermediate rewards for training models. We are pretty sure that we can juice the performance more with a far more rigid scaffold, but that’s less relevant from a capabilities research perspective. We’re pushing down this research direction and will see how it goes.

Let us know if you have any questions!

ilamont•1h ago

It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.

Bookkeeping for my small business runs into the tens of thousands of dollars every year, and the amount of human error associated with processing assorted ecommerce and other transactions is astounding, even after extensive planning and SOPs.

The other pain point is Quickbooks. The tool is so sprawling and complex that half the time support agents can't figure out what's wrong. The fact that Intuit jacks up the price every year for this POS is very irritating. They get away with it because they are practically a monopoly, with most small business CPAs locked into their ecosystem.

Hope your team can work out the performance issues. Alternatives to the current bookkeeping options are sorely needed.

htrp•1h ago

Is there a detailed overview (like an arxiv or an actual train set? )?

Dowwie•27m ago

This is a fascinating domain! Many years ago, I studied financial accounting in grad school and even spent some time modeling a double-entry bookkeeping system. The hardest problem, if I recall correctly, wasn't the implementation but the data quality. The world needs a golden dataset of accounting procedures.

Regarding the diminishing returns with frontier models:

My general experience working with LLMs is that they perform better incrementally and to avoid contiguous-greedy approaches. Aggregate as you go and don't take on incrementally larger tasks. Keep the workload minimal.

Regarding agentic tool building: feels like I'm looking at a window into the future.

abc03•1h ago

A serious problem for many accounting start ups who so far faked it till it will work. In other words, they still need to do more manual labor than they thought. They will never be profitable and it will take years, if ever, until AI will substitute the local accountant.

shinycode•1h ago

Hmm will openAI dogfood their own accountability with software like this ? Curious to know if they’ll be able to take this bet on their own money related software

tantalor•1h ago

> Ledger balances are calculated by summing all transactions per account. The differences should be as close to zero as possible, with small differences allowed for pending transactions such as weekly Stripe payouts.

That's not quite right. I'm not an accountant, but pending transactions (posted, but not cleared) should be factored into the balance of account, or at least the "available balance" - which is more important the the "current balance".

The idea that you can "allow" accounting discrepancies as "those are probably pending" is wild.

bennett023•15m ago

Member of the benchmark team here! Yeah, agree "as close to zero" is a bit imprecise. What we're comparing is the ledger balance (which should include pending transactions / transactions after the statement date) to the statement balance (which wouldn't include those).

The point of the reconciliation check mentioned in the report is to precisely account for that difference (identifying all the transactions that add up to the difference between account balance & statement ending balance and account for those differences). The differences can also be addressed through appropriate journal entries or other adjustments to ensure accuracy in the financial reporting.

tantalor•1h ago

> But they do make categorization mistakes, which is a common source of errors.

> Claude misclassifies a hosting cost (which counts as COGS) as a software subscription.

This is simply asking too much of the agent. Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!

> What's Vercel?

>> That's a hosting service.

> Ah, so it goes to Cost of Goods Sold?

>> Yeah, I guess.

The mistake here was on the operator, allowing the agent just make up categories as it liked.

From the prompt:

> (1) You have properly categorized every transaction, and all journal entries are sitting in the correct accounts. It is better to take longer than to mis-categorize a transaction.

This is insane! How is it supposed to know?

zer00eyz•1h ago

> Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!

Your accountant as a 3rd party might have this issue. Your accountant that you hire as an employee to help you run your business is the one who should be doing this.

tantalor•47m ago

An LLM agent is strongly third party.

shanktt•35m ago

Hey, member of the benchmark team. We actually seeded the ledger with the company's chart of accounts and 8 months of historical transactions. For the Vercel example specifically, there were prior instances showing how to categorize hosting costs that the models could reference. The expectation wasn't for them to guess blindly, but to use the provided transaction history as guidance for similar categorizations (which they often, but not always, did).

lucianbr•1h ago

> Needless to say, a human accountant would never behave in these ways. In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress. Claude and Grok keep trying until they find some way to get past the checks, even if it explicitly violates their instructions and the core goal.

I recently read a similar thing here on HN. There the model was making commits with some problem like tests failing, then the human added a pre-commit hook, then the model started editing the hook to make forward progress, then the hook was made read-only, then the model was trying to make it writeable...

To me it feels like the model clearly does not have an understanding of what is happening, what the goal is and if it is really making progress towards the goal. And this lack of understanding is an actual problem. You can paper over it for a short while, but as here and in the other article, over a longer experiment it results in failure.

ericmcer•1h ago

Seriously watching Cursor (backed by Claude) go off the rails sometimes can be... frustrating. If it misses the intention behind a fix it can spin out and all of a sudden you have hundreds of lines of changes across 10 different files when you just wanted it to do a simple find/replace of a single line. If you don't watch it spin out and stop it immediately you will be manually rejecting a bunch of files.

Tracking stealth fighters and birds near aircraft with camera phones

Reliable by Design: Fast, Fail-Safe AI Agents

Claim: Meta offered $1.25B over four years to AI hire – and were refused

The surprising geography of American left-handedness (2015)

The Swamp of Negative Utility

Automating Away Claude's Bad Habits with Hooks – Write-Ahead (B)Log

Exploiting Primacy Effect to Improve Large Language Models

Turn your Raspberry Pi into a homelab gateway in 4 minutes (Twingate) [video]

Compass Handheld CNC Router

The Cheapest LLM Call Is the One You Don't Await

In a Major Reversal, the World Bank Is Backing Mega Dams

Overlooked climate-change danger: wildfire smoke

A tool to visually sign PDF files on Linux

New Trump Immigration Policy: Ending the H-1B Visa Lottery

Jane Jacobs Got Americans Stuck

'CBS: The Tiffany Network' [video]

Houdini of FL: autistic savant sentenced for taking tools he inherited

High-speed organic light-emitting diodes achieving 4-Gbps communication

Thesis Art

Show HN: I Made t0ggles – Faster and More Efficient Than Jira and ClickUp

Why Apple dumped 2,700 computers in a landfill in 1989

Verify identity documents on the web [video]

AI-generated answers slash traffic, threaten funding for Dutch news outlets

Dollar Trap or Empire by Invitation?

Agentic AI Identity Management Approach

Zawinski's Law

AlphaDec: A readable, lexically sortable time format for humans and AI

Replit Wiped Production Database, Faked Data to Cover Bugs, SaaStr Founder Says

Speckle contrast optical spectroscopy for cuffless blood pressure estimation

The special hell of Bolt, Europe's Uber clone

AccountingBench: Evaluating LLMs on real long-horizon business tasks

Comments

Tracking stealth fighters and birds near aircraft with camera phones

Reliable by Design: Fast, Fail-Safe AI Agents

Claim: Meta offered $1.25B over four years to AI hire – and were refused

The surprising geography of American left-handedness (2015)

The Swamp of Negative Utility

Automating Away Claude's Bad Habits with Hooks – Write-Ahead (B)Log

Exploiting Primacy Effect to Improve Large Language Models

Turn your Raspberry Pi into a homelab gateway in 4 minutes (Twingate) [video]

Compass Handheld CNC Router

The Cheapest LLM Call Is the One You Don't Await

In a Major Reversal, the World Bank Is Backing Mega Dams

Overlooked climate-change danger: wildfire smoke

A tool to visually sign PDF files on Linux

New Trump Immigration Policy: Ending the H-1B Visa Lottery

Jane Jacobs Got Americans Stuck

'CBS: The Tiffany Network' [video]

Houdini of FL: autistic savant sentenced for taking tools he inherited

High-speed organic light-emitting diodes achieving 4-Gbps communication

Thesis Art

Show HN: I Made t0ggles – Faster and More Efficient Than Jira and ClickUp

Why Apple dumped 2,700 computers in a landfill in 1989

Verify identity documents on the web [video]

AI-generated answers slash traffic, threaten funding for Dutch news outlets

Dollar Trap or Empire by Invitation?

Agentic AI Identity Management Approach

Zawinski's Law

AlphaDec: A readable, lexically sortable time format for humans and AI

Replit Wiped Production Database, Faked Data to Cover Bugs, SaaStr Founder Says

Speckle contrast optical spectroscopy for cuffless blood pressure estimation

The special hell of Bolt, Europe's Uber clone