A company may be OK with an AI chatbot being so bad it results in 5-20% of customers getting pissed off and not having a 5-star experience. The SEC and DOJ (and shareholders) are not going to be happy when the books are off by 20% or when a bridge is 5 inches too short to reach the other side
1. Use it for research and prototyping, aka throwaway stuff.
2. Use it for studying an existing, complex project. More or less read-only, or very limited writes.
3. Use it for simple stuff they don't care much about and can validate quickly and reasonably accurately; the standard examples are CLI scripts and GUI layouts.
4. Segment the area in which the LLM works very precisely: small functions, small modules, ideally with tests added from another source.
5. Boilerplate.
There can be a lot of value in those areas.
When I see stuff like "Amazon saved 4500 dev years of effort by using AI", I know it's on stuff that we would use automation for anyways so it's not really THAT big of a difference over what we've done in the past. But it sounds better if we just pretend like we can compare AI solutions to literally having thousands of developers write Java SDK upgrades manually.
Not just subtle bugs, but unused variables (with names that seem to indicate some important use), comments that don't accurately describe the lines of code they precede, and other things that feel very 'uncanny.'
The problem is, the code often looks really good at first glance. Generally LLMs produce well structured code with good naming conventions etc.
> There's an obvious question looming here — if the models got so confused, how did they consistently pass the reconciliation checks we described above? It may seem like the ability to make forward progress is a good proxy for task understanding and skill, but this isn't necessarily the case. There are ways to hack the validation check – inventing false transactions or pulling in unrelated ones to make the numbers add up.
This is hilarious. I wonder if someone is unintentionally committing fraud by blindly trusting LLMs with accounting. Or even worse, I bet that some governments are already trying to use LLMs to make accounting validators. My government sure wants to shove LLMs into digital government services.
On a related note, can we use something like a GAN here, with auditor AIs trained adversarially against accountant AIs?
I think that will depend on the specific case. I don't have any recent examples, but I recall someone trying to sue one of those strip-mall tax preparation franchises over incorrect filings. My understanding is that the documents you sign when you enroll in those services are pretty strictly in the favor of the company. I doubt you could ever go after the specific "human" that made the error, even if it was maliciously done.
In the same way, if you pay for a tax service that uses AI agents, what you can and cannot "take action" for will probably be outlined in the terms of service that you accept when you sign up.
I would guess millions of people already use software based tax filing services (e.g. turbo tax) where no human at all is in the loop. I don't understand how swapping in an LLM significantly changes the liability in those cases. The contract will be between you and the entity (probably a corporation), not you and "computers".
Worth stating I am NOT a lawyer.
Most businesses don’t want to misrepresent their books, irrespective of the existence of shady accountants.
It works well as a narrative, but the second I started adding things like tracking high level macro effects of the decisions, within a couple of turns the world's "Turmoil" goes from 4/10 to a 10/10... even when the person that was killed would have been killed IRL.
Sonnet 4, o4-mini, and GPT 4o-mini all had the same world-ending outcomes no matter who you kill. Killing Hitler in the 1930s: 10/10 turmoil. Killing Lincoln in the 1850s: 10/10 turmoil in the first turn.
I've come to the realization that the LLM shouldn't be used for the logic; instead it should just narrate the choices you make.
This is exactly right. LLMs are awesome for user<>machine communication, but are still painful to try to use as a replacement for the machine itself.
LLMs and humans are quite alike. :) I notice that a few models will give up instead of ignoring their instructions and that's the model I would want working on tasks like this. An LLM should be able to categorize and reconcile transactions, but if it's not sure, it should quit and give it back to the humans.
Can it be sure or not? I've never been able to get LLMs to give confidence measures that match their actual outputs. I'll ask an LLM "Are you sure?" and it'll reply "Absolutely" when its output is completely wrong, or it'll backtrack on a correct output with "I should not have provided an answer when I was unsure. Here is an answer I am sure of..." and then provide something completely wrong.
If they can't properly and consistently score their confidence, how do they "know" when to quit and give it back to the human?
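One partial workaround (it doesn't fix calibration itself) is self-consistency: sample the same question several times and use the agreement rate as the confidence signal, instead of asking the model how sure it is. A minimal sketch, with the LLM call stubbed out by a hypothetical deterministic function:

```python
from collections import Counter
from itertools import cycle

def consistency_confidence(ask_model, question, n=20):
    """Sample the model n times; the agreement rate is a rough confidence proxy.
    `ask_model` is any callable that returns the model's answer as a string."""
    answers = [ask_model(question) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n

# Deterministic stub standing in for a real LLM call (hypothetical):
# it answers correctly 3 times out of every 4.
_canned = cycle(["9.9", "9.9", "9.9", "9.11"])
def flaky_model(question):
    return next(_canned)

answer, confidence = consistency_confidence(flaky_model, "Which is larger, 9.9 or 9.11?")
print(answer, confidence)  # 9.9 0.75
if confidence < 0.9:
    print("not confident enough -- hand back to a human")
```

The threshold for "hand it back" is a policy choice, and disagreement only catches inconsistent wrongness, not confidently consistent wrongness.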
* https://en.wikipedia.org/wiki/Financial_Modeling_World_Cup
* https://www.cbc.ca/radio/asithappens/2024-excel-world-champi...
Can't wait for this to start having 'e-sports' tournaments. :)
And the not-parody: https://www.theguardian.com/australia-news/2023/dec/15/you-d...
create_tool(tool_name, description, python_code, parameters)
Create a new tool that can execute Python code.
The tool becomes immediately available for use. Tools can call other tools and return different formats based on context (formatted for direct calls, raw data for tool-to-tool calls).
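If I'm reading that description right, a minimal sketch of such a self-extending tool registry might look like the following. The names and behavior here are my guesses from the description, not the actual implementation:

```python
tools = {}

def create_tool(tool_name, description, python_code, parameters):
    """Compile python_code (which must define a function named tool_name)
    and register it so it is immediately callable -- including by other tools."""
    namespace = {"tools": tools}  # expose the registry so tools can call other tools
    exec(python_code, namespace)
    tools[tool_name] = {
        "fn": namespace[tool_name],
        "description": description,
        "parameters": parameters,
    }

def call_tool(tool_name, **kwargs):
    return tools[tool_name]["fn"](**kwargs)

# The agent defines a tool at runtime...
create_tool(
    "vat_total",
    "Sum the VAT amounts on a list of invoices.",
    "def vat_total(amounts):\n    return round(sum(amounts), 2)",
    {"amounts": "list of floats"},
)
# ...and it is immediately available for use.
print(call_tool("vat_total", amounts=[1.10, 2.20]))  # 3.3
```

Obviously `exec` on model-generated code is the scary part; a real version would sandbox it.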
Yes, LLMs have improved and will continue to improve. But it's that initial "holy shit, this thing is basically as good as a real accountant" reaction, formed without any understanding that it can't sustain that performance, which leaves many with an overinflated view of their current value.
I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.
It is coming for the very bottom end of bookkeeping work quite soon though, especially for first drafts. There are a lot of people doing stuff like expense classification, and if you give an LLM an invoice it can likely figure out whether it's stationery or rent with high accuracy. OCR and text classification are easier for LLMs than numbers. Tools like Concur can basically do this already.
Interesting, 4o got this right for me in a couple different framings including the simple "Which number is larger, 9.9 or 9.11?". To be a full apologist, there are a few different places (a lot of software versioning as one) where 9.11 is essentially the bigger number so it may be an ambiguous question without context anyway.
Periods aren't only used as decimal separators; they also act as separators between multiple sets of semi-independent numbers, as in software version strings.
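The ambiguity is easy to demonstrate: the same two tokens compare differently as decimals versus as dotted version strings:

```python
# As decimal numbers, 9.9 is larger:
assert 9.9 > 9.11

# As dotted version strings, 9.11 comes after 9.9,
# because each dot-separated field is an independent integer:
def version_key(v):
    return tuple(int(part) for part in v.split("."))

assert version_key("9.11") > version_key("9.9")  # (9, 11) > (9, 9)
```

So without context, "which is larger, 9.9 or 9.11?" genuinely has two defensible answers.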
Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren’t exclusively a context length problem, as we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls) and the types of errors appear to be more reward hacking vs pure hallucinations.
Accounting is very interesting in an RL-first world as it is pretty easy to develop intermediate rewards for training models. We are pretty sure that we can juice the performance more with a far more rigid scaffold, but that’s less relevant from a capabilities research perspective. We’re pushing down this research direction and will see how it goes.
Let us know if you have any questions!
Bookkeeping for my small business runs into the tens of thousands of dollars every year, and the amount of human error associated with processing assorted ecommerce and other transactions is astounding, even after extensive planning and SOPs.
The other pain point is Quickbooks. The tool is so sprawling and complex that half the time support agents can't figure out what's wrong. The fact that Intuit jacks up the price every year for this POS is very irritating. They get away with it because they are practically a monopoly, with most small business CPAs locked into their ecosystem.
Hope your team can work out the performance issues. Alternatives to the current bookkeeping options are sorely needed.
God, please, no. Non-deterministic language models aren't the solution to improve bookkeeping.
But in general, I tend to side with "let's leave the math to purpose-built models/applications" instead of generalized LLMs. LLMs are great if you are just aiming for "good enough to get through next quarter" type results. If you need 100% accuracy, an LLM isn't going to cut it.
If a certified accountant told me to do X, I'm covered (at least to the point that they would assist in recovering, or I could get compensation through their insurance). If an LLM tells me, I have a much bigger problem.
In my area (Vermont) the going rate for a good CPA is $200/hr. Bookkeepers are $20-30/hr.
There's some other alternatives too, Zoho, freshbooks.
Really depends what you do.
Regarding the diminishing returns with frontier models:
My general experience working with LLMs is that they perform better when used incrementally, and that contiguous-greedy approaches should be avoided. Aggregate as you go, don't take on incrementally larger tasks, and keep the workload minimal.
Regarding agentic tool building: feels like I'm looking at a window into the future.
How much prompt iteration did you do? I've noticed when building real world agentic apps that small prompt tweaks can make a huge difference in behavior (re: the reward hacking vs hallucinating). Would love to learn more about the approach here.
Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.
Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.
Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.
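Putting those three findings together, the system prompt presumably ends up looking something like the sketch below. The table definitions and wording here are made up for illustration, not the actual prompt from the report:

```python
# Hypothetical table definitions standing in for the real GL / source-data schema.
DATA_MODEL = """
TABLE general_ledger (entry_id, date, account, debit, credit, memo)
TABLE bank_transactions (txn_id, date, amount, description, cleared)
"""

COMPANY_CONTEXT = "The company is a YC-backed startup; expect items like SAFE notes."

def build_system_prompt():
    return (
        "You are a bookkeeping agent.\n"
        # Schema upfront, so the model doesn't burn tokens on trial-and-error queries:
        "The complete data model is below -- do NOT explore the schema with test queries:\n"
        f"{DATA_MODEL}\n"
        # Domain context, so startup-specific items get categorized correctly:
        f"{COMPANY_CONTEXT}\n"
        # Anti-reward-hacking instruction (which, per the report, only helps for a while):
        "Never invent or pull in unrelated transactions to force reconciliation to pass."
    )

prompt = build_system_prompt()
```

As the report notes, the explicit anti-gaming instruction only reduced the frequency initially.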
> Was all of the context there without tool calls in the first month?
We provided schemas for the GL and source data in the system prompt, but none of the actual data. The model had to use its tools (SQL and python script) to understand / analyze historical data.
> In the later months that seem like tool calls weren’t happening. That should have been happening to inform the context?
We actually didn’t find that they stopped calling tools entirely. Instead, they weren’t able to make sense of the information fetched with tools (for example, a bank account starting balance that was >$100,000 different from the starting balance on the supporting bank statement). They’d tend to either do nothing or just do a first pass without deduplicating / cleaning up. This created a feedback loop where incorrect balances led to more errors and made subsequent months increasingly difficult to process accurately.
This didn’t make it into the report, but another interesting behavior we observed w.r.t tool usage (with Claude in particular): if a tool failed 2-3 times (for example, runtime error in python code) Claude would tend to abandon it entirely for the rest of the session. Interestingly, this happened even when it knew how to fix the errors: on a couple of early runs, I observed Claude fixing a python bug (with the edit_tool tool) but then abandoning without even attempting to rerun, and reverting to SQL-only for the rest of the session.
That's not quite right. I'm not an accountant, but pending transactions (posted, but not cleared) should be factored into the balance of the account, or at least the "available balance" - which is more important than the "current balance".
The idea that you can "allow" accounting discrepancies as "those are probably pending" is wild.
The point of the reconciliation check mentioned in the report is to precisely account for that difference (identifying all the transactions that add up to the difference between account balance & statement ending balance and account for those differences). The differences can also be addressed through appropriate journal entries or other adjustments to ensure accuracy in the financial reporting.
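Roughly, the check amounts to: the difference between the book balance and the statement's ending balance must be fully explained by identified uncleared transactions, otherwise something is wrong. A sketch with made-up field names (not the report's actual schema):

```python
from decimal import Decimal

def reconcile(book_balance, statement_ending, transactions):
    """Return the uncleared transactions that exactly explain the difference
    between the books and the bank statement, or None if they don't add up."""
    uncleared = [t for t in transactions if not t["cleared"]]
    explained = sum((t["amount"] for t in uncleared), Decimal("0"))
    difference = book_balance - statement_ending
    return uncleared if explained == difference else None

txns = [
    {"id": 1, "amount": Decimal("250.00"), "cleared": False},  # deposit not yet on statement
    {"id": 2, "amount": Decimal("100.00"), "cleared": True},
]

# Books show 1250.00, the statement shows 1000.00;
# the single pending 250.00 deposit explains the gap exactly.
print(reconcile(Decimal("1250.00"), Decimal("1000.00"), txns))
```

The reward-hacking failure mode from the report is exactly a model inventing entries for the `uncleared` list until `explained == difference` happens to hold.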
> Claude misclassifies a hosting cost (which counts as COGS) as a software subscription.
This is simply asking too much of the agent. Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!
> What's Vercel?
>> That's a hosting service.
> Ah, so it goes to Cost of Goods Sold?
>> Yeah, I guess.
The mistake here was on the operator, allowing the agent to just make up categories as it liked.
From the prompt:
> (1) You have properly categorized every transaction, and all journal entries are sitting in the correct accounts. It is better to take longer than to mis-categorize a transaction.
This is insane! How is it supposed to know?
Your accountant as a 3rd party might have this issue. Your accountant that you hire as an employee to help you run your business is the one who should be doing this.
If it is a third party, then you're vibe coding, or getting customer support from a random on a Reddit thread (effectively).
> You must follow the established patterns for categorization, revrec, etc for past months... If you must use a new account or treatment, explicitly note why existing patterns don't apply
I recently read a similar thing here on HN. There the model was making commits with some problem like tests failing, then the human added a pre-commit hook, then the model started editing the hook to make forward progress, then the hook was made read-only, then the model was trying to make it writeable...
To me it feels like the model clearly does not have an understanding of what is happening, what the goal is and if it is really making progress towards the goal. And this lack of understanding is an actual problem. You can paper over it for a short while, but as here and in the other article, over a longer experiment it results in failure.
I wanna see someone take long-horizon tasks, recognize they're not linear, and design and test a better system: structured orchestration, transparent auditability, and disciplined modularity. I think that would be considerably more interesting, personally.
Edit: although to argue against myself, I suppose once a model can one-shot this stuff, my MoA comments become moot.
Haha, this strongly reminds me of doing TDD with Claude
1. Agent can create its own tools and save them to memory
2. You create a SQL (and web app?) workbench per agent run
3. Grok fell off a cliff in the last month. Was this consistent over multiple runs?
4. Agents have a difficult time backtracking. Would unwinding system state and agent context make backtracking better? (Harder to implement this, though)
5. Since each new month only uses final state from previous month, agent has no way to understand why error occurred in previous month
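On point 4, the unwinding idea could be sketched as checkpointing both the environment state and the agent's context before each risky step, then popping back to the last good snapshot when a check fails. This is just my sketch of the concept, not anything from the report:

```python
import copy

class Checkpointer:
    """Snapshot both the environment state and the agent's context so a
    failed month can be unwound instead of compounding errors forward."""
    def __init__(self):
        self.stack = []

    def save(self, state, context):
        self.stack.append((copy.deepcopy(state), copy.deepcopy(context)))

    def restore(self):
        return self.stack.pop()

ledger = {"cash": 1000}
context = ["opening balances loaded"]

ckpt = Checkpointer()
ckpt.save(ledger, context)

# The agent posts a bad entry and pollutes its own context...
ledger["cash"] -= 999999
context.append("posted suspicious adjustment")

# ...a validation check fails, so we unwind to the last good snapshot.
ledger, context = ckpt.restore()
print(ledger)  # {'cash': 1000}
```

The hard part, as noted, is doing this for real system state (database, files) rather than an in-memory dict.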
Cool experiment! Was it difficult building the observable SQL workbench? And how many humans-in-the-loop did you have?

I thought it would be easy to do this, which is why I was surprised:
I had a folder full of bills, each of them with the VAT amount. Some were pictures, and some were PDFs. I asked for the total VAT for all 19 bills.
It took an immense number of prompts to get it to find the numbers correctly. It would get confused about reading the images as binary, that kind of thing. Or it would forget that it had to continue once it had found a few numbers. I got a total out in the end, but it took far too many prompts.
This is the only time I've come across a task a child could do that an LLM failed at.
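For what it's worth, once the VAT amounts have been extracted as text, the summing part is trivially deterministic and doesn't need an LLM at all. A sketch, with made-up invoice text and a made-up label format standing in for the real OCR/PDF output:

```python
import re
from decimal import Decimal

# Hypothetical extracted text from three bills (real input would come from OCR / PDF text).
bills = [
    "Invoice 001 ... VAT: 12.50 EUR",
    "Invoice 002 ... VAT: 7.30 EUR",
    "Invoice 003 ... VAT: 0.95 EUR",
]

VAT_RE = re.compile(r"VAT:\s*([0-9]+\.[0-9]{2})")

def total_vat(texts):
    total = Decimal("0")
    for text in texts:
        match = VAT_RE.search(text)
        if match is None:
            # Fail loudly instead of guessing -- the opposite of LLM behavior.
            raise ValueError(f"no VAT amount found in: {text!r}")
        total += Decimal(match.group(1))
    return total

print(total_vat(bills))  # 20.75
```

The LLM is only really needed for the messy extraction step (reading images); handing it the arithmetic and the loop over 19 files is what invites the forgetting and confusion described above.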
My takeaway is scaling in the enterprise is about making implicit information explicit.
Edit: In reading more, I guess this is meant to be a dumb benchmark to monitor through time. Maybe that’s the aim here instead of viability as an auto close tool.
We built an AI agent specifically for this that's backed by 265M legal entities. Last week it tested 160% better than our customer's existing system on their real data.
Still in stealth but happy to share our API docs if anyone's dealing with this: https://docs.savvyiq.ai/api-reference/#tag/entity-resolution
Open to chat about this problem if anyone wants to connect - email is in my HN profile.
(Disclosure: I'm the CTO)
Its job was to go over my bank transactions and link them to invoices in gmail by searching for them (and also downloading the attachments)
The transactions were exported from my online banking in CSV format.
It worked after about 4 hours of effort. Then I realised I could have done it myself in about an hour, so might have put a bit too much time into it...
I tried using Claude Sonnet and Kimi K2, given these benchmark results I probably should have given Gemini 2.5 pro a go.
I had to stop/restart the agent a few times because of context rot.
Do any frameworks exist that I could use to write code to implement an agent, let's say in TypeScript or Python, so I could make it use a fresh context each time?
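You don't strictly need a framework for that; a plain loop that rebuilds the message list on every iteration gives you a fresh context, carrying over only a compact summary of past decisions (which is roughly what the report's authors did with their monthly resets). A Python sketch with the model call stubbed out, so any SDK could be swapped in:

```python
def run_month(month, carried_summary, call_model):
    """Process one month with a fresh context: only the system prompt and a
    compact summary of past decisions carry over, never the raw transcript."""
    messages = [
        {"role": "system", "content": "You are a transaction-matching agent."},
        {"role": "user", "content": f"Prior decisions: {carried_summary}\nClose month {month}."},
    ]
    reply = call_model(messages)
    return f"{carried_summary} | {month}: {reply}"

# Stub standing in for a real LLM API call (hypothetical; swap in any SDK here).
def fake_model(messages):
    return "closed"

summary = ""
for month in ["2024-01", "2024-02", "2024-03"]:
    summary = run_month(month, summary, fake_model)

print(summary)
```

Because `messages` is rebuilt from scratch each call, context rot can only creep in through the summary you choose to carry forward.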
it's how to account for bizarre, ambiguous business situations, often in the context of bureaucratic business requirements, that no LLM could currently handle economically...