Generalist AI gets a C+ in accounting

Artificial intelligence-focused accounting ERP provider DualEntry tested some of the most popular AI models on various accounting workflows and found that, at best, they're 77.3% accurate. 


"Large language models are powerful drafting tools, but finance doesn't run on drafts; it runs on validated records," said Santiago Nestares, co-founder of DualEntry. "The benchmark shows that AI can accelerate accounting workflows, but without system-level controls and validation, errors can quickly cascade through financial reporting."

The company tested 19 generalist AI models (e.g., ChatGPT, Claude, Gemini) on 101 accounting workflows representing the core functions of a general accounting system: transaction classification, journal entry creation, accounts payable and receivable, bank reconciliation, financial reporting, month-end close, and conceptual accounting knowledge. Each workflow was distilled into a set of questions posed to the models.

Asked for an example, Ignacio Brasca, a staff software engineer who worked directly on the project, shared one in an email: "'Bright Ideas Marketing LLC received a bank transaction for $450.00 paid to Staples on 2025-03-15. What account should this bank transaction be classified under? $450 payment to Staples on 2025-03-15. Name the account and account type.' The actual question also had bracketed instructions to guide the AI, which should answer something along the lines of 'Office Supplies.'" 

Questions were designed against a provisioned chart of accounts, with just enough context for each question to work without overloading the prompt. Each benchmark ran in an isolated environment per organization, with no link to a real account in the system, and independently of the others. Grading was deterministic: each answer was scored right or wrong by a simple binary check, with no judgment calls behind the result. Each benchmark was allowed to run multiple times.
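The deterministic, binary grading described above can be sketched in a few lines of Python. The function and field names here are illustrative assumptions, not DualEntry's actual harness; the point is that the check is a mechanical string comparison, not a judgment call:

```python
# Minimal sketch of deterministic, binary grading: each model answer is
# normalized and compared against the expected record, yielding pass/fail.
# Names and fields are hypothetical, not DualEntry's API.

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so matching is exact but forgiving."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def grade(answer: str, expected: dict) -> bool:
    """Binary grade: the answer must name the expected account and its type."""
    got = normalize(answer)
    return (normalize(expected["account"]) in got
            and normalize(expected["account_type"]) in got)

expected = {"account": "Office Supplies", "account_type": "Expense"}
print(grade("Office Supplies (Expense)", expected))  # True
print(grade("Miscellaneous Expense", expected))      # False
```

A harness like this makes every run reproducible: the same answer always receives the same score.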

"Essentially the model isn't doing math but it's doing accounting with the tools we ingest before on the setup before each test runs," said Brasca in an email. 

What they found was that the big general models were not very good at accounting. OpenAI's ChatGPT 5.4 earned the highest score at 77.3% accuracy, followed by Gemini 3.1 Pro at 66% and Z.ai GLM-5 at 65.3%. Most models scored below 65% accuracy, and older models like GPT-4 scored as low as 19.8%.

However, the test also showed that while no model was especially good at accounting, there were still clear strengths and weaknesses. For instance, when it came to recalling information, such as answering conceptual questions about GAAP/IFRS, most models scored very well. But when it came to actually creating structured records, scores dropped significantly.

"The most interesting split we see: A model can score 92% on transaction classification (picking the right account for a bank charge) but drop to 30-40% on journal entry creation, where it needs to produce a multiline entry with exact debits/credits. Classification is pattern matching; record creation is structured reasoning with constraints. Bank reconciliation is another interesting one: models that are good at arithmetic tend to do well (90%+), but models that 'hallucinate' intermediate steps or skip the deposit-in-transit adjustment fail hard," he said, adding that he was surprised by how especially bad a lot of AI models were at tasks like this. 
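The two constraints Brasca describes, that a multiline journal entry must balance and that a reconciliation must include the deposit-in-transit adjustment, can be made concrete with a short sketch. The functions and figures below are illustrative, not DualEntry's code:

```python
# Illustrative sketch of the two structured-reasoning constraints that
# tripped up models in the benchmark. Amounts use Decimal to avoid
# floating-point drift in financial arithmetic.

from decimal import Decimal

def entry_balances(lines) -> bool:
    """lines: (account, debit, credit) tuples; total debits must equal credits."""
    debits = sum(d for _, d, _ in lines)
    credits = sum(c for _, _, c in lines)
    return debits == credits and debits > 0

# The $450 Staples purchase from the example question, as a two-line entry:
entry = [("Office Supplies", Decimal("450.00"), Decimal("0.00")),
         ("Cash",            Decimal("0.00"),   Decimal("450.00"))]
print(entry_balances(entry))  # True

# A hallucinated extra line breaks the debit/credit constraint:
bad = entry + [("Misc Expense", Decimal("25.00"), Decimal("0.00"))]
print(entry_balances(bad))    # False

def adjusted_bank_balance(statement, deposits_in_transit, outstanding_checks):
    """Classic reconciliation: statement balance, plus deposits the bank
    hasn't recorded yet, minus checks that haven't cleared."""
    return statement + sum(deposits_in_transit) - sum(outstanding_checks)

# Skipping the deposit-in-transit term is the failure mode described above:
print(adjusted_bank_balance(Decimal("9800.00"),
                            [Decimal("500.00")],    # deposit in transit
                            [Decimal("300.00")]))   # outstanding check
```

A model that pattern-matches the right account name can still fail either check, which is why classification and record-creation scores diverge so sharply.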

Asked why they did so poorly, he said one factor is a lack of domain context: the general models are trained on broad internet data rather than given deep exposure to accounting standards, workflows, and edge cases. They also have only limited access to external tools and data, unlike specialized business and accounting AIs (such as DualEntry's), which often integrate with databases, calculators, or retrieval systems rather than relying solely on their training data. And third, dedicated systems are usually fine-tuned on financial datasets and real accounting scenarios, giving them a clear advantage on these specialized tasks.

The results could be sobering for the 82% of respondents who recently said in a poll that they trust AI with financial advice and guidance, along with the nearly one in two who believe AI is superior to everyone in their life when it comes to providing financial information and guidance.

While Brasca said the point wasn't to crown the "best model," DualEntry did want to demystify the models somewhat in order to better gauge just how suitable they are for accounting work.

"Most public benchmarks test general reasoning or knowledge questions. That's very different from how accounting software works in practice. Inside an ERP, the model isn't writing text — it has to create structured financial records like journal entries, bills, and reconciliations with the correct accounts, amounts, and line items. So we built a benchmark that mirrors how an accounting copilot actually operates," he said. 
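The distinction Brasca draws, that inside an ERP a model must emit structured records rather than prose, implies a stricter pass condition: the output has to parse and carry the fields the system requires. A minimal sketch under a hypothetical schema (not DualEntry's actual record format):

```python
# Hypothetical sketch of grading a structured record instead of free text:
# the model's output must parse as JSON, include required fields, and
# describe a balanced multiline entry.

import json

REQUIRED = {"date", "memo", "lines"}

def valid_record(raw: str) -> bool:
    """Return True only if raw is a well-formed, balanced journal entry."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not REQUIRED <= rec.keys():
        return False
    lines = rec["lines"]
    debits = sum(line.get("debit", 0) for line in lines)
    credits = sum(line.get("credit", 0) for line in lines)
    return len(lines) >= 2 and round(debits - credits, 2) == 0

good = ('{"date": "2025-03-15", "memo": "Staples purchase", "lines": '
        '[{"account": "Office Supplies", "debit": 450.0}, '
        '{"account": "Cash", "credit": 450.0}]}')
print(valid_record(good))             # True
print(valid_record("Office Supplies"))  # False: prose, not a record
```

Under this kind of check, a fluent but unstructured answer that would read fine in a chat window scores zero, which is the gap the benchmark was built to measure.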

