Most Aussie banks only provide statements as a PDF, and generic converters often fail: columns drift, multi-line descriptions break parsing, headers shift. Existing tools don’t handle it well and I wanted a tool that just works.
To get started, I used my own bank statements to build the initial parsers. There was a "duh" moment when I realised how hard it is to get more realistic test data. People don't just hand over their financial ledgers. This solidified my core principle: trust and privacy had to be the absolute top priority.
I initially tried building everything client-side in JavaScript for maximum privacy, but performance and reliability were poor, and exposing the parsers on the front-end would have made them easy to copy.
I settled on a middle ground: a Python and FastAPI backend on Google Cloud Run. This lets me balance reliability with a strict privacy architecture. Files are processed in real-time and the temp file is deleted immediately after the request is complete. There is no persistent storage and no logging of request bodies.
My technical approach is straightforward and focused on reliability:
- I use pdfplumber to extract text, avoiding complex and error-prone OCR.
- I apply a set of bank-specific regex patterns to pinpoint dates, amounts, and descriptions.
- A lookahead heuristic correctly merges multi-line transactions. Each parser is customised to its bank's unique PDF layout quirks.
The project is deliberately focused. Instead of supporting hundreds of banks with mediocre results, I'm concentrating on a small set to get them right. It currently supports CommBank, Westpac, UBank, and ING, with ANZ and NAB next. The whole thing is deployed on Cloudflare Pages and outputs clean CSVs ready for Excel, Google Sheets, Xero, or MYOB.
It was a fun challenge in reverse-engineering messy, real-world data.
Try it out here: https://aussiebankstatements.com
I'd love to hear feedback. If it breaks on your statement, a redacted sample would be a huge help for improving the parser.
I'm also curious to hear how others here have tackled similar messy data extraction challenges.