I've been toying with the idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would have the usual challenges. The main challenge is distribution, of course.
Your transactions are probably already available in CSV.
Or, if anything, I'll add it to the projects-that-already-do-this-but-haven't-yet-found list.
The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in-depth.
Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."
I'm looking for a library that can extract data tables from PDF and can be called from a C++ program (for https://www.easydatatransform.com). If anyone can suggest something, I'm all ears.
simonw•4mo ago
The bigger question is how well they perform. There are needle-in-haystack benchmarks that test that, and models are mostly scoring quite highly on those now.
https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.
Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
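For anyone unfamiliar with the term, here is a rough sketch of how a needle-in-haystack test is usually put together: one fact (the "needle") is buried at a chosen depth in a long run of filler text, and the model is asked to retrieve it. This is not how any of the linked benchmarks are actually implemented. The filler sentence, needle, question, and both function names are made up for illustration, and the call to the model itself is left out.

```python
def build_haystack(filler: str, needle: str, depth: float, total_sentences: int) -> str:
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * total_sentences
    position = round(depth * total_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def score(answer: str, expected: str) -> bool:
    """Crude pass/fail check: does the model's answer contain the expected fact?"""
    return expected.lower() in answer.lower()


# Hypothetical needle and question, invented for this example.
NEEDLE = "The secret passphrase is 'blue-heron-42'."
QUESTION = "What is the secret passphrase?"
EXPECTED = "blue-heron-42"

# Bury the needle halfway through 1000 filler sentences, then append the question.
prompt = build_haystack("The sky was grey that morning.", NEEDLE, depth=0.5, total_sentences=1000)
prompt += "\n\n" + QUESTION
# `prompt` would be sent to the model under test; its reply is then passed to score().
```

Real benchmarks sweep both the context length and the needle depth, and the harder ones (like the leaderboards above) require reasoning over several facts rather than a single lookup.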