I've been toying with the idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still get fairly consistent rendering. This format could easily be converted to PDF, although the opposite conversion would have the usual challenges. The main challenge, of course, is distribution.
Or, if anything, I'll add it to the projects-that-already-do-this-but-I-haven't-found-yet list.
The JSON extract actually looks pretty good and seems to produce something usable in one shot, which beats all the other tools I've tried so far. I still need to check it in more depth, though.
Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."
clueless•2h ago
simonw•2h ago
The bigger question is how well they perform. There are needle-in-a-haystack benchmarks that test that, and they're mostly scoring quite highly on those now.
https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5.
Here's a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/
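For anyone who hasn't seen one, a needle-in-a-haystack test is roughly: bury one distinctive fact in a lot of filler text, ask about it, and check whether the answer comes back. Here's a minimal sketch of that idea in Python. It is not how the linked leaderboards are implemented; the filler text, the passphrase, and the `ask_model` callable are all made-up placeholders you'd replace with your own model call.

```python
from typing import Callable, Dict, Iterable, Tuple

# All of these strings are illustrative assumptions, not taken from any real benchmark.
FILLER = "The sky was grey and the train was late again."
NEEDLE = "The secret passphrase is 'violet-anchor-42'."
QUESTION = "What is the secret passphrase?"
ANSWER = "violet-anchor-42"


def build_haystack(n_sentences: int, depth: float) -> str:
    """Return n_sentences of filler with the needle inserted at a relative depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def run_trial(ask_model: Callable[[str], str], n_sentences: int, depth: float) -> bool:
    """Ask the model about the needle and score by simple substring match."""
    prompt = build_haystack(n_sentences, depth) + "\n\n" + QUESTION
    return ANSWER in ask_model(prompt)


def sweep(
    ask_model: Callable[[str], str],
    lengths: Iterable[int],
    depths: Iterable[float],
) -> Dict[Tuple[int, float], bool]:
    """Run one trial per (context length, needle depth) pair."""
    depths = list(depths)
    return {(n, d): run_trial(ask_model, n, d) for n in lengths for d in depths}
```

You'd pass a function that sends the prompt to whatever model you're testing as `ask_model`, then look at which (length, depth) cells in the `sweep` result come back `False`. Real benchmarks use more varied filler and stricter scoring than a substring match.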
clueless•1h ago
lysecret•1h ago
mekael•32m ago