PDF-based pipelines are fundamentally lossy and compute-heavy, whether they rely on OCR, GROBID, or LLM-based parsing. They're simply not good enough for accurate scientific agents at scale.
To fix this, I'm launching ScienceStack API: a lossless, node-based API for scientific papers with LaTeX source, starting with arXiv.
It currently covers 150k+ arXiv papers, mainly in CS, Math, and Physics.
I’m giving away five 3-month Pro keys to early commenters who are building in this space (scientific tooling, agents, copilots, RAG, etc.). I’d love to hear what you’re working on.
cjlooi•1d ago
Every paper also ships with a WYSIWYG interactive reader at sciencestack.ai/paper/{arxivId}. Example: https://www.sciencestack.ai/paper/2512.24601v1
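For anyone scripting against the reader links, here is a minimal Python sketch that builds a reader URL from an arXiv ID. It assumes the `sciencestack.ai/paper/{arxivId}` pattern above holds for every covered paper. The helper name `reader_url` is hypothetical and is not part of the ScienceStack API. The sketch only formats a link and makes no network request.

```python
from urllib.parse import quote

# Base taken from the example link above; assumed stable for all covered papers.
READER_BASE = "https://www.sciencestack.ai/paper/"


def reader_url(arxiv_id: str) -> str:
    """Build the reader page URL for an arXiv ID such as '2512.24601v1'.

    Hypothetical helper: it only formats the URL and does not check
    whether the paper is actually covered.
    """
    cleaned = arxiv_id.strip()
    if not cleaned:
        raise ValueError("arxiv_id must be non-empty")
    # Percent-encode anything unusual. Slashes are kept so old-style IDs
    # pass through unchanged; whether the reader accepts those is not stated.
    return READER_BASE + quote(cleaned, safe="/")


print(reader_url("2512.24601v1"))
```

With the example ID from above, this reproduces the example link.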