I was trying to get up to speed on a new topic for a project and got really tired of juggling a dozen PDFs. You know the drill: download, skim, lose track of which paper had that one good paragraph, repeat.
So I hacked together a little Streamlit app to help with this.
Basically, you give it a topic and some websites (like arXiv), and it tries to find and download a bunch of PDFs for you. Then it chews on them for a while and gives you a chat box so you can ask questions about the stuff inside.
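If you're curious what the download step amounts to, here's a simplified sketch of the arXiv branch (not the app's exact scraper; this one leans on the `arxiv` PyPI package, and the query and output directory are just placeholders):

```python
# Simplified sketch of the arXiv download step, using the `arxiv` PyPI
# package. Query, result count, and output dir are placeholders.
import os
import arxiv

def fetch_arxiv_pdfs(topic: str, out_dir: str = "papers", n: int = 10) -> None:
    os.makedirs(out_dir, exist_ok=True)
    client = arxiv.Client()
    search = arxiv.Search(query=topic, max_results=n)
    for result in client.results(search):
        # Saves the PDF into out_dir, named after the arXiv ID and title
        result.download_pdf(dirpath=out_dir)

fetch_arxiv_pdfs("retrieval augmented generation")
```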
The tech part that might be interesting: I didn't want to do just a basic vector search on text chunks. I'd read the RAPTOR paper (Recursive Abstractive Processing for Tree-Organized Retrieval) and thought it was a cool idea, so I tried to implement it. It recursively clusters the text chunks, uses an LLM to summarize each cluster, and then repeats on the summaries, building up a tree. The hope is that retrieving over the summary layers gives more high-level, synthesized answers. The whole multi-step pipeline is held together with LangGraph.
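If RAPTOR is new to you, the core loop is small enough to sketch. This is just the shape of the idea, not the app's exact code: clustering here uses scikit-learn, the summarizer is stubbed where an LLM call would go, and the embedding function is pluggable:

```python
# Sketch of RAPTOR-style tree building: cluster, summarize, recurse.
# Not the app's code; `embed` is any text -> vector function, and
# summarize() is a stub where the real pipeline would call an LLM.
import numpy as np
from sklearn.mixture import GaussianMixture

def summarize(texts: list[str]) -> str:
    # Stand-in for an LLM prompt like "Summarize these passages: ..."
    return " ".join(texts)[:500]

def build_tree(chunks: list[str], embed, max_levels: int = 3) -> list[str]:
    """Return every node in the tree: the raw chunks plus each
    level of cluster summaries, all of which get indexed."""
    nodes, layer = list(chunks), list(chunks)
    for _ in range(max_levels):
        if len(layer) <= 1:
            break
        vecs = np.array([embed(t) for t in layer])
        k = max(1, len(layer) // 5)  # aim for ~5 chunks per cluster
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(vecs)
        layer = [summarize([t for t, c in zip(layer, labels) if c == j])
                 for j in range(k)]
        nodes.extend(layer)
    return nodes
```

At query time you embed the question and retrieve over all of those nodes at once (the "collapsed tree" variant from the paper), so an abstract question can land on a summary instead of a single raw chunk.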
It's pretty rough right now. PDF parsing is a nightmare, as usual, so it probably messes up on papers with complex layouts. And the indexing part is kinda slow if you give it a lot to read. It works with local Ollama models or Gemini if you plug in an API key.
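On the model switch: here's a simplified sketch of what an Ollama/Gemini toggle can look like, assuming LangChain's wrappers (this isn't lifted from the repo, and the model names are just whatever you have pulled or enabled):

```python
# Sketch of an Ollama/Gemini backend switch via LangChain wrappers.
# Not the app's exact wiring; model names are placeholders.
import os

def get_llm():
    if os.environ.get("GOOGLE_API_KEY"):
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    from langchain_ollama import ChatOllama
    return ChatOllama(model="llama3")  # any model you've pulled locally

llm = get_llm()
print(llm.invoke("Say hi in five words.").content)
```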
I don't have a live demo up because the indexing would probably cook a cheap server, but it should run locally without too much fuss.
I'm posting it here mostly to see if this is a problem anyone else has, and if the way I'm trying to solve it makes any sense. If you end up trying it, I'd love to know what breaks or what you think would make it more useful.
The code's here: https://github.com/andres-ulloa-de-la-torre/deep-search-acad...
Thanks.