My literature review process starts with a broad search to find a few key papers/groups, and from there expands along their citation networks. I needed to conduct several rounds of literature review during the course of my research and decided to build a tool to facilitate this process. The tool started as an experimental wrapper over low-level statistical software in C, quickly became a testing/iteration ground for our API, and is now my personal go-to for lit reviews.
The tool organizes corpora of text content, visualizes the high-level themes, and lets me pull up relevant excerpts. Unlike LLMs, this model transparently organizes the data and can train from scratch quickly on small datasets to learn custom hierarchical taxonomies. My favorite part of the tool is the citation network integration: any research paper it pulls up has a button “Citation Network Deep Dive” that pulls every paper that cites or is cited by the original paper, and organizes them for further exploration.
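To make the "cites or is cited by" step concrete, here is a minimal sketch of a one-hop citation expansion. It is not the tool's actual code: the function name `one_hop_neighborhood`, the `(citing, cited)` edge format, and the paper IDs are all assumptions made up for illustration.

```python
def one_hop_neighborhood(citation_edges, paper_id):
    """Collect the papers one citation hop away from `paper_id`.

    `citation_edges` is an iterable of (citing_id, cited_id) pairs.
    This edge format is a hypothetical stand-in for whatever a real
    citation index would return.
    """
    references = set()  # papers that paper_id cites
    cited_by = set()    # papers that cite paper_id
    for citing, cited in citation_edges:
        if citing == paper_id:
            references.add(cited)
        if cited == paper_id:
            cited_by.add(citing)
    return {"references": references, "cited_by": cited_by}


# Made-up edges: A cites B and C, D cites A.
edges = [("A", "B"), ("A", "C"), ("D", "A"), ("D", "C")]
neighborhood = one_hop_neighborhood(edges, "A")
```

The union of the two sets is the corpus that would then be handed to the topic model for the deep dive.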
I initially built this tool for academic research, but ended up extending it to support Hacker News (to mine technical conversation), the top 200 Google results, and earnings transcripts. We have a gallery of ready-to-explore results on the homepage. If you are kicking off a custom deep dive, it takes about 1-5 minutes for academic search, 3-7 minutes for Hacker News, and 5-10 minutes for Google. To demonstrate the process, I put together a video walkthrough of a short literature review I conducted on AI hallucinations: https://www.youtube.com/watch?v=OUmDPAcK6Ns
I host this tool on my company’s website, free for personal use. I’d love to know if the HN community finds it useful (or to hear what breaks)!
kianN•1d ago
Under the hood, this model resembles LDA, but replaces its Dirichlet priors with Pitman–Yor Processes (PYPs), which better capture the power-law behavior of word distributions. It also supports arbitrary hierarchical priors, allowing metadata-aware modeling.
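To give a feel for the power-law point, here is a rough sketch of the two-parameter Chinese restaurant process, the standard sequential construction of a Pitman–Yor process. It is an illustration only, not our C implementation; the function name and parameter values are made up for the example. With `discount=0` it reduces to the Dirichlet-process case.

```python
import random


def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers with the two-parameter Chinese restaurant process.

    Returns the list of table sizes. Requires 0 <= discount < 1 and
    concentration > -discount. discount == 0 gives the Dirichlet-process CRP.
    """
    if not 0 <= discount < 1 or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = random.Random(seed)
    tables = []  # tables[j] = number of customers at table j
    for n in range(n_customers):
        k = len(tables)
        if n == 0:
            p_new = 1.0  # the first customer always opens a table
        else:
            # P(new table) = (concentration + discount * K) / (concentration + n)
            p_new = (concentration + discount * k) / (concentration + n)
        if rng.random() < p_new:
            tables.append(1)
        else:
            # Existing table j is chosen with weight (size_j - discount).
            weights = [size - discount for size in tables]
            j = rng.choices(range(k), weights=weights)[0]
            tables[j] += 1
    return tables


dp_tables = pitman_yor_crp(5000, discount=0.0, concentration=1.0)
pyp_tables = pitman_yor_crp(5000, discount=0.5, concentration=1.0)
```

In the Dirichlet case the number of tables grows roughly logarithmically with the number of customers, while a positive discount makes it grow polynomially and produces a long tail of small tables, which is much closer to how word frequencies behave.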
To illustrate the hierarchical priors: in an earnings-transcript corpus, a typical LDA might have a flat structure: Prior → Document
Our model instead uses a hierarchical graph: Uniform Prior → Global Topics → Ticker → Quarter → Paragraph
This hierarchical structure, combined with the PYP statistics, consistently yields more coherent and fine-grained topic structures than standard LDA does. There’s also a “fast mode” that collapses some hierarchy levels for quicker runs; it’s a handy option if you’re curious to see the impact hierarchy has on the model results (or in a rush).
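As a sketch of what the prior graph looks like as a data structure, the chain above can be written as nodes with parent pointers. This is not our API: `PriorNode`, `paragraph_prior`, and the ticker/quarter values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PriorNode:
    """One node in a hierarchy of priors; each node draws from its parent."""
    name: str
    parent: Optional["PriorNode"] = None

    def path(self) -> List[str]:
        """Return node names from the root prior down to this node."""
        names = []
        node: Optional[PriorNode] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


def paragraph_prior(ticker: str, quarter: str, paragraph_id: int) -> PriorNode:
    """Build the Uniform Prior -> Global Topics -> Ticker -> Quarter -> Paragraph chain."""
    root = PriorNode("Uniform Prior")
    global_topics = PriorNode("Global Topics", root)
    ticker_node = PriorNode(f"Ticker:{ticker}", global_topics)
    quarter_node = PriorNode(f"Quarter:{quarter}", ticker_node)
    return PriorNode(f"Paragraph:{paragraph_id}", quarter_node)


# Flat LDA-style structure versus the hierarchical one (made-up metadata).
flat = PriorNode("Document", PriorNode("Prior"))
hierarchical = paragraph_prior("ACME", "2024Q1", 7)
```

Calling `hierarchical.path()` walks the chain from the root down, so paragraphs from the same quarter share a quarter-level prior, quarters from the same company share a ticker-level prior, and so on.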
malshe•6h ago
kianN•6h ago
We have some more technical write-ups on the internals of the model that are not hosted publicly (we have some ongoing publication efforts applying the model to scRNA sequencing). But feel free to shoot me an email (in my profile) and I'd be happy to send over some of our more technical documents.
malshe•5h ago
johnhoffman•3h ago
What is the go-to "production" stack for something like this nowadays? Is Stan dead? Do you do HMC or approximations with e.g. Pyro?
kianN•3h ago
On top of the C layer, we built a Python wrapper to help construct arbitrary graphs of Dirichlet and Pitman–Yor processes.
From there, we have some more Python wrappers and store everything in a hierarchical DuckDB schema for fast query access.
The site itself is actually just a light wrapper around our API that simplifies this process.