Topaz et al. reported their findings on citation hallucination in May in The Lancet. They scanned 2.5 million PubMed Central articles and estimated that 1 in 277 contained a fabricated citation. Some of their examples were this exact pattern: real identifier, fabricated title.
I originally built Scholar Sidekick as a formatter for my own use as a clinician-educator preparing talks, articles, etc. After reading the Topaz paper, I added a verifier to catch the most common pattern they found: a real identifier attached to the wrong paper.
My tool resolves the identifier, and then compares the title in your reference with the returned metadata (i.e. does this DOI, PMID, or arXiv ID actually point to the right paper?). It does not attempt to judge whether the cited paper actually supports the claim you make in your text. That still needs judgment, preferably human judgment.
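To make the "resolve the identifier" step concrete, here is a minimal sketch of how an identifier might be classified and mapped to a public lookup URL. The regex patterns and the function names `parse_identifier` and `lookup_url` are illustrative assumptions, not the tool's actual code. The endpoints are the public Crossref, NCBI E-utilities and arXiv APIs, and which of them the tool really calls is also an assumption.

```python
import re
from urllib.parse import quote

# Illustrative patterns only; real-world identifiers have more edge cases.
PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:|pmid:|arxiv:)\s*", re.I)
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_RE = re.compile(r"^(?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$")
PMID_RE = re.compile(r"^\d{1,8}$")


def parse_identifier(raw: str) -> tuple[str, str]:
    """Return (kind, value), where kind is 'doi', 'arxiv', 'pmid' or 'unknown'."""
    value = PREFIX_RE.sub("", raw.strip())
    if DOI_RE.match(value):
        return "doi", value
    if ARXIV_RE.match(value):
        return "arxiv", value
    if PMID_RE.match(value):
        return "pmid", value
    return "unknown", raw.strip()


def lookup_url(kind: str, value: str) -> str:
    """Build a metadata lookup URL for a parsed identifier (hypothetical helper)."""
    if kind == "doi":
        return "https://api.crossref.org/works/" + quote(value)
    if kind == "pmid":
        return ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
                f"?db=pubmed&id={value}&retmode=json")
    if kind == "arxiv":
        return "http://export.arxiv.org/api/query?id_list=" + quote(value)
    raise ValueError(f"no resolver for identifier kind {kind!r}")


kind, value = parse_identifier("https://doi.org/10.1000/xyz123")
print(kind, lookup_url(kind, value))
```

The sketch only covers three identifier types and sends every DOI to Crossref; a DOI registered with DataCite would need a different lookup.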
I ran 350 previously unseen citations through the API once each in a test. It correctly identified all 37 fabricated references, but wrongly flagged 5 of 285 real references: 1.8% (95% CI 0.8–4.0%). (Plain similarity comparison, without the optional LLM screening - I would expect the LLM to rescue some of those borderline cases. A handful of citations returned no result on upstream timeouts and weren't scorable either way.) The test suite, results and failures are public, so you do not have to take my word for it. You can check them yourself.
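For anyone who wants to check the interval arithmetic, the sketch below computes a Wilson score interval for 5 false positives out of 285 real references. The post does not say which interval method was used, so Wilson is an assumption here; it does reproduce the quoted range when z = 1.96.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(5, 285)
print(f"{5 / 285:.1%} ({low:.1%} to {high:.1%})")  # prints 1.8% (0.8% to 4.0%)
```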
The web version is free and anonymous. The REST API and MCP server use a RapidAPI key, with a free rate-limited tier and paid tiers above that. The MCP server is on npm, Smithery and Glama, and the Obsidian plugin is in the community store. Browser extensions for Chrome, Firefox and Edge are in their respective stores as well.
I'm very open to feedback and look forward to hearing from anyone who tries it - what works? What fails? Thanks in advance.
ProductivePhys•35m ago
---
A few more points that didn't quite fit in my main post:
My citation verifier is not a wrapper around a language model. It is deterministic. It takes identifier(s), looks them up in authoritative lists (Crossref, NCBI eutils, DataCite, arXiv, ADS, WHO IRIS), and then compares their associated title and author(s) to yours.
I do normalise the tricky things: HTML markup, Unicode characters, punctuation, letter case, stop words, etc. A similarity score is then calculated from token overlap and edit distance. This is harder than it looks! The biggest difficulty was choosing reasonable thresholds. Too sensitive and you will flag legitimate variations; too loose and you will fail to catch fabrication. I used the validation fixture to tune this, but I am deliberately publishing the confidence level it produces rather than claiming a hard pass/fail binary.
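A rough sketch of what that normalise-then-score step could look like, using only the standard library. The stop-word list, the equal weighting of the two measures, and every function name here are assumptions made for illustration; the real verifier's normalisation rules and thresholds are not shown in this post.

```python
import html
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption, not the tool's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with"}


def normalise_title(title: str) -> list[str]:
    """Strip markup, accents, case and punctuation; return the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", title)        # drop tags such as <i>...</i>
    text = html.unescape(text)                   # &amp; -> &
    text = unicodedata.normalize("NFKD", text)   # split accents from base letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def token_overlap(a: list[str], b: list[str]) -> float:
    """Jaccard overlap of the two token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def title_similarity(cited: str, resolved: str) -> float:
    """Average of token overlap and edit similarity, in [0, 1] (weights assumed)."""
    ta, tb = normalise_title(cited), normalise_title(resolved)
    sa, sb = " ".join(ta), " ".join(tb)
    longest = max(len(sa), len(sb))
    edit_sim = 1.0 if longest == 0 else 1 - edit_distance(sa, sb) / longest
    return (token_overlap(ta, tb) + edit_sim) / 2


print(title_similarity("<i>Café</i> &amp; the Heart: A Review", "cafe & heart - a review"))
```

Returning a score rather than a verdict mirrors the choice described above: where to draw the line between "legitimate variation" and "mismatch" is left to a threshold that has to be tuned against real data.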
The verifier actually performed worse on my first blind eval, with 5.3% of real citations flagged as mismatches. The problem was extremely simple: I hadn't allowed for author names recorded with initials first. After fixing that, I drew a new citation set (so the fix couldn't have been tuned to the original test set) and re-ran. That second run is the result published above, with a 1.8% false-positive rate. I've published both runs and the receipts, not just the latter.
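To illustrate the kind of mismatch involved, here is a hypothetical sketch that reduces an author name to a (surname, first initial) key regardless of whether the initials come first or last. It is not the actual fix, just one simple way to make "J. A. Smith" and "Smith JA" compare equal; the helper names and the "at most three capital letters" rule for spotting initials are assumptions.

```python
def _is_initials(token: str) -> bool:
    """True for tokens like 'J', 'J.', 'JA' or 'J.A.' (assumed: 1-3 capitals)."""
    letters = token.replace(".", "")
    return 0 < len(letters) <= 3 and letters.isalpha() and letters.isupper()


def author_key(name: str) -> tuple[str, str]:
    """Reduce a name to (surname, first initial), both casefolded."""
    name = name.strip()
    if "," in name:
        # 'Smith, John A.': surname before the comma
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        tokens = name.split()
        if len(tokens) > 1 and _is_initials(tokens[-1]):
            # 'Smith JA': surname first, trailing initials
            k = len(tokens)
            while k > 1 and _is_initials(tokens[k - 1]):
                k -= 1
            surname, given = " ".join(tokens[:k]), "".join(tokens[k:])
        else:
            # 'J. A. Smith' or 'John Smith': last token taken as the surname
            surname, given = (tokens[-1] if tokens else ""), " ".join(tokens[:-1])
    initial = next((ch for ch in given if ch.isalpha()), "")
    return surname.casefold(), initial.casefold()


for form in ("J. A. Smith", "Smith JA", "Smith, John A.", "John Smith"):
    print(form, "->", author_key(form))
```

This heuristic will misread multi-word surnames written without a comma and short surnames typed in all capitals, which is exactly the sort of edge case a validation corpus is for.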
The web SaaS addresses one of the two potential problems with citation verification: 'Real DOI but wrong title' can be mechanically checked against the underlying system. 'Real article but doesn't support claim' is far harder. To address that requires reading the claim and the paper. I'm deliberately not trying to solve that problem. The furthest automation can easily go at that level appears to be something like: 'the abstract to the cited article appears to not contain any of the concepts contained in the claim'. Sometimes useful, but easy to overstate.
The web SaaS is closed source because of the ongoing hosting and service costs, which the anonymous free tier subsidises.
Yes, I am aware there are other tools that solve different problems: Retraction Watch for withdrawn papers, Unpaywall for open access, Scite for context analysis of citations. However, none of them directly addresses what Topaz et al. identified as the most common pattern of fabrication: "Is this citation real and correctly attributable to this identifier?"
Areas for ongoing work: addressing the edge cases and expanding the validation corpus. Later, possibly a streaming/batch verifier for large reference lists, or a conservative semantic-layer flag based on abstract-vs-claim concept overlap. Both of those carry significant risks of over-promising, particularly the latter.
Keen to hear thoughts on the project.