The pipeline uses five agents: a retriever selects reference diagrams via in-context learning, a planner drafts the layout, a stylist adjusts for conference aesthetics, a visualizer renders with Gemini, and a critic evaluates and refines over three rounds.
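For anyone who wants the control flow at a glance, here is a minimal sketch of how the five stages could be chained. It is not the repo's actual code: `Draft`, `run_pipeline`, the callable signatures, and the way critic notes are folded back into the plan are all hypothetical, and the agents are passed in as plain functions so that no model API is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """Hypothetical container for the evolving diagram state."""
    plan: str
    image: str = ""
    feedback: List[str] = field(default_factory=list)


def run_pipeline(
    text: str,
    retrieve: Callable[[str], List[str]],
    plan: Callable[[str, List[str]], str],
    style: Callable[[str], str],
    render: Callable[[str], str],
    critique: Callable[[str, str], str],
    rounds: int = 3,
) -> Draft:
    # Retriever picks reference examples, planner drafts, stylist adjusts.
    references = retrieve(text)
    draft = Draft(plan=style(plan(text, references)))
    # Visualizer produces the first render.
    draft.image = render(draft.plan)
    # Critic loop: evaluate, fold the note into the plan, re-render.
    for _ in range(rounds):
        note = critique(text, draft.image)
        draft.feedback.append(note)
        draft.plan = draft.plan + "\n[revise] " + note  # assumed revision format
        draft.image = render(draft.plan)
    return draft


# Stub agents stand in for the real model calls.
demo = run_pipeline(
    "We propose a two-stage encoder.",
    retrieve=lambda t: ["ref-1"],
    plan=lambda t, refs: f"layout using {len(refs)} reference(s)",
    style=lambda p: p + " | styled",
    render=lambda p: f"<image from {len(p)} chars of plan>",
    critique=lambda t, img: "tighten spacing",
)
print(len(demo.feedback), "critic rounds")
```

Passing the agents as callables keeps the loop testable with stubs; in a real run each callable would wrap a model call.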
The part that took the most effort was the reference dataset. The paper curates 292 (text, diagram, caption) tuples from 2,000 NeurIPS papers, filtering by aspect ratio and human review. Reproducing that required PDF layout extraction with MinerU, positional heuristics to identify methodology sections (paper headings are wildly inconsistent), and manual verification of each example.
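To give a rough idea of what the filtering and heading heuristics can look like, here is a small sketch. Everything in it is an assumption for illustration: the aspect-ratio bounds, the keyword lists, and the function names `keep_by_aspect_ratio` and `find_method_span` are hypothetical and are not taken from the paper or the repo. The positional idea is simply that the methodology block tends to sit after the introduction and related work and before the experiments, whatever the authors happened to call it.

```python
import re
from typing import List, Optional, Tuple

# Assumed bounds; the actual thresholds are not stated in this post.
MIN_RATIO, MAX_RATIO = 1.5, 2.5


def keep_by_aspect_ratio(width: int, height: int) -> bool:
    """Keep figures whose width/height ratio falls inside the assumed bounds."""
    if height <= 0:
        return False
    return MIN_RATIO <= width / height <= MAX_RATIO


# Hypothetical keyword lists for sections that bracket the method section.
_BEFORE = re.compile(r"introduction|related work|background|preliminar", re.I)
_AFTER = re.compile(r"experiment|evaluation|results|conclusion", re.I)


def find_method_span(headings: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) heading indices of the presumed methodology block.

    start is the index just past the last "front matter" heading; end is the
    first later heading that looks like experiments or conclusions.
    """
    start = 0
    for i, heading in enumerate(headings):
        if _BEFORE.search(heading):
            start = i + 1
    end = len(headings)
    for i, heading in enumerate(headings):
        if i >= start and _AFTER.search(heading):
            end = i
            break
    return (start, end) if start < end else None


headings = [
    "Abstract", "1 Introduction", "2 Related Work",
    "3 BananaNet", "3.1 Encoder", "4 Experiments", "5 Conclusion",
]
span = find_method_span(headings)
print(span, headings[span[0]:span[1]] if span else None)
```

A heuristic like this only produces candidates, which is why the manual verification pass is still needed.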
Output quality depends heavily on the quality of the reference set. We're asking the community to submit their papers via issues so we can add them. Quality examples in, quality output out!
Runs on Gemini's free tier. Also includes an MCP server if you want to use it from your IDE. https://github.com/llmsresearch/paperbanana