Launch HN: Captain (YC W26) – Automated RAG for Files

https://www.runcaptain.com/
24•CMLewis•2h ago
Hi HN, we’re Lewis and Edgar, building Captain to simplify unstructured data search (https://runcaptain.com). Captain automates the building and maintenance of file-based RAG pipelines. It indexes cloud storage like S3 and GCS, plus SaaS sources like Google Drive. There’s a quick walkthrough at https://youtu.be/EIQkwAsIPmc.

We also put up this demo site called “Ask PG’s Essays” which lets you ask/search the corpus of pg’s essays, to get a feel for how it works: https://pg.runcaptain.com. The RAG part of this took Captain about 3 minutes to set up.

Here are some sample prompts to get a feel for the experience:

“When do we do things that don't scale? When should we be more cautious?” https://pg.runcaptain.com/?q=When%20do%20we%20do%20things%20...

“Give me some advice, I'm fundraising” https://pg.runcaptain.com/?q=Give%20me%20some%20advice%2C%20...

“What are the biggest advantages of Lisp” https://pg.runcaptain.com/?q=what%20are%20the%20biggest%20ad...

A good production RAG pipeline takes substantial effort to build, especially for file workloads. You have to handle ETL or text extraction, chunking, embedding, storage, search, re-ranking, inference, and often compliance and observability – all while optimizing for latency and reliability. It’s a lot to manage. grep works well in some cases, but for agents, semantic search provides significantly higher performance. Cursor uses both and reports 6.5%–23.5% accuracy gains from vector search over grep (https://cursor.com/blog/semsearch).

We’ve spent the past four years scaling RAG pipelines for companies, and Edgar’s work at Purdue’s NLP lab directly informed our chunking techniques. In conversations with dozens of engineers, we repeatedly saw DIY pipelines produce inconsistent results, even after weeks of tuning. Many teams lacked clarity on which retrieval strategies best fit their data.

We realized that a system to provision storage and embeddings, handle indexing, and continuously update pipelines to reflect the latest search techniques could remove the need for every team to rebuild RAG themselves. That idea became Captain.

In practice, one API call indexes URLs, cloud storage buckets, directories, or individual files. Under the hood, we’re converting everything to Markdown. For this, we’ve had good results with Gemini 3 Pro for images, Reducto for complex documents, and Extend for basic OCR. For embedding models, ‘gemini-embedding-001’ performed reasonably well at first, but we later switched to the Contextualized Embeddings from ‘voyage-context-3’. It produced more relevant results than even the newer Voyage 4 models because its chunk embeddings are encoded with awareness of the surrounding document context. We then applied Voyage’s ‘rerank-2.5’ as second-stage re-ranking, reducing 50 initial chunks to a final top 15 (configurable in Captain’s API). Dense embeddings are just half the picture; full-text search with RRF (reciprocal rank fusion) completes our hybrid retrieval. In the Captain API, these techniques are exposed through a single /query endpoint. Access controls can be configured via metadata filters, and page number citations are returned automatically.
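
For readers unfamiliar with the hybrid retrieval step described above, here is a minimal Python sketch of reciprocal rank fusion followed by a top-k cut. It is an illustration under assumptions, not Captain's implementation: the function names, the `k=60` smoothing constant (a commonly used default), and the optional `rerank` callback standing in for a second-stage re-ranker are all hypothetical. Only the 50-candidate and 15-result figures come from the post.

```python
from collections import defaultdict
from typing import Callable, Optional, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list using RRF.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_retrieve(
    dense_ranking: Sequence[str],
    fulltext_ranking: Sequence[str],
    candidate_count: int = 50,
    final_count: int = 15,
    rerank: Optional[Callable[[list[str]], list[str]]] = None,
) -> list[str]:
    """Fuse dense and full-text rankings, optionally re-rank, keep the top results.

    `rerank` is a hypothetical stand-in for a second-stage re-ranking model:
    it receives the fused candidate IDs and returns them in a new order.
    """
    candidates = reciprocal_rank_fusion([dense_ranking, fulltext_ranking])[:candidate_count]
    if rerank is not None:
        candidates = rerank(candidates)
    return candidates[:final_count]


if __name__ == "__main__":
    dense = ["a", "b", "c"]      # chunk IDs ordered by vector similarity
    fulltext = ["b", "d", "a"]   # chunk IDs ordered by full-text score
    print(hybrid_retrieve(dense, fulltext, final_count=3))
```

RRF only needs rank positions, so the dense and full-text scores never have to be normalized onto a common scale, which is one reason it is a popular choice for hybrid search.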

The stack is constantly changing, but the Captain API creates a standard interface for it. You can try Captain free for one month and build your own pipelines at https://runcaptain.com. We’re looking for candid feedback, especially anything that can make it more useful, and look forward to your comments!

Comments

jamiequint•1h ago
This is cool, like qmd as a service with real-time integrations where it matters?

How do you handle more structured data like csv/xlsx/json? Would be cool if it were possible to auto-process links to markdown (e.g. youtube, podcast, arbitrary websites, etc) a la https://github.com/steipete/summarize (which can pull full text in addition to summarizing).

CMLewis•1h ago
Thanks, we're just starting to optimize more for semi-structured data. So far, we've been parsing tables into Markdown and running them through the contextualized embedding model with no overlap, taking advantage of how it strings together chunks. This isn't great for big files, so we're looking into agentic exploration (slow but good for more structured numerical data) and automated graph creation (promising for more relational data).

Love the auto-process markdown idea, we'll add it to our roadmap :D
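
To make the table handling described in this reply concrete, here is a rough Python sketch of turning a CSV into Markdown table chunks whose data rows do not overlap. It is an assumption-laden illustration rather than Captain's parser: the function name, the `rows_per_chunk` parameter, and the choice to repeat the header row in every chunk are all hypothetical.

```python
import csv
import io


def table_to_markdown_chunks(csv_text: str, rows_per_chunk: int = 50) -> list[str]:
    """Split a CSV table into Markdown table chunks with non-overlapping data rows.

    The header row is repeated at the top of each chunk (an assumed design
    choice) so every chunk is a readable table by itself.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []

    def to_line(cells: list[str]) -> str:
        # Escape pipes so cell contents cannot break the Markdown table layout.
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    header, body = rows[0], rows[1:]
    head_lines = [to_line(header), to_line(["---"] * len(header))]

    chunks = []
    # Step through the data rows in fixed-size windows: no row appears twice.
    for start in range(0, len(body), rows_per_chunk):
        window = body[start:start + rows_per_chunk]
        chunks.append("\n".join(head_lines + [to_line(row) for row in window]))
    return chunks


if __name__ == "__main__":
    sample = "name,qty\nbolt,10\nnut,25\nwasher,40\n"
    for chunk in table_to_markdown_chunks(sample, rows_per_chunk=2):
        print(chunk)
        print()
```

Fixed row windows keep each chunk small, but as the reply notes, this approach degrades on large files where answers depend on aggregating across many rows.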

jzig•1h ago
> spotty RAG

:O

vg_head•1h ago
Good looking! I didn't get to watch the video or look at docs in depth, but do the results trace back to the location of the answers in a document? Let's say it finds an answer in a PDF, and I'd like to know where in that PDF the citation is. Is that possible or intended?

CMLewis•56m ago
Great question, we have deterministic page # citations for PDF results and exact bounding box citations coming very soon.

If you want to check out the Query API response example, here's a link: https://docs.runcaptain.com/api-reference/query/collection-v...

mchusma•48m ago
Having tried this a bit, I do really like the single API call for all of it.

I also appreciate the transparent pricing, but I don't have a good sense of the scale of costs. It could be helpful to give some ballpark figures for each of the plans. I'm not sure exactly what I could get out of a plan. My guess, trying hard to figure it out, was that if I had about 1,000 pages of new/updated content per month, I would pay $295/month for unlimited queries on top of it. Is that roughly correct?

edgarbabajanyan•4m ago
Yes, we don't charge for queries. For $295, you're able to index up to 1000 pages of new content per month into a fully queryable pipeline. That can be any of the sources we have listed in our docs :)
