My 11-step GraphRAG pipeline, what worked, and what's still broken

2•pauliusztin•1h ago
While building a financial assistant for an SF start-up, we learned that AI frameworks add complexity without value. When I started building a personal assistant with GraphRAG, I carried that lesson but still tried LangChain's MongoDBGraphStore. It gave me a working knowledge graph in 10 minutes.

Then I looked at the data. I had 17 node types and 34 relationship types from just 5 documents, including three versions of "part of". GraphRAG is a data modeling problem, not a retrieval problem.

The attached diagram shows the full 11-step pipeline I ended up with. Here is a walkthrough of what you can learn from each step.

In steps 1 and 2 of the data pipeline, raw sources go through an Extract, Transform, Load (ETL) process and land as documents in a MongoDB data warehouse. Each document stores the source type, URI, content, and metadata.
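A minimal sketch of what one warehouse record could look like. The field names (`source_type`, `uri`, `content`, `metadata`) mirror the four fields listed above, but the exact names and the example values are assumptions, not the real schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class SourceDocument:
    """One raw source after ETL, shaped for a document store (hypothetical schema)."""

    source_type: str  # illustrative values: "web", "pdf", "note"
    uri: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_record(self) -> dict[str, Any]:
        # A plain dict can be handed to any driver's insert call.
        return asdict(self)


doc = SourceDocument("web", "https://example.com/post", "Hello graph", {"lang": "en"})
record = doc.to_record()
```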

Then in step 3, we clean the documents and split them into token-bounded chunks. We started with 512-token chunks and a 64-token overlap, but these values still need more testing.
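The chunking step can be sketched as a sliding window over a token list. This assumes the text is already tokenized (any list works here, a real tokenizer is out of scope), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = list(range(1000))  # stand-in for real token IDs
chunks = chunk_tokens(tokens)
```

With these defaults each window starts 448 tokens after the previous one, so the last 64 tokens of a chunk reappear at the start of the next.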

Step 4 handles graph extraction. We defined a strict ontology: a formal contract defining exactly which node categories and relationships exist in your data. We used 6 node types and 8 edge types. The LLM can only extract what this ontology allows.

For example, if it outputs a PERSON to TASK connection with an EXPERIENCED edge, the pipeline rejects it. EXPERIENCED must connect a PERSON to an EPISODE.
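A minimal sketch of that rejection rule. Only the EXPERIENCED rule comes from the post. The table layout and the function names are assumptions, and a full version would list all 8 edge types.

```python
# Edge type -> (required source node type, required target node type).
# Only the rule named in the post is filled in here.
ALLOWED_EDGES = {
    "EXPERIENCED": ("PERSON", "EPISODE"),
}


def is_valid_edge(source_type, edge_type, target_type):
    """True only if the edge type exists and connects the allowed node types."""
    return ALLOWED_EDGES.get(edge_type) == (source_type, target_type)


def filter_extracted(triples):
    """Keep only the (source_type, edge_type, target_type) triples the ontology allows."""
    return [t for t in triples if is_valid_edge(*t)]


extracted = [
    ("PERSON", "EXPERIENCED", "EPISODE"),
    ("PERSON", "EXPERIENCED", "TASK"),  # wrong target type, gets rejected
]
accepted = filter_extracted(extracted)
```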

We also split LLM extraction from deterministic extraction. We create structural entries like Document or Chunk nodes without LLM calls.
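A rough sketch of the deterministic side, with no LLM involved. The ID format, the `HAS_CHUNK` edge label, and the field names are all hypothetical. The post only says that Document and Chunk nodes are created structurally.

```python
def build_structural_graph(uri, chunks):
    """Create Document and Chunk nodes plus linking edges from data we already have."""
    doc_id = f"document:{uri}"
    nodes = [{"_id": doc_id, "type": "Document", "uri": uri}]
    edges = []
    for index, text in enumerate(chunks):
        chunk_id = f"chunk:{uri}#{index}"
        nodes.append({"_id": chunk_id, "type": "Chunk", "index": index, "text": text})
        # HAS_CHUNK is an illustrative label, not one of the post's edge types.
        edges.append({"source": doc_id, "type": "HAS_CHUNK", "target": chunk_id})
    return nodes, edges


nodes, edges = build_structural_graph("notes/a.md", ["first part", "second part"])
```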

Step 5, normalization, is the hardest part. We use a three-phase deduplication process: in-memory fuzzy matching, cross-document resolution against MongoDB, and edge remapping.
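The three phases can be sketched with the standard library. This is an illustration under assumptions, not the actual implementation: `difflib` stands in for whatever fuzzy matcher is used, a plain list of known names stands in for the MongoDB lookup, and the 0.85 threshold is arbitrary.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Case-insensitive fuzzy match on the raw strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def dedupe_in_memory(names, threshold=0.85):
    """Phase 1: map each name to the first earlier name it fuzzily matches."""
    canonical, mapping = [], {}
    for name in names:
        match = next((c for c in canonical if similar(name, c, threshold)), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


def resolve_against_store(mapping, known_names, threshold=0.85):
    """Phase 2: prefer a name that already exists in the store (a list here)."""
    resolved = {}
    for name, canon in mapping.items():
        stored = next((k for k in known_names if similar(canon, k, threshold)), None)
        resolved[name] = stored if stored is not None else canon
    return resolved


def remap_edges(edges, resolved):
    """Phase 3: rewrite edge endpoints to resolved names and drop duplicates."""
    return sorted({(resolved.get(s, s), rel, resolved.get(t, t)) for s, rel, t in edges})


mapping = dedupe_in_memory(["Alice Smith", "alice smith", "Bob"])
resolved = resolve_against_store(mapping, known_names=["ALICE SMITH"])
edges = remap_edges(
    [("Alice Smith", "EXPERIENCED", "Trip"), ("alice smith", "EXPERIENCED", "Trip")],
    resolved,
)
```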

In step 6, we batch-embed the nodes. The system uses a mock for tests, Sentence Transformers for development, and the Voyage API for production.
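One way to make the three backends interchangeable is a shared interface. The sketch below only implements the mock. The `Embedder` protocol, `MockEmbedder`, and `embed_in_batches` are hypothetical names, and the Sentence Transformers and Voyage backends would be separate classes exposing the same `embed` method.

```python
import hashlib
from typing import Protocol, Sequence


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class MockEmbedder:
    """Deterministic fake vectors derived from a hash, for tests only."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


def embed_in_batches(embedder: Embedder, texts: Sequence[str], batch_size: int = 32):
    """Call the backend once per batch and concatenate the results in order."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embedder.embed(texts[start:start + batch_size]))
    return vectors


vectors = embed_in_batches(MockEmbedder(dim=4), ["a", "b", "a"], batch_size=2)
```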

In steps 7 and 8, nodes and edges are stored in a single MongoDB collection as unified memory. We use deterministic string IDs like "person:alice" to prevent duplicates. MongoDB handles documents, $vectorSearch, $text, and $graphLookup in one aggregation pipeline. The $graphLookup stage traverses connected graph data directly in the database. You don't need Neo4j + Pinecone + Postgres for most agent use cases. A single database like MongoDB does the job well. With sharding, it can scale up to a billion records.
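A small sketch of the ID scheme and of a $graphLookup traversal, written as plain Python data so it runs without a database. The slug rules, the collection name `memory`, and the edge fields `source` and `target` are assumptions about how the collection might be laid out.

```python
import re


def make_node_id(node_type: str, name: str) -> str:
    """Build a deterministic ID such as "person:alice" (normalization is illustrative)."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{node_type.lower()}:{slug}"


def neighborhood_pipeline(start_id: str, collection: str = "memory", max_depth: int = 1):
    """Aggregation pipeline that follows edge documents outward from one node."""
    return [
        {"$match": {"_id": start_id}},
        {
            "$graphLookup": {
                "from": collection,
                "startWith": "$_id",
                # Follow each edge's target to the edges that start there.
                "connectFromField": "target",
                "connectToField": "source",
                "as": "reachable_edges",
                "maxDepth": max_depth,  # caps how far the recursion goes
            }
        },
    ]


alice_id = make_node_id("Person", "Alice")
pipeline = neighborhood_pipeline(alice_id)
```

Because the same input always produces the same `_id`, writing a node with an upsert (for example pymongo's `replace_one(..., upsert=True)`) overwrites the existing record instead of creating a second one.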

Steps 9 through 11 cover retrieval. The agent calls tools through an MCP server. It uses a search-memory tool that combines hybrid vector and text search with graph expansion, alongside a query-memory tool that translates natural language into MongoDB aggregations. The agent also uses ingest tools to write back to the database for continual learning.
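The post does not say how the vector and text results are merged. One common option is reciprocal rank fusion, sketched here as an assumption rather than the method actually used, followed by a one-hop graph expansion over a hypothetical in-memory adjacency map. The node IDs other than "person:alice" are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each hit scores 1 / (k + rank), summed per ID."""
    scores = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    # Highest score first, ties broken by ID for stable output.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))


def expand_with_neighbors(seed_ids, adjacency, limit=10):
    """Append direct neighbors of the seeds that are not already in the result."""
    result = list(dict.fromkeys(seed_ids))
    for node_id in seed_ids:
        for neighbor in adjacency.get(node_id, []):
            if neighbor not in result:
                result.append(neighbor)
    return result[:limit]


vector_hits = ["person:alice", "task:report"]
text_hits = ["task:report", "episode:trip"]
fused = reciprocal_rank_fusion([vector_hits, text_hits])
expanded = expand_with_neighbors(fused, {"person:alice": ["episode:trip", "person:bob"]})
```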

Here are a few things I am still struggling with and would love your opinion on:

How are you handling entity/relationship resolution across documents?

What helped you the most to optimize the extraction of entities/relationships using LLMs?

How do you keep embeddings in sync after graph updates?

Also, while building my personal assistant, I have been writing about this system on LinkedIn over the past few months. Here are the posts that go deeper into each piece:

- 3 ways to run embedding models: https://www.linkedin.com/feed/update/urn:li:activity:7443288346153480192

- LangChain gave me a knowledge graph in 10 minutes: https://www.linkedin.com/feed/update/urn:li:activity:7440751582381494272

- Palantir built a $400B empire on ontology-first AI: https://www.linkedin.com/feed/update/urn:li:activity:7434591082367320064

- Ingestion architecture for Digital Twin agent: https://www.linkedin.com/feed/update/urn:li:activity:7432054336589021184

- Most AI agents don't need three databases: https://www.linkedin.com/feed/update/urn:li:activity:7426981104227856385
