Ask HN: How are you doing RAG locally?

413•tmaly•3w ago

I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

Comments

eajr•3w ago

Local LibreChat which bundles a vector db for docs.

whattheheckheck•3w ago

AnythingLLM is promising

rahimnathwani•3w ago

If your data aren't too large, you can use faiss-cpu and pickle

https://pypi.org/project/faiss-cpu/

notyourwork•3w ago

For the uneducated, how large is too large? Curious.

itake•3w ago

FAISS runs in RAM. If your dataset can't fit into RAM, FAISS is not the right tool.

hahahahhaah•3w ago

Should it be:

If the total size of your data isn't too large...?

Data being a plural gets me.

You might have small datums but a lot of kilobytes!

pousada•3w ago

Data is technically a plural but nobody uses the singular and it’s being used as a singular term often - which is completely fine I think, nobody speaks Latin anyway

DonHopkins•3w ago

The opposite of Data is Lore.

motakuk•3w ago

LightRAG, Archestra as a UI with LightRAG mcp

ramesh31•3w ago

SQLite with FTS5

nineteen999•3w ago

A little BM25 can get you quite a way with an LLM.

jeffchuber•3w ago

Try out Chroma, or better yet, ask Opus to!

electroglyph•3w ago

simple lil setup with qdrant

CuriouslyC•3w ago

Don't use a vector database for code, embeddings are slow and bad for code. Code likes bm25+trigram, that gets better results while keeping search responses snappy.
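To make the BM25 half of that concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. The tokenizer, the `k1`/`b` defaults and the three sample snippets are illustrative assumptions, not anything from a specific library, and the trigram index is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters, digits and underscores (deliberately simple)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up code snippets standing in for indexed source files.
docs = [
    "def parse_config(path): return json.load(open(path))",
    "def render_template(name, context): ...",
    "class ConfigError(Exception): pass",
]
scores = bm25_scores("parse_config", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because identifiers like `parse_config` are kept as single tokens, an exact identifier query only scores the snippet that contains it, which is roughly why keyword search tends to behave predictably on code.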

lee1012•3w ago

I'm finding static embedding models quite fast: lee101/gobed (https://github.com/lee101/gobed) is 1ms on GPU :) It would need to be trained for code, though. The bigger code LLM embeddings can be high quality too, so it's really about where the ideal point is on the Pareto frontier. Often, though, you're right: it tends to be BM25 or rg even for code. More complex solutions are possible too if it's really important that the search is high quality.

itake•3w ago

With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?

CuriouslyC•3w ago

IME most documentation is coming from the web via web search. I like agentic RAG for this case, which you can achieve easily with a Claude Code subagent.

postalcoder•3w ago

I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search but the constant need to re-index files was annoying and a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and if you're not using a re-ranker.

For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool calling god compared to everything else i've tried, and fast.
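A rough sketch of what "a model in a loop with a search tool" can look like, under loud assumptions: `ask_model` is a hypothetical callable standing in for the real model call, the action format (`{"search": ...}` or `{"answer": ...}`) is invented for this example, and a small in-memory regex search stands in for ripgrep so the snippet runs on its own.

```python
import re


def search_files(pattern, files, max_hits=20):
    """Regex search over {path: text}; returns 'path:line_no:line' strings, like grep output."""
    regex = re.compile(pattern)
    hits = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path}:{line_no}:{line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits


def run_agent(question, files, ask_model, max_steps=5):
    """Ask the model, run the search it requests, feed results back, until it answers."""
    transcript = [("user", question)]
    for _ in range(max_steps):
        action = ask_model(transcript)
        if "answer" in action:
            return action["answer"]
        hits = search_files(action["search"], files)
        transcript.append(("tool", "\n".join(hits) or "no matches"))
    return None  # gave up after max_steps


def fake_model(transcript):
    """Hypothetical stand-in for a real model: searches once, then answers from the hits."""
    role, content = transcript[-1]
    if role == "user":
        return {"search": r"def\s+load_settings"}
    return {"answer": content.split(":")[0]}


# Made-up two-file project.
files = {
    "app/main.py": "from app.settings import load_settings\nload_settings()",
    "app/settings.py": "import os\n\ndef load_settings():\n    return dict(os.environ)",
}
answer = run_agent("Where is load_settings defined?", files, fake_model)
```

In a real setup the search function would shell out to ripgrep and `ask_model` would call the local model with tool definitions; the loop shape stays the same.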

threecheese•3w ago

Say more!

ehsanu1•3w ago

I've gotten great results applying embeddings to file paths + signatures. Even better if you also fuse those results with BM25.

CuriouslyC•3w ago

I like embeddings for natural language documents where your query terms are unlikely to be unique, and overall document direction is a good disambiguator.

rao-v•3w ago

Anybody know of a good service / docker that will do BM25 + vector lookup without spinning up half a dozen microservices?

donkeyboy•3w ago

Elasticsearch / Opensearch is the industry standard for this

abujazar•3w ago

Used to be, but they're very complicated to operate compared to more modern alternatives and have just gotten more and more bloated over the years. Also require a bunch of different applications for different parts of the stack in order to do the same basic stuff as e.g. Meilisearch, Manticore or Typesense.

cluckindan•3w ago

>very complicated to operate compared to more modern alternatives

Can you elaborate? What makes the modern alternatives easier to operate? What makes Elasticsearch complicated?

Asking because in my experience, Elasticsearch is pretty simple to operate unless you have a huge cluster with nodes operating in different modes.

abujazar•2w ago

Sure, I've managed both clusters and single node deployments in production until 2025 when I changed jobs. Elastic definitely does have its strengths, but they're increasingly enterprise-oriented and appear not to care a lot about open source deployments. At one point Elastic itself had a severe regression in an irreversible patch update (!?) which took weeks to fix, forcing us to recover from backup and recreate the index. The documentation is or has been ambiguous and self-contradicting on a lot of points. The Debian Elastic Enterprise Search package upgrade script was incomplete, so there's a significant manual process for updating the index even for patch updates. The interfaces between the different components of the ELK stack are incoherent and there's literally a thousand ways to configure them. Default setups have changed a lot over the years, leading to incoherent documentation. You really need to be an expert at Elastic in order to run it well – or pay handsomely for the service. It's simply too complicated and costly for what it is, compared to more recent alternatives.

abujazar•3w ago

Meilisearch

porridgeraisin•3w ago

For BM25 + trigram, SQLite FTS5 works well.

cipherself•3w ago

Here's a Dockerfile that will spin up postgres with pgvector and paradedb https://gist.github.com/cipherself/5260fea1e2631e9630081fb7d...

You can use pgvector for the vector lookup and paradedb for bm25.

jankovicsandras•3w ago

You can do hybrid search in Postgres.

Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 BM25 search implemented in PL/pgSQL ( Unlicense / Public domain )

The repo includes also plpgsql_bm25rrf.sql : PL/pgSQL function for hybrid search ( plpgsql_bm25 + pgvector ) with Reciprocal Rank Fusion; and Jupyter notebook examples.
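For readers who have not met Reciprocal Rank Fusion: it only needs the rank positions from each retriever, not their raw scores. A generic Python sketch (not the PL/pgSQL function from the repo; the document ids are made up, and `k=60` is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids; each list is ordered best-first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returned.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical vector results
fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused_ranking)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists rise to the top, which is the main reason RRF is a popular default for combining BM25 and vector results.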

canadiantim•3w ago

Wow very impressive library great work!

Der_Einzige•3w ago

This is true in general with LLMs, not just for code. LLMs can be told that their RAG tool is using BM25+N-grams, and will search accordingly. Keyword search is superior to embeddings-based search. The moment Google switched to BERT-based embeddings for search everyone agreed it was going downhill. Most forms of early enshittification were simply switching off BM25 to embeddings-based search.

BM25/tf-idf and N-grams have always been extremely difficult-to-beat baselines in information retrieval. This is why embeddings still have not led to a "ChatGPT" moment in information retrieval.

lee1012•3w ago

lee101/gobed (https://github.com/lee101/gobed): static embedding models, so text is embedded in milliseconds, plus on-GPU search with a CAGRA-style GPU index. It has a few things for speed, like int8 quantization of the embeddings and fused embedding and search in the same kernel, since the embedding really is just a trained map of embeddings per token plus averaging.

pdyc•3w ago

sqlite's bm25

init0•3w ago

I built a lib for myself https://pypi.org/project/piragi/

stingraycharles•3w ago

That looks great! Is there a way to store / cache the embeddings?

jeanloolz•3w ago

Sqlite-vec

petesergeant•3w ago

I’ve got it deployed in production for a dataset that changes infrequently and it works really well

ehsanu1•3w ago

Embedded usearch vector database. https://github.com/unum-cloud/USearch

dvorka•3w ago

Any suggestion what to use as embeddings model runtime and semantic search in C++?

cbcoutinho•3w ago

The Nextcloud MCP Server [0] supports Qdrant as a vectordb to store embeddings and provide semantic search across your personal documents. This turns any LLM & MCP client (e.g. Claude Code) into a RAG system that you can use to chat with your files.

For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.

[0] https://github.com/cbcoutinho/nextcloud-mcp-server

lormayna•3w ago

I have done some experiments with nomic embedding through Ollama and ChromaDB.

Works well, but I haven't tested it at a larger scale.

tebeka•3w ago

https://duckdb.org/2024/05/03/vector-similarity-search-vss

m00dy•3w ago

does duckdb scale well over large datasets for vector search ?

lgrebe•3w ago

What order of magnitude would you define as „large“ in this case?

m00dy•3w ago

like over 1tb.

cess11•3w ago

Some people are using DuckDB for large datasets, https://duckdb.org/docs/stable/guides/performance/working_wi... , but you'd probably do some testing under the specific conditions of your rig to figure out if it is a good match or not.

riku_iki•3w ago

its clear many DuckDB sql queries can handle terabytes of data, but the question here was about vector search..

jlarks32•3w ago

+1 on this one, I've been pleasantly surprised by this for a small (<3GB) local project

autogn0me•3w ago

https://github.com/ggozad/haiku.rag/ - the embedded lancedb is convenient and has benchmarks; uses docling. qwen3-embedding:4b, 2560 w/ gpt-oss:20b.

miohtama•3w ago

+1 for Haiku! It's very simple to get up and running.

baalimago•3w ago

I thought that context building via tooling was shown to be more effective than rag in practically every way?

Question being: WHY would I be doing RAG locally?

petesergeant•3w ago

For code, maybe? For documents, no, text embeddings are magical alien technology.

beret4breakfast•3w ago

For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.

sidrag22•3w ago

i took a similar path, i spun up a discord bot, used ollama, pgvector, docling for random documents, and made some specialized chunking strategies for some clunkier json data. its been a little while since i messed with it, but i really did enjoy it when i was.

it all moves so fast, i wouldnt be surprised if everything i made is now crazy outdated and it was probably like 2 months ago.

Strift•3w ago

I just use a web server and a search engine.

TL;DR:
- chunk files, index chunks
- vector/hybrid search over the index
- node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)

I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api

esperent•3w ago

I'm lucky enough to have 95% of my docs in small markdown files so I'm just... not (+). I'm using SQLite FTS5 (full text search) to build a normal search index and using that. Well, I already had the index so I just wired it up to my Mastra agents. Each file has a short description field, so if a keyword search surfaces the doc they check the description and if it matches, load the whole doc.

This took about one hour to set up and works very well.

(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
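A minimal sketch of that kind of setup using Python's built-in `sqlite3` module. It assumes the bundled SQLite was compiled with FTS5 (common, but not guaranteed); the table name, column names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumes the SQLite build bundled with Python includes the FTS5 extension.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path, description, body)")
conn.executemany(
    "INSERT INTO docs (path, description, body) VALUES (?, ?, ?)",
    [
        ("deploy.md", "How to deploy the API to staging",
         "Run the deploy script, then check the health endpoint."),
        ("oncall.md", "On-call rotation and escalation",
         "Page the secondary if the primary does not respond."),
    ],
)


def search_docs(query, limit=5):
    """Return (path, description) pairs, best match first (FTS5's default rank is BM25)."""
    return conn.execute(
        "SELECT path, description FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


def load_doc(path):
    """Fetch the full body once the description looks relevant."""
    row = conn.execute("SELECT body FROM docs WHERE path = ?", (path,)).fetchone()
    return row[0] if row else None


results = search_docs("deploy")
```

The agent would call something like `search_docs` first, read the short descriptions, and only then call `load_doc` for the files worth putting into context. Queries containing FTS5 syntax characters (hyphens, quotes) would need escaping, which this sketch skips.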

dmos62•3w ago

Retrieval-augmented generation. What you described is a perfect example of a RAG. An embedding-based search might be more common, but that's a detail.

esperent•3w ago

Well, that is what the acronym stands for. But every source I've ever seen quickly follows by noting it's retrieval backed by a vectordb. So we'd probably find an even split of people who would call this RAG or not.

xpe•3w ago

What are your sources?

The backing method doesn’t matter as long as it works. This is clear from good RAG survey papers, Wikipedia, and (broadly) understanding the ethos of machine learning engineers and researchers: specific implementation details are usually means to an end, not definitional boundaries.

This may be of interest:

https://github.com/ibm-self-serve-assets/Blended-RAG

> So we'd probably find an even split of people who would call this RAG or not.

Maybe but not likely. This is sometimes called the 50-50 fallacy or the false balance of probability or the equiprobability bias.

https://pmc.ncbi.nlm.nih.gov/articles/PMC4310748/

“The equiprobability bias (EB) is a tendency to believe that every process in which randomness is involved corresponds to a fair distribution, with equal probabilities for any possible outcome. The EB is known to affect both children and adults, and to increase with probability education. Because it results in probability errors resistant to pedagogical interventions, it has been described as a deep misconception about randomness: the erroneous belief that randomness implies uniformity. In the present paper, we show that the EB is actually not the result of a conceptual error about the definition of randomness.”

You can also find an ELI5 Reddit thread on this topic where one comment summarizes it as follows:

“People are conflating the number of distinguishable outcomes with the distribution of probability directly.”

https://www.reddit.com/r/explainlikeimfive/comments/1bpor68/...

spqw•3w ago

I am surprised to see very few setups leveraging LSP (Language Server Protocol) support. It was added to Claude Code last month. Most setups rely on naive grep.

woggy•3w ago

I've written a few terminal tools on top of Roslyn to assist Claude in code analysis for C# code. Obviously the tools are also written with the help of Claude. Worked quite well.

aqula•3w ago

LSP is not great for non-editor use cases. Everything is cursor position oriented.

HarHarVeryFunny•3w ago

Yes, something like TreeSitter would seem to be of more value - able to lookup symbols by name, and find the spans of source code where they are defined and used.

alchemist1e9•3w ago

https://github.com/ast-grep/ast-grep

HarHarVeryFunny•3w ago

I don't see ast-grep as being very useful to an agent.

What a coding agent needs is to be able to locate portions of source code relevant to what it has been tasked with, and preferably in more context-efficient fashion than just grepping and loading entire source files into context. One way to do this is something like Cursor's vector index of code chunks, and another would be something like TreeSitter (or other identifier-based tools) that knows where identifiers (variables, functions) are defined and used.

Language servers (LSP) are not useful for this task since they can't tell the agent "where is function foo() defined" (but TreeSitter can), since as someone else noted language servers are based on location (line number) not content (symbols). Language servers are designed to help editors.

It's possible that ast-grep might be of some use to a coding agent, but looking for syntax/AST patterns rather than just identifier definitions and usages seems a much more niche facility.

WilcoKruijer•3w ago

There are actions that don't require cursor position, like document/workspace symbols, that could be useful.

d4rkp4ttern•3w ago

LSP is currently broken in CC:

https://github.com/anthropics/claude-code/issues/15168

geuis•3w ago

I don't. I actually write code.

To answer the question more directly, I've spent the last couple of years with a few different quant models mostly running on llama.cpp and ollama, depending. The results are way slower than the paid token api versions, but they are completely free of external influence and cost.

However the models I've tested generally turn out to be pretty dumb at the quant level I'm running to be relatively fast. And their code generation capabilities are just a mess not to be dealt with.

softwaredoug•3w ago

I built a Pandas extension SearchArray, I just use that (plus in memory embeddings) for any toy thing

https://github.com/softwaredoug/searcharray

Bombthecat•3w ago

AnythingLLM for documents, amazing tool!

lsb•3w ago

I'm using Sonnet with 1M Context Window at work, just stuffing everything in a window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy when I'm using local models with Ollama

bzGoRust•3w ago

At my company, we built an internal chatbot based on RAG using LangChain + Milvus + an LLM. Since the documents are well formatted, overlapping chunking is easy, and all the chunks are inserted into the Milvus vector DB. Hybrid search (combining dense and sparse search) is natively supported in Milvus, which helps us retrieve better and so get better-quality answers.

cluckindan•3w ago

Hybrid search usually refers to traditional keyword search (BM25, TF-IDF) combined with a vector similarity search.

__jf__•3w ago

For vector generation I started using Meta-Llama-3-8B in April 2024 with Python and Transformers for each text chunk on an RTX A6000. Wow, that thing was fast but noisy, and it also burns 500W. So a year ago I switched to an M1 Ultra and only had to replace Transformers with Apple's MLX Python library. Approximately the same speed but less heat and noise. The Llama model has 4k dimensions, so at fp16 that's 8 kilobytes per chunk, which I store in a BLOB column in SQLite via numpy.save(). Between running on the RTX and the M1 there is a very small difference in vector output, but not enough for me to change retrieval results, regenerate the vectors or change to another LLM.

For retrieval I load all the vectors from the SQLite database into a numpy.array and hand it to FAISS. Faiss-gpu was impressively fast on the RTX A6000 and faiss-cpu is slower on the M1 Ultra but still fast enough for my purposes (I'm firing a few queries per day, not per minute). For 5 million chunks memory usage is around 40 GB, which both fits into the A6000 and easily fits into the 128GB of the M1 Ultra. It works, I'm happy.
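The storage figures can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers from the comment, reading "4k dimensions" as 4096; the float32 line is an added what-if, not something the comment states.

```python
DIMS = 4096            # "4k dimensions", read here as 4096
BYTES_PER_VALUE = 2    # fp16
N_CHUNKS = 5_000_000

bytes_per_chunk = DIMS * BYTES_PER_VALUE
total_fp16_bytes = bytes_per_chunk * N_CHUNKS
total_fp32_bytes = total_fp16_bytes * 2  # what-if: the same vectors held as float32

print(bytes_per_chunk)          # 8192
print(total_fp16_bytes / 1e9)   # 40.96
print(total_fp32_bytes / 1e9)   # 81.92
```

So 8 KB per chunk and roughly 40 GB for 5 million chunks are consistent with each other, and the float32 case would still fit in 128 GB of unified memory.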

sinandrei•3w ago

Anyone use these approaches with academic pdfs?

urschrei•3w ago

Another approach is to teach Claude Code how to use your Zotero library's full-text search: https://github.com/urschrei/zotero_search_skill.

amelius•3w ago

Anyone using them for electronics datasheets?

bradfa•3w ago

I would like to. I haven't yet found a solution that works well.

The problems with datasheets are tables that span multiple pages, embedded images for diagrams and plots, the fact that they're generally PDFs, and that only sometimes are they in a 2-column layout.

Converting from PDF to markdown while retaining tables correctly seems to work well for me with Mistral's latest OCR model, but this isn't an open model. Using docling with different models has produced much worse results.

sosojustdo•3w ago

I've been working on a tool specifically to handle these messy PDF-to-Markdown conversions because I ran into the same issues with tables and multi-column layouts.

I’ve optimized https://markdownconverter.pro/pdf-to-markdown to handle complex PDFs, including those tricky tables that span multiple pages and 2-column formats that usually trip up tools like Docling. It also extracts embedded diagrams/images and links them properly in the output.

Full disclosure: I'm the developer behind it. I’d love to see if it handles your specific datasheets better than the models you've tried. Feel free to give it a spin!

bradfa•3w ago

Cool! But given that often electronics documentation is covered by NDAs, my preferred solution is local-first if at all possible.

alansaber•3w ago

I've not seen any impressive products. But products do exist ie https://scibite.com/solutions/semantic-search/

podgietaru•3w ago

I made a small RAG database just using Postgres. I outlined it in the blog post below. I use it for RSS Feed organisation, and searching. They are small blobs. I do the labeling using a pseudo-KNN algorithm.

https://aws.amazon.com/blogs/machine-learning/use-language-e...

The code for it is here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...

The example link no longer works, as I no longer work at AWS.

acutesoftware•3w ago

I am using LangChain with a SQLite database - it works pretty well on a 16G GPU, but I started running it on a crappy NUC, which also worked with lesser results.

The real lightbulb moment is when you realise the ONLY thing a RAG passes to the LLM is a short string of search results with small chunks of text. This changes it from 'magic' to 'ahh, ok - I need better search results'. With small models you cannot pass a lot of search results ( TOP_K=5 is probably the limit ), otherwise the small models 'forget context'.

It is fun trying to get decent results - and it is a rabbithole, next step I am going into is pre-summarising files and folders.

I open sourced the code I was using - https://github.com/acutesoftware/lifepim-ai-core

reactordev•3w ago

You can expand your context window to something like 100,000 to prevent memory loss.

IXCoach•2w ago

You can modify this, there are settings for:
- how much context
- chunk size

We had to do this, 3 best matches but about 1000 characters each was far more effective than the default I ran into of 15-20 snippets of 4 sentences each

We also found a setting for "when do you cut off and/or start" the chunk, and set it to double new lines

Then just structured our agentic memory into meaningful chunks with 2 new lines between each, and it gelled perfectly.

( hope this helps )
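A small sketch of the "cut on double newlines, aim for about 1000 characters" idea described above. The function name and the packing rule are assumptions made for illustration, not the settings of any particular tool.

```python
def chunk_on_blank_lines(text, max_chars=1000):
    """Split on double newlines, then pack whole paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-text.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


# Made-up memory notes separated by blank lines.
notes = "First memory about deploys.\n\nSecond memory about billing.\n\nThird memory about on-call."
chunks = chunk_on_blank_lines(notes, max_chars=60)
```

With the tiny `max_chars=60` used here, the first two notes share a chunk and the third gets its own, so each retrieved chunk is always a set of complete notes.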

SamLeBarbare•3w ago

sqlite + FTS + sqlite-vec + local LLM for reranking results (reasoning model)

yandrypozo•3w ago

this's pretty cool, which LLM are you using currently?

robotswantdata•3w ago

You don’t need a vector database or graph, it really depends on your existing infrastructure , file types and needs.

The newer “agent” search approach can just query a file system or api. It’s slightly slower but easier to setup and maintain as no extra infrastructure.

beklein•3w ago

Most of my complex documents are, luckily, Markdown files.

I can recommend https://github.com/tobi/qmd/ . It’s a simple CLI tool for searching in these kinds of files. My previous workflow was based on fzf, but this tool gives better results and enables even more fuzzy queries. I don’t use it for code, though.

Aachen•3w ago

Given that preface, I was really expecting that link to be a grepping tool rewritten in golang or something, or perhaps customised for markdown to weigh matches in "# heading title"s heavier for example

whacked_new•3w ago

Here's a rust one: https://github.com/BeaconBay/ck

I haven't used it extensively, but semantic grep alone was kind of worth it.

Aachen•3w ago

Right, I should have said Rust. Golang is so 2017!

codebolt•3w ago

Giving the LLM tools with an OData query interface has worked well for me. In C# it's pretty trivial to set up an MCP server with OData querying for an arbitrary data model. At work we have an Excel sheet with 40k rows which the LLM was able to quickly and reliably analyse using this method.

lmeyerov•3w ago

Claude code / codex which internally uses ripgrep, and I'm unsure if it's using parallel mode. And, project specific static analyzers.

Studies generally show when you do agentic retrieval w/ text search, that's pretty good. Adding vector retrieval and graph rag, so the typical parallel multi-retrieval followed by reranking, gives a bit of speedup and quality lift. That lines up with my local flow experience, where it is only enough that I want that for $$$$ consumer/prosumer tools, and not easy enough for DIY that I want to invest in that locally. For those who struggle with tools like spotlight running when it shouldn't, that kind of thing turns me off on the cost/benefit side.

For code, I experiment with unsound tools (semgrep, ...) vs sound flow analyzers, carefully set up for the project. Basically, AI coders love to use grep/sed for global replace refactors and other global needs, but keep getting tripped up on sound flow analysis. Similar to lint and type checking, that needs to be set up for a project and taught as a skill. I'm not happy with any of my experiments here yet however :(

mmargenot•3w ago

Cursor uses a vector index, some details here: https://cursor.com/docs/context/semantic-search

lmeyerov•3w ago

Thanks!

Their discussion is super relevant to exactly what I wrote --

* They note speed benefits
* The quality benefit they note is synonym search... which agentic text search can do: Agents can guess synonyms in the first shot for you, eg, `navigation` -> `nav|header|footer`, and they'll be iterating anyways

To truly do better, and not make the infra experience stink, it's real work. We do it on our product (louie.ai) and our service engagements, but real costs/benefits.

oliveiracwb•3w ago

We handle ~300k customer interactions per day, so latency and precision really matter. We built an internal RAG-based portal on top of our knowledge base (basically a much better FAQ).

On the retrieval side, I built a custom search/indexing layer (Node) specifically for service traceability and discovery. It uses a hybrid approach — embeddings + full-text search + IVF-HNSW — to index and cross-reference our APIs, services, proxies and orchestration repos. The RAG pipelines sit on top of this layer, which gives us reasonable recall and predictable latency.

Compliance and observability are still a problem. Every year new vendors show up promising audits, data lineage and observability, but none of them really handle the informational sprawl of ~600 distributed systems. The entropy keeps increasing.

Lately I’ve been experimenting with a more semantic/logical KAG approach on top of knowledge graphs to map business rules scattered across those systems. The goal is to answer higher-level questions about how things actually work — Palantir-like outcomes, but with explicit logic instead of magic.

Curious if others are moving beyond “pure RAG” toward graph-based or hybrid reasoning setups.

jackfranklyn•3w ago

For document processing in a side project, I've been using a local all-MiniLM model with FAISS. Works well enough for semantic matching against ~50k transaction descriptions.

The real challenge wasn't model quality - it was the chunking strategy. Financial data is weirdly structured and breaking it into sensible chunks that preserve context took more iteration than expected. Eventually settled on treating each complete record as a chunk rather than doing sliding windows over raw text. The "obvious" approaches from tutorials didn't work well at all for structured tabular-ish data.

pj4533•3w ago

Well this isn’t code, but I’ve been working on a memory system for Claude Code. This portion provides semantic search over the session files in .claude/projects. It uses OpenAI for embeddings so not completely local (would be easy to modify) and storage in ChromaDB.

https://github.com/pj4533/seance

reactordev•3w ago

I have three tools dedicated to this.

save_memory, recall_memory, search

Save memory vectorizes a session, summarizes it, and stores it in SQLite. Recall memory takes vector or a previous tool run id and loads the full text output. Search takes a vector array or string array and searches through the graph using fuzzy matching and vector dot products.

It’s not fancy, but it works really well. gpt-oss

yakkomajuri•3w ago

I've written about this (and the post was even here on HN) but mostly from the perspective of running a RAG on your infra as an organization. But I cover the general components and alternatives to Cloud services.

Not sure how useful it is for what you need specifically: https://blog.yakkomajuri.com/blog/local-rag

prakashn27•3w ago

I feel a local RAG system slows down my computer (I've got an M1 Pro, 32 GB).

So I use a hosted one to prevent this. My business uses a vector DB, so I created a new DB to vectorize and host my knowledge base.
1. All my knowledge base is markdown files, so I split them by header tags.
2. Each split is hashed and the hash value is stored in SQLite.
3. The hashed version is vectorized and pushed to the cloud DB.
4. Whenever I make changes, I run a script which splits and checks the hash. If it has changed, then I upsert the document. If not, I don't do anything. This helps me keep the store up to date.

For search I have a cli query which searches and fetches from vector store.
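The split, hash and upsert-on-change steps can be sketched with the standard library alone. Everything named here is an assumption for illustration: the `section_hashes` table, the positional `doc_id#index` keys, and the `upsert` callback, which stands in for "embed the section and push it to the vector store".

```python
import hashlib
import re
import sqlite3


def split_by_headers(markdown):
    """Split a Markdown document into sections, each starting at a '#' header line."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]


def sync_sections(conn, doc_id, markdown, upsert):
    """Call upsert(key, text) only for sections whose SHA-256 hash changed; return those keys."""
    conn.execute("CREATE TABLE IF NOT EXISTS section_hashes (key TEXT PRIMARY KEY, hash TEXT)")
    changed = []
    for i, section in enumerate(split_by_headers(markdown)):
        key = f"{doc_id}#{i}"
        digest = hashlib.sha256(section.encode("utf-8")).hexdigest()
        row = conn.execute("SELECT hash FROM section_hashes WHERE key = ?", (key,)).fetchone()
        if row and row[0] == digest:
            continue  # unchanged, so skip re-embedding
        upsert(key, section)  # hypothetical: embed and push to the vector store
        conn.execute(
            "INSERT OR REPLACE INTO section_hashes (key, hash) VALUES (?, ?)", (key, digest)
        )
        changed.append(key)
    conn.commit()
    return changed


conn = sqlite3.connect(":memory:")
pushed = []
first = sync_sections(conn, "kb.md", "# Setup\nInstall it.\n\n# Usage\nRun it.",
                      lambda key, text: pushed.append(key))
second = sync_sections(conn, "kb.md", "# Setup\nInstall it.\n\n# Usage\nRun it twice.",
                       lambda key, text: pushed.append(key))
```

On the first run both sections are pushed; on the second only the edited "Usage" section is. Positional keys are the weak point of this sketch: inserting a section in the middle shifts every later key, and deleted sections are never cleaned up.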

metawake•3w ago

I am running a vector DB from a Docker image. And for debugging and benchmarking local RAG retrieval, I've been building a CLI tool that shows what's actually being retrieved:

  ragtune explain "your query" --collection prod

Shows scores, sources, and diagnostics. Helps catch when your chunking or embeddings are silently failing or you need numeric estimations to base your judgements on.

Open source: https://github.com/metawake/ragtune

yokuze•3w ago

I made, and use this: https://github.com/libragen/libragen

It’s a CLI tool and MCP server for creating discrete, versioned “libraries” of RAG-able content.

Under the hood, it uses an embedding model locally. It chunks your content and stores embeddings in SQLite. The search functionality uses vector + keyword search + a re-ranking model.

You can also point it at any GitHub repo and it will create a RAG DB out of it.

You can also use the MCP server to create and query the libraries.

Site: https://www.libragen.dev/

bradfa•3w ago

Your README references a file named LICENSE which doesn't seem to exist on the main branch.

yokuze•3w ago

Fixed. Thank you!

navar•3w ago

For the retrieval stage, we have developed a highly efficient, CPU-only-friendly text embedding model:

https://huggingface.co/MongoDB/mdbr-leaf-ir

It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).

You can see an example comparing semantic (i.e., embeddings-based) search vs bm25 vs hybrid here: http://search-sensei.s3-website-us-east-1.amazonaws.com (warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)

This mini app illustrates the advantage of semantic vs bm25 search. For instance, embedding models "know" that j lo refers to jennifer lopez.

We have also published the recipe to train this type of model if you are interested in doing so; we show that it can be done on relatively modest hardware and training data is very easy to obtain: https://arxiv.org/abs/2509.12539

jasonjmcghee•3w ago

How does performance (embedding speed and recall) compare to minish / model2vec static word embeddings?

navar•3w ago

I interacted with the authors of these models quite a bit!

These are very interesting models.

The tradeoff here is that you get even faster inference, but lose on retrieval accuracy [0].

Specifically, inference will be faster because essentially you are only doing tokenization + a lookup table + an average. So despite the fact that their largest model is 32M params, you can expect inference speeds to be higher than ours, which has 23M params but is transformer-based.
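To illustrate "tokenization + a lookup table + an average", here is a toy sketch. The vocabulary and vector values are made up, and whitespace splitting stands in for a real tokenizer; actual static embedding models use a trained table with far more tokens and dimensions.

```python
import math

# Toy lookup table standing in for a trained per-token embedding matrix (values are made up).
TOKEN_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "the": [0.1, 0.1, 0.1],
}
DIM = 3


def embed(text):
    """Static embedding: tokenize, look up each known token, average the vectors."""
    vectors = [TOKEN_VECTORS[t] for t in text.lower().split() if t in TOKEN_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


query = embed("the cat")
sim_kitten = cosine(query, embed("the kitten"))
sim_car = cosine(query, embed("the car"))
```

There is no attention or matrix multiplication at embedding time, only lookups and a mean, which is where the speed comes from; the cost is that word order and context are ignored.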

I am not sure about typical inference speeds on a CPU for their models, but with ours you can expect to do ~22 docs per second, and ~120 queries per second on a standard 2vCPU server.

As far as retrieval accuracy goes, on BEIR we score 53.55, all-MiniLM-L12-v2 (a widely adopted compact text embedding model) scores 42.69, while potion-8M scores 30.43.

I can't find their larger models but you can generally get an idea of the power level of different embedding models here: https://huggingface.co/spaces/mteb/leaderboard

If you want to run them on a CPU it may make sense to filter for smaller models (e.g., <100M params). On the other side our models achieve higher retrieval scores.

[0] "accuracy" in layman terms, not in accuracy vs recall terms. The correct word here would be "effectiveness".

rcarmo•3w ago

Hmmm. I recently created https://github.com/rcarmo/asterisk-embedding-model, need to look at this since I had very limited training resources.

3abiton•3w ago

And honestly in a lot of the cases bm25 has been the best approach used in many of the projects we deployed.

HanClinto•3w ago

Thank you for publishing this! I absolutely love small embedding models, and have used them on a number of projects (both commercial and hobbyist). I look forward to checking this one out!

I don't know if this is too much to ask, but something that would really help me adopt your model is to include a fine-tuning setup. The BGE series of embeddings-models has been my go-to for a couple of years now -- not because it's the best-performing in the leaderboards, but because they make it so incredibly easy to fine-tune the model [0]. Give it a JSONL file of a bunch of training triplets, and you can fine-tune the base models on your own dataset. I appreciate you linking to the paper on the recipe for training this type of model -- how close to turnkey is your model to helping me do transfer learning with my own dataset? I looked around for a fine-tuning example of this model, and didn't happen to see anything, but I would be very interested in trying this one out.

Does support for fine-tuning already exist? If so, then I would be able to switch to this model away from BGE immediately.

* [0] - https://github.com/FlagOpen/FlagEmbedding/tree/master/exampl...

navar•3w ago

As far as I can tell it should be possible to reuse this fine tuning code entirely and just replace `--embedder_name_or_path BAAI/bge-base-en-v1.5` with `--embedder_name_or_path MongoDB/mdbr-leaf-ir`

Note that bge-base-en-v1.5 is a 110M params model - ours is 23M.

* BEIR performance is bge=53.23 vs ours=53.55

* RTEB performance is bge=43.75 vs ours=44.82

-> overall they should be very similar, except ours is 5x smaller and hence that much faster.

jacekm•3w ago

I am curious: what are you using local RAG for?

scosman•3w ago

Kiln wraps up all the parts in one app. Just drag and drop in files. You can easily compare different configs on your dataset: extraction methods, embedding model, search method (BM25, hybrid, vector), etc.

It uses LanceDB and has dozens of different extraction/embedding models to choose from. It even has evals for checking retrieval accuracy, including automatically generating the eval dataset.

You can use its UI, or call the RAG via MCP.

https://github.com/kiln-ai/kiln

https://docs.kiln.tech/docs/documents-and-search-rag

gaganyatri•3w ago

Built discovery using:

- Qwen-3-VL-8B for document OCR + prompts + tool calls

- ChromaDB for vector storage

- BM25 + embedding model for hybrid RAG

- Backend: FastAPI + Python

- Frontend: React + TypeScript

- vLLM + Docker for model deployment on an L40 GPU

Demo: https://app.dwani.ai

GitHub: https://github.com/dwani-ai/discovery

Now working on adding agentic features through continuous analysis of documents with generated prompts.

juleshenry•3w ago

SurrealDB coupled with local vectorization. Mac M1 16GB

eb0la•3w ago

We started with PGVector just because we already knew Postgres and it was easy to hand over to the operations people.

After some time we noticed a semi-structured field in the prompt had a 100% match with the content needed to process the prompt.

Turns out operators started putting tags both in the input and the documents that needed to match on every use case (not much, about 50 docs).

Now we look for the field first and put the corresponding file in the prompt, then we look for matches in the database using the embedding.

85% of the time we don't need the vectordb.
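A rough sketch of that "check the tag first, fall back to embeddings" routing, under my own assumptions: the tag names, file names, and 2-dimensional vectors are hypothetical, and a plain Python list stands in for the pgvector table.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(tag, query_vec, docs_by_tag, embedded_docs, top_k=1):
    """Return (method, doc names): exact tag match first, embeddings as fallback."""
    if tag is not None and tag in docs_by_tag:
        return "tag", [docs_by_tag[tag]]
    ranked = sorted(embedded_docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return "vector", [name for name, _ in ranked[:top_k]]


# Hypothetical data: one tagged document and two embedded documents.
docs_by_tag = {"INVOICE": "invoice_rules.md"}
embedded_docs = [("refund_policy.md", [0.9, 0.1]), ("shipping_faq.md", [0.1, 0.9])]
```

Returning the method alongside the result makes it easy to count how often the vector path is actually taken.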

alansaber•3w ago

Most vectordb is a hammer looking for a nail

folli•3w ago

I think it can be more efficient for two-step RAG so you can reuse the natural language query directly, but for agentic RAG it might indeed be overkill.

alansaber•3w ago

Exactly this, agree completely

juanre•3w ago

I built https://github.com/juanre/llmemory and I use it both locally and as part of company apps. Quite happy with the performance.

It uses PostgreSQL with pgvector, hybrid BM25, multi-query expansion, and reranking.

(It's the first time I share it publicly, so I am sure there'll be quirks.)

raghavankl•3w ago

I have Python tooling to do indexing and relevance offline using Ollama.

https://github.com/raghavan/pdfgptindexer-offline

claylyons•3w ago

Has anyone tried this? https://aws.amazon.com/s3/features/vectors/

theahura•3w ago

SQLite works shockingly well. The agents know how to write good queries, know how to chain queries, and can generally manipulate the DB however they need. At nori (https://usenori.ai/watchtower) we use SQLite + vec0 + fts5 for semantic and word search
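For readers who have not used the word-search half of that stack, here is a minimal FTS5 sketch using Python's built-in sqlite3 module. It assumes your SQLite build has FTS5 compiled in (most do); the table name, columns, and rows are hypothetical, and the vec0 / semantic side is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("deploy", "how to deploy the service with docker"),
        ("auth", "login tokens and session handling"),
    ],
)


def search(query, limit=5):
    # bm25() returns smaller values for better matches, so ascending order is best-first.
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]
```

Because it is plain SQL, an agent can write and chain these queries itself, which matches the experience described above.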

init0•3w ago

from piragi import Ragi

kb = Ragi(["./docs", "s3://bucket/data/*/*.pdf", "https://api.example.com/docs"])

answer = kb.ask("How do I deploy this?")

that's it! with https://pypi.org/project/piragi/

mooball•3w ago

i thought rag/embeddings were dead with the large context windows. thats what i get for listening to chatgpt.

amscotti•3w ago

More of a proof of concept to test out ideas, but here's my approach for local RAG, https://github.com/amscotti/local-LLM-with-RAG

Using Ollama for the embeddings with “nomic-embed-text”, with LanceDB for the vector database. Recently updated it to use “agentic” RAG, but probably not fully needed for a small project.

someguyiguess•3w ago

Woah. I am doing something very similar also using lancedb https://github.com/nicholaspsmith/lance-context

Mine is much more basic than yours and I just started it a couple of weeks ago.

threecheese•3w ago

There are so many of us doing the same, just had a similar conversation at $work. It’s pretty exciting. I feel like I’m having to shove another 20 years of development experience into my brain with all these new concepts and abstractions, but the dots have been connecting!

vaylian•3w ago

Thank you for being the kind of person who explains what the abbreviation RAG stands for. I have been very confused reading this thread.

someguyiguess•3w ago

I feel this pain! It feels like in the world of LLMs there is a new acronym to learn every day!

For the curious RAG = Retrieval Augmented Generation. From wikipedia: RAG enables large language models (LLMs) to retrieve and incorporate new information from external data sources

__mharrison__•3w ago

Grep (rg)

turnsout•3w ago

The Claude Code model highlights the power of simple search (grep) and selective reads (only reading in excerpts). The only time I vectorize is when I explicitly want similarity-based searching, but that's actually pretty rare.
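As an illustration of "simple search plus selective reads", here is a small sketch that greps in-memory files and returns only a short excerpt around each hit. It is not how Claude Code is implemented; the function name, the `context` parameter, and the sample file are my own assumptions.

```python
import re


def grep_excerpts(files, pattern, context=1):
    """Yield (name, line_number, excerpt) for each matching line.

    `files` maps a file name to its text; a real tool would read from disk.
    """
    regex = re.compile(pattern)
    for name, text in files.items():
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if regex.search(line):
                # Keep `context` lines on each side instead of the whole file.
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                yield name, i + 1, "\n".join(lines[start:end])


# Hypothetical file contents.
files = {"auth.py": "import os\ndef login(user):\n    return check(user)\n"}
hits = list(grep_excerpts(files, r"def login"))
```

Only the excerpts go into the model's context, which keeps token usage low compared with reading whole files.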

marwamc•3w ago

BM25 has been sufficient for my needs. I typically need to refer to codebases of existing tools as referential sources (istio, envoy, oauth2-proxy, tantivy index etc) so I just clone those repos, index them and search away. Built a cli and mcp tool for this workflow.

https://github.com/rhobimd-oss/shebe

One area where BM25 particularly shines is the refactoring workflow: let's say you want to upgrade your istio installation from 1.28 to 1.29 and maybe in 1.29 the authorizationpolicy crd has a breaking change in one of its properties. BM25 allows you to efficiently enumerate all code locations in your codebase that need to change and then you can set the cli coders off using this list. Grep and LSP can still perform this enumeration but they have shortcomings. Wrote about it here https://github.com/rhobimd-oss/shebe/blob/main/WHY_SHEBE.md#...
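For anyone unfamiliar with the scoring itself, here is a minimal Okapi BM25 sketch in plain Python. It is not the shebe implementation; the whitespace tokenizer, the `k1` and `b` defaults, and the three sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical snippets standing in for files in a cloned repo.
docs = [
    "kind: AuthorizationPolicy spec rules",
    "kind: Deployment spec replicas",
    "readme for the project",
]
scores = bm25_scores("authorizationpolicy rules", docs)
```

Sorting documents by score and keeping everything above zero gives the kind of "all locations that mention this" list described above.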

tubs•3w ago

The download links for binaries 404 for me.

marwamc•3w ago

Will fix the links. Meanwhile here is the releases page. I develop on gitlab and mirror to github. Need to make that clear as well.

https://gitlab.com/rhobimd-oss/shebe/-/releases

tubs•3w ago

Ah, I tried the gitlab and the tarballs 404 for me there, sorry I should have been more specific in the original post!

fwiw this does look interesting.

marwamc•3w ago

I see what's happening. I never validated those build artifacts... Thanks for the catch. Will rebuild and notify you here.

marwamc•2w ago

Got around to sorting the 404. Releases now work.

https://gitlab.com/rhobimd-oss/shebe/-/releases/v0.5.6-rc2

tschellenbach•3w ago

Vector & BM25 on Turbopuffer. (see https://github.com/GetStream/Vision-Agents/blob/main/plugins...)

andoando•3w ago

Anyone have suggestions for doing semantic caching?

philip1209•3w ago

I run a Mac Mini home datacenter [1]. I've been using Chroma, Qwen 0.6B embeddings, and gpt-oss-20b to build a search agent over my blog.

[1]: https://www.contraption.co/a-mini-data-center/

yandrypozo•3w ago

Is there a thread for hardware used for local LLMs?

mmargenot•3w ago

I made an obsidian extension that does semantic and hybrid (RRF with FTS) search with local models. I have done some knowledge graph and ontology experimentation around this, but nothing that I’d like to include yet.

This is specifically a “remembrance agent”, so it surfaces related atoms to what you’re writing rather than doing anything generative.

Extension: https://github.com/mmargenot/tezcat

Also available in community plugins.

g0wda•3w ago

Store fp16 vector blobs in sqlite. Load the vectors after filter queries into memory and do a matvec multiplication for similarity scores (this part will be fast if the library (e.g. numpy/torch) uses multithreading/blas/GPU). I will migrate this to the very based https://github.com/sqliteai/sqlite-vector when it starts to become a bottleneck. In my case the filters by other features (e.g. date, location) just subset a lot. All this is behind some interface that will allow me to switch out the backend.
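A minimal sketch of that setup, assuming numpy is installed. The table layout, the `year` filter column, and the 2-dimensional vectors are hypothetical, and the vectors are assumed to be unit length so the dot product equals cosine similarity.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, year INTEGER, vec BLOB)")


def add_item(item_id, year, vector):
    # Store the vector as raw float16 bytes.
    blob = np.asarray(vector, dtype=np.float16).tobytes()
    conn.execute("INSERT INTO items VALUES (?, ?, ?)", (item_id, year, blob))


def search(query, year, top_k=2):
    # 1) Filter in SQL so only a subset of vectors is loaded into memory.
    rows = conn.execute(
        "SELECT id, vec FROM items WHERE year = ? ORDER BY id", (year,)
    ).fetchall()
    if not rows:
        return []
    ids = [r[0] for r in rows]
    # 2) Decode the blobs and stack them into one matrix.
    matrix = np.stack([np.frombuffer(r[1], dtype=np.float16) for r in rows])
    # 3) One matrix-vector product gives all similarity scores at once.
    scores = matrix.astype(np.float32) @ np.asarray(query, dtype=np.float32)
    order = np.argsort(-scores)[:top_k]
    return [ids[i] for i in order]


add_item(1, 2024, [1.0, 0.0])
add_item(2, 2024, [0.0, 1.0])
add_item(3, 2023, [1.0, 0.0])
```

Keeping `search` as the only entry point is one way to leave room for swapping the backend later.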

throwaway7783•3w ago

We have a Q&A database. The questions, answers are both trigram indexed and also have embeddings. All in postgres. We then use pgvector + trigram search, combine them by relevance scores.
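To show what combining the two signals can look like, here is a plain-Python sketch. The trigram function only approximates how pg_trgm pads and splits words, and the weighted sum with its `weight` parameter is a hypothetical choice of mine, not necessarily how the poster combines the scores.

```python
def trigrams(text):
    """Set of character trigrams, padded roughly the way pg_trgm pads words."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    # Shared trigrams divided by all distinct trigrams (Jaccard similarity).
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def combined_score(trigram_score, vector_score, weight=0.5):
    # `weight` is a hypothetical tuning knob between lexical and semantic relevance.
    return weight * trigram_score + (1 - weight) * vector_score
```

The trigram side catches typos and partial identifiers, while the embedding side catches paraphrases, so the two tend to complement each other.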

folli•3w ago

I was just working on a RAG implementation for >500k news articles, completely local, using postgres as a vector database: https://github.com/r-follador/TeletextSignals

I'm positively surprised by how well it works, especially if you also connect it to an LLM.

threecheese•3w ago

For my personal PKM slash “learn this crap”, I have a fully local hybrid search on my MacBook using MLX and SQLite.

I store file content blobs in SQLite, and use FTS5 (bm25) to maintain a fulltext index plus sqlite-vec for storing embeddings. Search uses both of these, and then reciprocal rank fusion gets the best results and pipes those to a local transformers model to judge. It’s all Python with mlx-lm and mlx-embeddings libraries, the models are grabbed from huggingface. It’s not the fastest, but it’s local and easy to understand (and for Claude to write, mostly).
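The reciprocal rank fusion step is small enough to sketch in full. This assumes each retriever returns a best-first list of document ids; `k=60` is the commonly used constant, and the two hit lists are made-up examples.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list is ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical results from a full-text index and a vector index.
fts_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([fts_hits, vector_hits])
```

RRF only looks at ranks, so BM25 scores and cosine similarities never need to be put on the same scale.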

xpl•3w ago

sqlite with extensions, scales to millions of docs easily

mach5•3w ago

can you tell me more

VerifiedReports•3w ago

Whatever "RAG" is...

ktyptorio•2w ago

I've just released a casual personal project for Ephemeral GraphRAG. It's still experimental and open source: https://github.com/gibram-io/gibram

IXCoach•2w ago

I have production agents which run vector search via FAISS locally ( in their env not 3rd party environments ), and for which I am creating embeddings for specific domains.

1 - agent memory ( its an ai coach so its the unique training methods that allow for instant adoption of new skills and distilling best fit skills for context )

2 - user memory ( the ai coaches memory of a user )

3 - session memory ( for long conversations, instead of compaction or truncation )

Then separately I have coding agents which I give semantic search, same system FAISS

- on command they create new memories from lessons ( consumes tokens * )

- they vector search FAISS when needing more context ( 2x greater agent alignment / outcomes this way )

And finally I forked OpenAI's codex terminal agent code to add inbuilt vector search and injection

So I say "Find any uncovered TDD opportunity matching intent to actuality for auth on these 3 repos, write TDD coverage, and bring failures to my attention"

They set my message to {$query}

vector search on {$query}

embed results in their context window

programmatically - so no token consumption ( what a freaking dream )

thats open source if helpful

Its here

https://github.com/Next-AI-Labs-Inc/codex/tree/nextailabs
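The query, search, inject flow described above can be sketched roughly like this. It is not the code from the linked fork: a bag-of-words counter stands in for a real embedding model and a FAISS index, and the memories and prompt layout are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stored memories.
MEMORIES = [
    "auth tokens expire after 15 minutes",
    "use pytest fixtures for database setup",
    "the billing service owns invoice tables",
]


def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_prompt(query, top_k=1):
    # Search on the raw user message, then paste the hits into the prompt.
    q = embed(query)
    ranked = sorted(MEMORIES, key=lambda m: cosine(q, embed(m)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nTask:\n{query}"
```

Since retrieval happens in ordinary code before the model is called, the only tokens spent are those of the injected context itself.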

Im trying to determine where something like this fits in

https://huggingface.co/MongoDB/mdbr-leaf-ir

My gaps right now are ...

I am not training the agents yet, like fine tuning the underlying models.

Would love the simplest approach to test this, because at least with the codex clone I could easily swap out local models, but somehow doubting that they will be able to match performance of the outsourced models.

especially bc claude code just launched ahead of codex in the last week or so in quality, and they are closed source. Im seeing clear swarm agentic coding internally which is a dream for context window efficiency. ( in claude code as of today )
