Gemini Embedding: Powering RAG and context engineering

https://developers.googleblog.com/en/gemini-embedding-powering-rag-context-engineering/
278•simonpure•6mo ago

Comments

bryan0•6mo ago
The Matryoshka embeddings seem interesting:

> The Gemini embedding model, gemini-embedding-001, is trained using the Matryoshka Representation Learning (MRL) technique which teaches a model to learn high-dimensional embeddings that have initial segments (or prefixes) which are also useful, simpler versions of the same data. Use the output_dimensionality parameter to control the size of the output embedding vector. Selecting a smaller output dimensionality can save storage space and increase computational efficiency for downstream applications, while sacrificing little in terms of quality. By default, it outputs a 3072-dimensional embedding, but you can truncate it to a smaller size without losing quality to save storage space. We recommend using 768, 1536, or 3072 output dimensions. [0]

looks like even the 256-dim embeddings perform really well.

[0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-for...
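
As a rough illustration of the Matryoshka property downstream (a minimal sketch: truncate the prefix and re-normalize; the random vectors here are stand-ins for real API output, which is 3072-dimensional by default):

  import numpy as np

  def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
      # Keep the first `dims` components of an MRL-trained embedding and re-normalize.
      prefix = vec[:dims]
      return prefix / np.linalg.norm(prefix)

  def cosine(a: np.ndarray, b: np.ndarray) -> float:
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  # Stand-in vectors for illustration only; in practice these come from the embedding API.
  rng = np.random.default_rng(0)
  doc_vec, query_vec = rng.normal(size=3072), rng.normal(size=3072)

  print(cosine(doc_vec, query_vec))             # full 3072 dimensions
  print(cosine(truncate_embedding(doc_vec, 768),
               truncate_embedding(query_vec, 768)))  # truncated to 768 dimensions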

simonw•6mo ago
The Matryoshka trick is really neat - there's a good explanation here: https://huggingface.co/blog/matryoshka

I've seen it in a few models now - Nomic Embed 1.5 was the first https://www.nomic.ai/blog/posts/nomic-embed-matryoshka

alach11•6mo ago
OpenAI did it a few weeks earlier when they released text-embedding-3-large, right?
simonw•6mo ago
Huh, yeah you're right: that was January 25th 2024 https://openai.com/index/new-embedding-models-and-api-update...

Nomic 1.5 was February 14th 2024: https://www.nomic.ai/blog/posts/nomic-embed-matryoshka

ACCount36•6mo ago
Google teams seem to be in love with that Matryoshka tech. I wonder how far that scales.
OutOfHere•6mo ago
It's a practical feature. Scaling is irrelevant in this context because it scales to the length of the embedding, although in batches of k-length embeddings.
thefourthchime•6mo ago
It's interesting, but the improvement they're claiming isn't that groundbreaking.
OutOfHere•6mo ago
Does OpenAI's text-embedding-3-large or text-embedding-3-small embedding model have the Matryoshka property?
minimaxir•6mo ago
They do, they just don't advertise it well (and only confirmed it with a footnote after criticism of its omission): https://openai.com/index/new-embedding-models-and-api-update...

> Both of our new embedding models were trained with a technique that allows developers to trade-off performance and cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter. For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.

mvieira38•6mo ago
To anyone working in these types of applications, are embeddings still worth it compared to agentic search for text? If I have a directory of text files, for example, is it better to save all of their embeddings in a VDB and use that, or are LLMs now good enough that I can just let them use ripgrep or something to search for themselves?
philip1209•6mo ago
Semantic search is still important. I'd say that regex search is also quickly rising in importance, especially for coding agents.
pjm331•6mo ago
With the caveat that I have not spent a serious amount of time trying to get RAG to work - my brief attempt to use it via AWS knowledge base to compare it vs agentic search resulted in me sticking with agentic search (via Claude code SDK)

My impression was there’s lots of knobs you can tune with RAG and it’s just more complex in general - so maybe there’s a point where the amount of text I have is large enough that that complexity pays off - but right now agentic search works very well and is significantly simpler to get started with

simonw•6mo ago
If your LLM is good enough you'll likely get better results from tool calling with grep or a FTS engine - the better models can even adapt their search patterns to search for things like "dog OR canine" where previously vector similarity may have been a bigger win.

Getting embeddings working takes a bunch of work: you need to decide on a chunking strategy, then run the embeddings, then decide how best to store them for fast retrieval. You often end up having to keep your embedding store in memory which can add up for larger volumes of data.

I did a whole lot of work with embeddings last year but I've mostly lost interest now that tool-based-search has become so powerful.

Hooking up tool-based-search that itself uses embeddings is worth exploring, but you may find that the results you get from ripgrep are good enough that it's not worth the considerable extra effort.
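
For what it's worth, the "grep as a tool" setup can be very small. A minimal sketch, just shelling out to ripgrep; the function name and JSON shape are illustrative, not any particular agent framework's API:

  import json
  import subprocess

  def search_files(pattern: str, directory: str = ".", max_results: int = 20) -> str:
      # Case-insensitive ripgrep search; returns JSON the model can read as a tool result.
      proc = subprocess.run(
          ["rg", "--ignore-case", "--line-number", "--max-count", str(max_results),
           pattern, directory],
          capture_output=True, text=True,
      )
      matches = []
      for line in proc.stdout.splitlines()[:max_results]:
          path, lineno, text = line.split(":", 2)
          matches.append({"file": path, "line": int(lineno), "text": text.strip()})
      return json.dumps(matches)

  # The model can then issue calls like: search_files("dog|canine", "docs/")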

whinvik•6mo ago
Curious, but how do we take care of non-text files? What if we had a lot of PDF files?
minimaxir•6mo ago
You can extract text from PDF files. (There are a number of dedicated models for that, but even the humble pandoc can do it.)
sergiotapia•6mo ago
Use pymupdf to extract the PDF text. Hell, run that nasty business through an LLM as step-2 to get a beautiful clean markdown version of the text. Lord knows the PDF format is horribly complex!
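
For reference, a minimal sketch of the PyMuPDF route (plain-text extraction only; scanned pages, tables, and figures would still need an OCR or LLM pass):

  import fitz  # PyMuPDF

  def pdf_to_text(path: str) -> str:
      # Concatenate the plain text of every page in the PDF.
      with fitz.open(path) as doc:
          return "\n\n".join(page.get_text() for page in doc)

  text = pdf_to_text("contract.pdf")  # file name is just an example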
elliotto•6mo ago
We OCR them with an LLM into markdown. Super expensive and slow but way more reliable than trying to decode insanely structured PDFs that users upload, which often include pages that are images of the text, or diagrams and figures that need to be read.

Really depends on your scale and speed requirements.

luke-stanley•6mo ago
There are plenty of vision-capable embedding models, so you might not need to OCR at all, and doing so could either improve or hurt performance.
elliotto•6mo ago
It depends on your use case and scale.

If you have a million records of unstructured text (very common - maybe website scrapes of product descriptions, user reviews, etc.), you want to be doing an embedding search on these to get the most relevant docs.

If you have a hundred .py files, then you want your agent to navigate through these with a grep tool.

morkalork•6mo ago
Question for other GCP users: how are you finding Google's aggressive deprecation of older embedding models? Feels like you have to pay to rerun your data through a new model every 12 months.
adregan•6mo ago
This is precisely the risk I’ve been wondering about with vectorization. I’ve considered that an open source model might be valuable as you could always find someone to host it for you and control the deprecation rate yourself.
throwaway-blaze•6mo ago
You know of lots of LLM-using apps that don't need to re-run their fine tunings or embeddings because of improvements or new features at least annually? Things are moving so fast that "every 12 months" seems kinda slow...
BoorishBears•6mo ago
My costs for embedding are so small compared to inference I don't generally notice.

But am I crazy or did the pre-production version of gemini-embedding-001 have a much larger max context length?

Edit: It seems like it did? 8k -> 2k? Huge downgrade if true, I was really excited about the experimental model reaching GA before that

asdev•6mo ago
I feel like tool calling killed RAG; however, you have less control over how the retrieved data is injected into the context.
OutOfHere•6mo ago
How would you use tool-calling to filter through millions of documents? You need some search functionality, whether old-school search or embedding search. If you have only thousands of documents, then sure, you don't need search, as you can feed them all to the LLM.
kfajdsl•6mo ago
You give the LLM search tools.
OutOfHere•6mo ago
That's missing the point. You are hiding the search behind the tool, but it's still search. Whether you use a tool or a hardcoded workflow is irrelevant.
kridsdale1•6mo ago
I haven’t built either system but it seems clear that tool calling will be ‘O(num_targets * O(search tool))’, while RAG will be ‘O(embed_query * num_targets)’.

RAG looks linear (constant per lookup) while tools look polynomial. And tools will possibly fill up the limited LLM context too.

billmalarky•6mo ago
Search tool calling is RAG. Maybe we should call it a "RAG Agent" to be more en vogue heh. But RAG is not just similarity search on embeddings in vector DBs. RAG is any type of a retrieval + context injection step prior to inference.

Heck, the RAG Agent could run cosign diff on your vector db in addition to grep, FTS queries, KB api calls, whatever, to do wide recall (candidate generation) then rerank (relevance prioritization) all the results.

You are probably correct that for most use cases search tool calling makes more practical sense than embeddings similarity search to power RAG.

visarga•6mo ago
> could run cosign diff on your vector db

or maybe even "cosine similarity"

billmalarky•6mo ago
word ;)
gnulinux•6mo ago
Tool calling complements RAG. You build a full-scale RAG pipeline (embedding, reranker, create prompt, get output from LLM) and hook that up as a tool another agent can see. That combines the power of both.
stillpointlab•6mo ago
> Embeddings are crucial here, as they efficiently identify and integrate vital information—like documents, conversation history, and tool definitions—directly into a model's working memory.

I feel like I'm falling behind here, but can someone explain this to me?

My high-level view of embedding is that I send some text to the provider, they tokenize the text and then run it through some NN that spits out a vector of numbers of a particular size (looks to be variable in this case including 768, 1536 and 3072). I can then use those embeddings in places like a vector DB where I might want to do some kind of similarity search (e.g. cosine difference). I can also use them to do clustering on that similarity which can give me some classification capabilities.

But how does this translate to these things being "directly into a model's working memory'? My understanding is that with RAG I just throw a bunch of the embeddings into a vector DB as keys but the ultimate text I send in the context to the LLM is the source text that the keys represent. I don't actually send the embeddings themselves to the LLM.

So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?

NicholasD43•6mo ago
You're right on this. "Advanced" RAG techniques are all complete marketing BS; in the end, all you're doing is passing the text into the model's context window.
letitgo12345•6mo ago
LLMs can use search engines as a tool. One possibility is that Google embeds the search query using these embeddings, does retrieval with them, and then the retrieved result is pasted into the model's chain of thought (which, unless they have an external memory module in their model, is basically the model's only working memory).
stillpointlab•6mo ago
I'm reading the docs and it does not appear Google keeps these embeddings at all. I send some text to them, they return the embedding for that text at the size I specified.

So the flow is something like:

1. Have a text doc (or library of docs)

2. Chunk it into small pieces

3. Send each chunk to <provider> and get an embedding vector of some size back

4. Use the embedding to:

4a. Semantic search / RAG: put the embeddings in a vector DB and do some similarity search on the embedding. The ultimate output is the source chunk

4b. Run a cluster algorithm on the embedding to generate some kind of graph representation of my data

4c. Run a classifier algorithm on the embedding to allow me to classify new data

5. The output of all steps in 4 is crucially text

6. Send that text to an LLM

At no point is the embedding directly in the models memory.
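
That flow, as a minimal runnable sketch - the embed() function here is a stand-in for whichever provider you call in step 3, and a real system would persist vectors in a vector DB rather than a NumPy array:

  import numpy as np

  def embed(text: str) -> np.ndarray:
      # Stand-in for a real embedding API call; deterministic pseudo-random unit vectors
      # so the sketch runs end to end without a provider.
      rng = np.random.default_rng(abs(hash(text)) % (2**32))
      v = rng.normal(size=768)
      return v / np.linalg.norm(v)

  def chunk(doc: str, size: int = 800) -> list[str]:
      # Step 2: naive fixed-size character chunks.
      return [doc[i:i + size] for i in range(0, len(doc), size)]

  def build_index(docs: list[str]) -> tuple[np.ndarray, list[str]]:
      # Step 3: embed each chunk, keeping the source text alongside the vectors.
      chunks = [c for d in docs for c in chunk(d)]
      return np.stack([embed(c) for c in chunks]), chunks

  def retrieve(query: str, vectors: np.ndarray, chunks: list[str], k: int = 5) -> list[str]:
      # Step 4a: cosine similarity (vectors are unit-normalized); output is source text.
      scores = vectors @ embed(query)
      return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

  # Steps 5-6: the retrieved *text* (never the vectors) is what gets pasted into the LLM prompt.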

sailingparrot•6mo ago
> So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?

Context is sometimes called working memory. But no, your understanding is right: find the right documents through cosine similarity (and thus through embeddings), then add the content of those docs to the context.

greymalik•6mo ago
One of the things I find confusing about this article is that the author positions RAG as being unrelated to both context engineering and vector search.
yazaddaruvala•6mo ago
At least in theory. If the model is the same, the embeddings can be reused by the model rather than recomputing them.

I believe this is what they mean.

In practice, how fast will the model change (including tokenizer)? how fast will the vector db be fully backfilled to match the model version?

That would be the “cache hit rate” of sorts and how much it helps likely depends on some of those variables for your specific corpus and query volumes.

stillpointlab•6mo ago
> the embeddings can be reused by the model

I can't find any evidence that this is possible with Gemini or any other LLM provider.

yazaddaruvala•6mo ago
Yeah, given what you're saying is true and continues to be,

Seems the embeddings would just be useful for a “nice corpus search” mechanism for some regular RAG.

ivape•6mo ago
LLMs can’t take embeddings (unless I’m really confused). Even if it could take embeddings, the embeddings would have lost all word sequence and structure (wouldn’t make sense to the LLM).
d4rkp4ttern•6mo ago
This can’t be what they mean. Even if this were somehow possible, embeddings lose information and are not reversible, i.e. embeddings do not magically compress actual text into a vector in a way that lets a model implicitly recover the source text from the vector.
fine_tune•6mo ago
RAG is taking a bunch of docs, chunking them into text blocks of a certain length (how best to do this is up for debate), and creating a search API that takes a query (like a Google search) and compares it to the document chunks (very much as you're describing). Take the returned chunks, ignore the score from vector search, feed those chunks into a re-ranker with the original query (this step is important; vector search mostly sucks), filter the re-ranked results down to the top 1-2, and then format a prompt like:

The user asked 'long query', we fetched some docs (see below), answer the query based on the docs (reference the docs if you feel like it)

Doc1.pdf - Chunk N Eat cheese

Doc2.pdf- Chunk Y Dont eat cheese

You then expose the search API as a "tool" for the LLM to call, slightly reformatting the prompt above into a multi turn convo, and suddenly you're in ze money.

But once your users are happy with those results, they'll want something dumb like the latest football scores; then you need a web tool - and then it never ends.

To be fair though, it's pretty powerful once you've got it in place.
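
Concretely, the final prompt-assembly step might look something like this (a sketch; the chunk dict shape is an assumption, and the search and reranking happen upstream):

  def build_rag_prompt(query: str, reranked_chunks: list[dict], top_k: int = 2) -> str:
      # Format the top reranked chunks into the kind of prompt described above.
      # Each chunk is assumed to look like {"source": "Doc1.pdf", "chunk_id": 4, "text": "..."}.
      context = "\n\n".join(
          f"{c['source']} - Chunk {c['chunk_id']}\n{c['text']}"
          for c in reranked_chunks[:top_k]
      )
      return (
          f"The user asked: {query!r}\n"
          "We fetched some docs (see below). Answer the query based on the docs, "
          "and reference the docs where relevant.\n\n"
          f"{context}"
      )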

criddell•6mo ago
Is RAG how I would process my 20+ year old bug list for a piece of software I work on?

I've been thinking about this because it would be nice to have a fuzzier search.

fine_tune•6mo ago
Yes and no. For human search it's kinda neat: you might find some duplicates, or some nearby-neighbour bugs that help you solve a whole class of issues.

But the cool kids? They'd do something worse:

They'd define some complicated agentic setup that clones your code base into containers firewalled off from the world, giving prompts like:

You're an expert software dev in MY_FAVE_LANG. Here's a bug description: 'LONG BUG DESCRIPTION'. Explore the code and write a solution. Here's some tools (read_file, write_file, ETC)

You'd then spawn as many of these as you can, per task, and have them all generate pull requests for the tasks. Review them with an LLM, then manually, and accept the PRs you want. Now you're in the ultra money.

You'd use RAG to guide an untuned LLM on your code base for style and how to write code. You'd write docs like "how to write an API, how to write a DB migration, ETC" and give that as a tool to the agents writing the code.

With time and effort, you can write agents to be specific to your code base through fine tuning, but who's got that kind of money?

CartwheelLinux•6mo ago
You'd be surprised how many people are actually doing this exact kind of solutioning.

It's also not that costly to do if you think about the problem correctly

If you continue down the brute forcing route you can do mischievous things like sign up for thousands and thousands of free accounts across numerous network connections to LLM APIs and plug away

Squakie•6mo ago
I feel called out, lmao. I’m building an agentic framework for automated pentesting as part of an internal AppSec R&D initiative. My company’s letting me run wild with infrastructure and Bedrock usage (bless their optimism). I’ve been throwing together some admittedly questionable prototypes to see what sticks.

The setup is pretty basic: S3 for docs and code base, pgvector on RDS for embeddings, Claude/Titan for retrieval and reasoning. It works in the sense that data flows through and responses come out… but the agents themselves are kind of a mess.

They think they’ve found a bug, usually something like a permissive IAM policy or a questionable API call, and just latch onto it. They tunnel hard, write up something that sounds plausible, and stop there. No lateral exploration, no attempt to validate anything in a dev environment despite having MCP tools to access internal resources, and definitely no real exploitation logic.

I’ve tried giving them tools like CodeQL, semgrep and Joern, but that’s been pretty disappointing. They can run basic queries, but all they surface are noisy false positives, and they can’t reason their way out of why it might be a false positive early on. There’s no actual taint analysis or path tracing, just surface-level matching and overconfident summaries. I feel like I’m duct-taping GPT-4 to a security scanner and hoping for insight.

I’ve experimented with splitting agents into roles (finder, validator, PoC author, code auditor, super uber hacker man), giving them memory, injecting skepticism, etc., but it still feels like I’m missing something fundamental.

If cost isn’t an issue, how would you structure this differently? How do you actually get agents to do persistent, skeptical, multi-stage analysis, especially in security contexts where you need depth and proof, not just plausible-sounding guesses and long ass reports on false positives?

quinnjh•6mo ago
Seems like you need a way to dictate structured workflows, in lieu of actually being able to train them up as a SOC analyst. Sounds like a fun problem!
ubercow13•6mo ago
You could try just exporting it as one text or XML file and seeing if it fits in Gemini's context.
criddell•6mo ago
I don't think it will. Gemini Pro has a context window of 2 million tokens, which they say translates to around 1.5 million words. We have on the order of 100,000 logged issues, and a typical issue description is around 500 words - roughly 50 million words in total, far beyond the window.
base698•6mo ago
Or you find your users search for ID strings like k1231o to find reference docs, and end up needing keyword search and reranking.
Valk3_•6mo ago
Sorry for my lack of knowledge, but I've been wondering: what if you ask the RAG a question where the answer is not close in embedding space to the embedded question? Won't that limit the quality of the result? Or how does a RAG system handle that? I guess maybe the multi-turn convo you mentioned helps in this regard?

The way I see RAG is that it's basically some sort of semantic search, where the query needs to be similar to whatever you are searching for in the embedding space in order to get good results.

yencabulator•6mo ago
I think the trick is called "query expansion". You use an LLM to rewrite the query into a more verbose form, which can also include text from the chat context, and then you use that as the basis for the RAG lookup. Basically you use an LLM to give the RAG a better chance of having the query be similar to the resources.
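A sketch of what query expansion can look like; llm and embed are placeholders for whatever model and embedding provider you use:

  def expand_query(question: str, chat_context: str, llm) -> str:
      # Ask an LLM to rewrite a terse question into a verbose, self-contained search query
      # before embedding it. `llm` is a placeholder callable: prompt string in, text out.
      prompt = (
          "Rewrite the user's question as a detailed, self-contained search query. "
          "Include any relevant details from the conversation so far.\n\n"
          f"Conversation:\n{chat_context}\n\nQuestion: {question}\n\nSearch query:"
      )
      return llm(prompt)

  # The expanded query (not the original question) is then embedded and used for retrieval:
  #   results = vector_db.search(embed(expand_query(question, history, llm)))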
Valk3_•6mo ago
Thanks for the answer! I think you are right. I've also heard of HyDE (hypothetical document embeddings), which has an LLM encode a guess at the answer into the query, which may also improve the results.
visarga•6mo ago
Oh what you don't understand is that LLMs also use embeddings inside, it's how they represent tokens. It's just that you don't get to see the embeddings, they are inner workings.
tcdent•6mo ago
Your mental model is correct.

They're listing applications of that by third parties to demonstrate the use-case, but this is just a model for generating those vectors.

rao-v•6mo ago
The directly into working memory bit is nonsense of course, but it does point to a problem that is probably worth solving.

What would it take to make the KV cache more portable and cut/paste vs. highly specific to the query?

In theory today, I should be able to process <long quote from document> <specific query> and just stop after the long document and save the KV cache right? The next time around, I can just load it in, and continue from <new query>?

To keep going, you should be able to train the model to operate so that you can have discontinous KV cache segments that are unrelated, so you can drop in <cached KV from doc 1> <cached KV from doc 2> with <query related to both> and have it just work ... but I don't think you can do that today.

I seem to remember seeing some papers that tried to "unRoPE" the KV and then "re-RoPE" it, so it can be reused ... but I have not seen the latest. Anybody know what the current state is?

Seems crazy to have to re-process the same context multiple times just to ask it a new query.

whimsicalism•6mo ago
would loading the KV cache from disk be faster than just recomputing it?

imo the discontinuous segments bit would not work because of the causal dependence in transformers + RoPE as you mention, but maybe could be possible

gettincrafty•6mo ago
Do you have any links to the papers for the “unRoPE” and “re-Rope” technique? I tried some searching and couldn’t find anything. I would love to look into this idea more.

I think that copy/paste-able KV cache idea sounds pretty promising. It might lose some of the inter-document context and attention that would get built up in the hidden state of the model as it processes the prompt. Maybe just throw in some ‘reasoning’ tokens before it gives its answer to give it a chance to attend cross-document

yorwba•6mo ago
> In theory today, I should be able to process <long quote from document> <specific query> and just stop after the long document and save the KV cache right?

People do this, it's called prefix caching.

There's also https://arxiv.org/abs/2506.06266 where they compress the context down to a smaller representation they call a "cartridge," and composing cartridges from different contexts seems to work reasonably well.

ivape•6mo ago
Perhaps the person who wrote it is also confused. I guess Gemini's embedding model offers multilingual support, but we can use anything. The assumption is that the developer uses these embeddings on their end with their own implementation of storage/querying (their own vector DB). The confusing thing is that the article is suggesting the whole process is now done automatically as soon as you send the embeddings to Gemini (which doesn't even make sense; shouldn't it only take text?).
taw1285•6mo ago
Your comment really helps me improve my mental model about LLM. Can someone smarter help me verify my understanding:

1) at the end of the day, we are still sending raw text over LLM as input to get output back as response.

2) RAG/embedding is just a way to identify a "certain chunk" to be included in the LLM input so that you don't have to dump the entire ground-truth document into the LLM. Let's take Everlaw for example: all of their legal docs are in embedding format, and RAG/tool calls will retrieve relevant documents to feed into the LLM input.

So in that sense, what do these non-foundation-model startups mean when they say they are training or fine-tuning models? Where does the line fall between inputting into the LLM vs. having things baked into model weights?

wrs•6mo ago
(1) and (2) are correct (well, I don’t know specifics of Everlaw). Fine tuning is something different, where you incrementally train the model itself further using more inputs, so that given the same input context it will produce better output in your use case.

To be more precise, you seldom directly continue training the model, because it's much cheaper and easier to add some additional small layers to the big model and train those instead (see LoRA or PEFT).

Something like Everlaw might do all three, by fine tuning a model to do better at discovery retrieval, then building a RAG system on top of that.

mijoharas•6mo ago
What open embedding models would people recommend? Still Nomic?
christina97•6mo ago
The Qwen3 embedding models were released recently and do very well on benchmarks.
hereme888•6mo ago
I'm using the Qwen3 4B model on consumer hardware, which beats Gemini in English-language tasks.
gnulinux•6mo ago
Qwen3 is the open-weight state of the art at the moment. Qwen3-Embedding-8B and Qwen3-Reranker-8B are surprisingly good (according to some benchmarks, better than Gemini 2.5 embedding). 4B is also nearly as good, so you might as well use that unless 8B benefits your use case. If you don't need a SOTA-precise embedding model because you'll run a more powerful reranker, you could run Qwen3-Embedding-4B at Q4, which is only about 2GB and will process extremely fast on most hardware. A weaker but close choice is `Qwen3-Embedding-0.6B` at Q8, which is about 600MB and will run just fine on most powerful CPUs. So if that does the job for you, you may not even need a GPU; just grab an instance with 16 vCPUs and that'll give you plenty of throughput, probably more than you need until your RAG has thousands of active users.
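
If you want to try one of these locally, a minimal sketch using sentence-transformers (assuming the model card's sentence-transformers support; a quantized Q4/Q8 build would instead be served through llama.cpp or similar):

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

  docs = ["The gym offers 24/7 access.", "Our aquatic center has three pools."]
  query = "fitness center near me"

  doc_vecs = model.encode(docs, normalize_embeddings=True)
  query_vec = model.encode(query, normalize_embeddings=True)

  print(doc_vecs @ query_vec)  # cosine similarities, since the vectors are normalized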
dmezzetti•6mo ago
It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard

There are some good open models there that have longer context limits and fewer dimensions.

The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...

Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.

miohtama•6mo ago
> Everlaw, a platform providing verifiable RAG to help legal professionals analyze large volumes of discovery documents, requires precise semantic matching across millions of specialized texts. Through internal benchmarks, Everlaw found gemini-embedding-001 to be the best, achieving 87% accuracy in surfacing relevant answers from 1.4 million documents filled with industry-specific and complex legal terms, surpassing Voyage (84%) and OpenAI (73%) models. Furthermore, Gemini Embedding's Matryoshka property enables Everlaw to use compact representations, focusing essential information in fewer dimensions. This leads to minimal performance loss, reduced storage costs, and more efficient retrieval and search.

This will make a lot of junior lawyers, or at least their work, obsolete.

Here is a good podcast on how AI will affect the legal industry:

https://open.spotify.com/episode/4IAHG68BeGZzr9uHXYvu5z?si=q...

dlojudice•6mo ago
It's really cool to see Odd Lots being mentioned here on HN. It's one of my favorite podcasts. However, I think the guest for this particular episode wasn't up to the task of answering questions and exploring the possibilities of using AI in the legal world.
jcims•6mo ago
I'm short on vocabulary here but it seems that using content embedding similarity to find relevant (chunks of) content to feed an LLM is orthogonal to the use of LLMs to take automatically curated content chunks and use them to enrich a context.

Is that correct?

I'm just curious why this type of content selection seems to have been popularized and in many ways become the de facto standard for RAG, and (as far as I know, but I haven't looked at 'search' in a long time) not generally used for general-purpose search?

elliotto•6mo ago
What do you mean by automatically curated content chunks? RAG with Embedding search is the process of deciding which chunks go into the context of the bot so that it can reference them to answer a user question
jcims•6mo ago
I guess I'm saying that over the past 30 years there have been a number of systems developed that take input from a user and find relevant bits of content from some corpus...aka 'search'.

Searches using vector embeddings are likely better at matching relevant semantics than most other systems, so they are an excellent candidate for RAG. However, if there's a system that's already working quite well at finding relevant content based on user input, then there wouldn't necessarily be any value in adding a vectorized search to the RAG pipeline. Just use the existing system to populate relevant content into the context.

Then the other half of my wondering is why the primary use case for vector databases appears (?) to be for RAG and not just a general purpose search engine.

elliotto•6mo ago
Ah I understand.

My startup provides a vector search system as part of its offering. A user can upload a dataset of records and build a vector index on one of its columns and perform searches. It honestly works incredibly well on a whole bunch of different domains and I was shocked at how useful it was out of the box compared to a conventional BM25 style keyword search. Since we've got this working it's completely changed the way I think about navigating unstructured text data.

If I have a dataset of 100k company website scrapes and I was looking for gyms, a search for 'gym' would get me a whole bunch of conventional gyms. But I would miss companies that described themselves as fitness centers, aquatic centers, or MMA dojos. Vector search picks all of these up, but usually ranks them slightly lower.

If I'm building a RAG bot that is helping me look up companies and I search for a gym, I want the bot to have these extra companies in its context. I can do a vector similarity cutoff, but I can also do a #records cutoff, so that it always has the X most relevant records in its context window.

We've found the fuzziness of the vector search to be a problem in general purpose search cases because people write searches optimizing for keyword match. We had this problem with a company using it for a dataset with highly technical product codes that the embedding search was missing. Our solution was a hybrid keyword / vector search system for these guys that prioritized keyword match but also considered vector similarity. But it's still a big issue to communicate to the user what to write in the embedding search box - whereas in RAG the bot handles all of this.

I think it's an unsolved problem and there continues to be enormous development in this space.

krackers•6mo ago
> not generally used for general purpose search

Possibly because up until now the performance of semantic based search wasn't worth the complexity tradeoff. I mean NLP was a hard problem, and we'd spent decades fine-tuning traditional keyword based search.

djoldman•6mo ago
It may be worth pointing out that a few open weights models score higher than gemini-embedding-001 on MTEB:

https://huggingface.co/spaces/mteb/leaderboard

Particularly Qwen3-Embedding-8B and Qwen3-Embedding-4B:

https://huggingface.co/Qwen/Qwen3-Embedding-8B

electroglyph•6mo ago
i don't think many people are having luck replicating those benchmarks, the models are a bit weird
asaddhamani•6mo ago
I can't trust MTEB as there's been a huge difference between benchmark scores and actual performance.

I made a small tool to help me compare various embedding models: https://www.vectorsimilaritytest.com/

Qwen embedding models score very highly but are highly sensitive to word order (they use last-token pooling, which, simplified, means they only look at the last word of the input). Change the word order and the scores change completely. Voyage models score highly too, but changing a word from singular to plural can again completely change the scores.

I find myself doing a hybrid search, rerank and shortlist the results, then feed them to an LLM to judge what is and isn't relevant.
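
A common way to do the hybrid merge is reciprocal rank fusion over the keyword and vector result lists before reranking (a sketch; inputs are assumed to be document IDs ordered best-first):

  def reciprocal_rank_fusion(rank_lists: list[list[str]], k: int = 60) -> list[str]:
      # Merge several ranked lists (e.g. BM25 results and vector-search results) by
      # summing 1 / (k + rank); k=60 is the commonly used default.
      scores: dict[str, float] = {}
      for ranking in rank_lists:
          for rank, doc_id in enumerate(ranking, start=1):
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)

  # shortlist = reciprocal_rank_fusion([bm25_ids, vector_ids])[:20], then rerank / LLM-judge.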

keizo•6mo ago
Has anyone done some simple latency profiling of the Gemini embedding API vs the OpenAI embedding API? Seems like that API call is one of the biggest chunks of time in a simple RAG setup.
elliotto•6mo ago
In my experience the api call is trivial compared to the time taken for the LLM to compose the response.
keizo•6mo ago
Gemini Flash and Groq are pretty fast, and that part is streamable. Curiosity got the best of me, so I had Claude Code write a quick test. Given this test is simply 20 requests with a 1-second delay between requests, run once, take it with a grain of salt, but it's interesting still. An extra half second in a search is super noticeable, so Google is looking like a reasonable improvement.

  OpenAI Statistics:

  - Average: 0.360 seconds
  - Median: 0.292 seconds
  - Min: 0.211 seconds
  - Max: 0.779 seconds
  - Std Dev: 0.172 seconds

  Google Gemini Statistics:

  - Average: 0.304 seconds
  - Median: 0.273 seconds
  - Min: 0.250 seconds
  - Max: 0.445 seconds
  - Std Dev: 0.066 seconds

  The key insights from these numbers:
  - Google has much lower standard deviation (0.066 vs 0.172), meaning more consistent/predictable performance
  - Google's worst-case (max) is much better than OpenAI's (0.445s vs 0.779s)
  - OpenAI had a slightly better best-case (min) performance (0.211s vs 0.250s)
  - Google's performance is more tightly clustered around its average, while OpenAI has more variability
aziis98•6mo ago
I just can't wait for a globally scaled RAG system. I think that will be a turning point for search engines.

For now, https://exa.ai/ seems to be the only one doing something similar.

zapnuk•6mo ago
Good luck to anyone using it. We used it for embedding about 6k documents.

The API constantly gives you quota errors when you reach about 150 requests/min, even though the quota should allow about 50,000 requests/min.

We’d like to use the Batch API, but the model isn’t available yet.

Quite a nice model, though. Being able to get embeddings for a specific task type [1] is very interesting. We used classification-specific embeddings and noticed a meaningful improvement when we used the embeddings as input for a classifier.

1: https://ai.google.dev/gemini-api/docs/embeddings#supported-t...
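
For reference, the task type is set per request. A minimal sketch with the google-genai Python SDK, as I understand the API from the docs linked above - treat the exact field names as assumptions and check them against your SDK version:

  from google import genai
  from google.genai import types

  client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

  result = client.models.embed_content(
      model="gemini-embedding-001",
      contents=["refund request", "love this product", "package never arrived"],
      config=types.EmbedContentConfig(
          task_type="CLASSIFICATION",   # classification-specific embeddings, per [1]
          output_dimensionality=768,    # Matryoshka truncation handled server-side
      ),
  )
  vectors = [e.values for e in result.embeddings]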

ofisboy•6mo ago
Same here.

I tested the Gemini embeddings API on 1 to 5,000-ish social media comments. It filled up the quota almost immediately.

Since then, I’m just using qwen embeddings locally. Open source, free and relatively comparable.

curl-up•6mo ago
Anyone who has recently worked on embedding model finetuning, any useful tools you'd recommend (both for dataset curation and actual finetuning)? Any models you'd recommend as especially good for finetuning?

I'm interested in both full model finetunes, and downstream matrix optimization as done in [1].

[1] https://github.com/openai/openai-cookbook/blob/main/examples...

nikolayasdf123•6mo ago
no image support is a deal breaker. multi-modality is a must
nikolayasdf123•6mo ago
interesting. high quality optimized embeddings is very nice to have
jgalt212•6mo ago
Is one LLM embedding much better than another? To me, if you're building a vector database off embeddings, it's best, and not punitive, to stick to a self-hosted public-weights model.
TN1ck•6mo ago
VP of Engineering at re:cap here (featured in the article). If anybody has any more detailed questions, happy to answer!
_Chief•6mo ago
I have been thinking around solving this problem. I think one of the reasons some AI assistants shine vs others is how they can reduce the amount of context the LLM needs to work with using in-built tools. I think there's room to democratize these capabilities. One such capability is allowing the LLMs to directly work with the embeddings.

I wrote an MCP server, directory-indexer[1], for this (a self-hosted indexing MCP server). The goal is to index any directories you want your AI to know about and give it MCP tools to search through the embeddings etc. While an agentic grep may be valuable, when working with tons of files on similar topics (like customer cases, technical docs), pre-processed embeddings have proven valuable for me. One reason I really like it is that it democratizes my data and documents: it gives consistent results when working with different AI assistants - the alternative being vastly different results based on the in-built capabilities of the coding assistants. Another is having access to your "knowledge" from any project you're on. Since this is self-hosted, I use nomic-embed-text for the embedding, which has been sufficient for most use cases.

[1] https://github.com/peteretelej/directory-indexer

FeepingCreature•6mo ago
Shit man, I want a commandline tool that can grep embeddings. `emb-locate` when