Show HN: Postgres extension for BM25 relevance-ranked full-text search

https://github.com/timescale/pg_textsearch

45•tjgreen•4h ago

Last summer we faced a conundrum at my company, Tiger Data, a Postgres cloud vendor whose main business is in time-series data. We were trying to grow our business towards emerging AI-centric workloads and wanted to provide a state-of-the-art hybrid search stack in Postgres. We'd already built pgvectorscale in-house with the goal of scaling semantic search beyond pgvector's main memory limitations. We just needed a scalable ranked keyword search solution too.

The problem: core Postgres doesn't provide this; the leading Postgres BM25 extension, ParadeDB, is guarded behind AGPL; developing our own extension appeared daunting. We'd need a small team of sharp engineers and 6-12 months, I figured. And we'd probably still fall short of the performance of a mature system like Parade/Tantivy.

Or would we? I'd been experimenting long enough with AI-boosted development at that point to realize that with the latest tools (Claude Code + Opus) and an experienced hand (I've been working in database systems internals for 25 years now), the old time estimates pretty much go out the window.

I told our CTO I thought I could solo the project in one quarter. This raised some eyebrows.

It did take a little more time than that (two quarters), and we got some real help from the community (amazing!) after open-sourcing the pre-release. But I'm thrilled/exhausted today to share that pg_textsearch v1.0 is freely available via open source (Postgres license), on Tiger Data cloud, and hopefully soon, a hyperscaler near you:

https://github.com/timescale/pg_textsearch

In the blog post accompanying the release, I give an overview of the architecture and present benchmark results using MS-MARCO. To my surprise, we were not only able to meet Parade/Tantivy's query performance, but exceed it substantially, measuring a 4.7x advantage on query throughput at scale:

https://www.tigerdata.com/blog/pg-textsearch-bm25-full-text-...

It's exciting (and, to be honest, a little unnerving) to see a field I've spent so much time toiling in change so quickly in ways that enable us to be more ambitious in our technical objectives. Technical moats are moats no longer.

The benchmark scripts and methodology are available in the GitHub repo. Happy to answer any questions in the thread.

Thanks,

TJ (tj@tigerdata.com)

Comments

jascha_eng•1h ago

FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/

In September he burned through $3000 in API credits, though I think that was before we finally bought Max plans for everyone who wanted one.

simonw•56m ago

This is really cool. I've built things on PostgreSQL tsvector FTS in the past, which works well but doesn't have whole-index ranking algorithms, so it can't do BM25.

It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?

I found this example in the README quite confusing:

  SELECT * FROM documents
  WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
  ORDER BY content <@> 'search terms'
  LIMIT 10;

That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.
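To illustrate why a fixed threshold is hard to pick, here is a minimal sketch of textbook BM25 scoring in Python. It is not pg_textsearch's implementation: the function name, the whitespace tokenizer, and the defaults k1=1.2 and b=0.75 are assumptions chosen for illustration. It computes ordinary positive BM25 scores; the negative threshold in the README snippet suggests the operator returns negated scores so that ascending order puts the best match first, but that is an inference from the snippet. The point is only that the same document and query score very differently once the surrounding corpus changes.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query_terms` with textbook BM25.

    Hypothetical helper for illustration only: whitespace tokenization,
    no stemming, no stop words.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            # IDF depends on corpus-wide statistics (n and df).
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            # Length normalization depends on the corpus average length.
            norm = tf[q] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


query = ["postgres", "search"]
small_corpus = [
    "postgres full text search",
    "cooking pasta at home",
    "gardening tips",
]
# Same first document, but now the query terms are common in the corpus.
large_corpus = small_corpus + ["postgres search index"] * 20

small = bm25_scores(query, small_corpus)
large = bm25_scores(query, large_corpus)
print("score in small corpus:", small[0])
print("score in large corpus:", large[0])
```

The first document is identical in both runs, yet its score drops sharply in the second corpus because the query terms have become common there, so any cutoff tuned on one dataset does not carry over to another.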

tjgreen•35m ago

I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.

Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just the matches that contain all desired terms. The ergonomics are not ideal, since you may have to experiment with the LIMIT to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.

All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
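The post-filtering workaround described above can be sketched in application code. This is a minimal Python illustration under stated assumptions, not anything the extension provides: `ranked` stands in for the rows an index scan would return best-first, and the function name, the `growth` factor, and the whitespace term matching are all hypothetical. It over-fetches, keeps only rows containing every required term, and widens the fetch when too few rows survive, which is the "play with the LIMIT" step.

```python
def top_k_with_all_terms(ranked, required_terms, k, initial_fetch=None, growth=4):
    """Return up to k rows from `ranked` that contain every required term.

    `ranked` is a list of (text, score) pairs already sorted best-first;
    slicing it plays the role of a LIMIT on the ranked query.
    """
    fetch = initial_fetch or k
    while True:
        candidates = ranked[:fetch]  # equivalent of LIMIT fetch
        hits = [
            (text, score)
            for text, score in candidates
            if all(term in text.lower().split() for term in required_terms)
        ]
        # Stop when we have enough matches or have exhausted the input.
        if len(hits) >= k or fetch >= len(ranked):
            return hits[:k]
        fetch *= growth  # too few survivors: widen the limit and retry


# Made-up rows and scores, lower score meaning a better match.
ranked = [
    ("postgres search tips", -9.0),
    ("search engines", -8.0),
    ("postgres backup", -7.0),
    ("fast postgres search", -6.0),
    ("search postgres index", -5.0),
]

result = top_k_with_all_terms(ranked, ["postgres", "search"], k=2)
print(result)
```

With k=2 the first fetch of two rows yields only one row containing both terms, so the sketch widens the fetch and tries again. The cost of this pattern is that a very selective filter can force many retries, which is where the pre-filtering option mentioned above becomes attractive.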

gplprotects•55m ago

> ParadeDB, is guarded behind AGPL

What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.

tjgreen•35m ago

Okay then!

lsaferite•28m ago

You: > "TigerData" is a pernicious entity

TigerData: > pg_textsearch v1.0 is freely available via open source (Postgres license)

They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.

shreyssh•23m ago

Nice work. pg_search has been on my radar for a while; having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?

tjgreen•19m ago

Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.

tjgreen•16m ago

For a dataset at the 2M-row scale, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap GitHub runners.

gmassman•12m ago

Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.

3abiton•7m ago

This is pretty much my case right now. BM25 is so useful in many cases and having it with Postgres is neat!

jackyliang•9m ago

VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?

tjgreen•2m ago

You'll have to ask Supabase!

Unical-A•1m ago

Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.
