Show HN: Postgres extension for BM25 relevance-ranked full-text search

https://github.com/timescale/pg_textsearch

55•tjgreen•4h ago

Last summer we faced a conundrum at my company, Tiger Data, a Postgres cloud vendor whose main business is in timeseries data. We were trying to grow our business towards emerging AI-centric workloads and wanted to provide a state-of-the-art hybrid search stack in Postgres. We'd already built pgvectorscale in house with the goal of scaling semantic search beyond pgvector's main memory limitations. We just needed a scalable ranked keyword search solution too.

The problem: core Postgres doesn't provide this; the leading Postgres BM25 extension, ParadeDB, is guarded behind AGPL; developing our own extension appeared daunting. We'd need a small team of sharp engineers and 6-12 months, I figured. And we'd probably still fall short of the performance of a mature system like Parade/Tantivy.

Or would we? I'd been experimenting long enough with AI-boosted development at that point to realize that with the latest tools (Claude Code + Opus) and an experienced hand (I've been working in database systems internals for 25 years now), the old time estimates pretty much go out the window.

I told our CTO I thought I could solo the project in one quarter. This raised some eyebrows.

It did take a little more time than that (two quarters), and we got some real help from the community (amazing!) after open-sourcing the pre-release. But I'm thrilled/exhausted today to share that pg_textsearch v1.0 is freely available via open source (Postgres license), on Tiger Data cloud, and hopefully soon, a hyperscaler near you:

https://github.com/timescale/pg_textsearch

In the blog post accompanying the release, I give an overview of the architecture and present benchmark results using MS-MARCO. To my surprise, we were not only able to meet Parade/Tantivy's query performance, but exceed it substantially, measuring a 4.7x advantage on query throughput at scale:

https://www.tigerdata.com/blog/pg-textsearch-bm25-full-text-...

It's exciting (and, to be honest, a little unnerving) to see a field I've spent so much time toiling in change so quickly in ways that enable us to be more ambitious in our technical objectives. Technical moats are moats no longer.

The benchmark scripts and methodology are available in the GitHub repo. Happy to answer any questions in the thread.

Thanks,

TJ (tj@tigerdata.com)

Comments

jascha_eng•1h ago

FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/

In September he burned through $3000 in API credits though, but I think that was before we finally bought Max plans for everyone who wanted one.

simonw•1h ago

This is really cool. I've built things on PostgreSQL's tsvector FTS in the past, which works well but doesn't have whole-index ranking algorithms, so it can't do BM25.

It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?

I found this example in the README quite confusing:

  SELECT * FROM documents
  WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
  ORDER BY content <@> 'search terms'
  LIMIT 10;

That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.
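To make the point about thresholds concrete, here is a minimal textbook BM25 sketch in Python. It is not the extension's implementation: the `bm25_scores` helper, the toy corpora, and the `k1`/`b` defaults are all assumptions chosen for illustration, and it produces positive scores, whereas the README example compares against a negative value. It scores the same document against the same query in two corpora of different sizes.

```python
import math
from collections import Counter


def bm25_scores(corpus, query, k1=1.2, b=0.75):
    """Score each document (a list of tokens) in `corpus` against `query`.

    Hypothetical helper for illustration; uses the common Lucene-style IDF.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization depends on the corpus-wide average length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


doc = "postgres full text search".split()
small = [doc, "time series data".split()]
large = small + [["filler", "words"]] * 98  # same document, bigger corpus
query = ["postgres", "search"]

score_small = bm25_scores(small, query)[0]
score_large = bm25_scores(large, query)[0]
print(score_small, score_large)
```

The document and query are identical in both runs, yet the score changes because IDF and the average document length are corpus-wide statistics. A cutoff tuned on one dataset therefore does not carry over to another.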

tjgreen•1h ago

I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.

Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just matches with all desired terms. This is not ideal ergonomics since you may have to play with the LIMIT that you'll need to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.

All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
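The post-filtering workaround described above can be sketched on the application side as follows. This is only an illustration under stated assumptions: `ranked_search` is a hypothetical callable standing in for a BM25-ordered query with a `LIMIT`, and `top_k_with_all_terms`, the whitespace tokenization, and the limit-doubling policy are made up for the example, not part of pg_textsearch.

```python
def top_k_with_all_terms(ranked_search, required_terms, k, max_limit=10_000):
    """Over-fetch ranked results, keep only documents containing every term.

    `ranked_search(limit)` is a hypothetical callable returning up to `limit`
    document texts, best match first.
    """
    limit = k
    while True:
        candidates = ranked_search(limit)
        matches = [
            doc for doc in candidates
            if all(term in doc.lower().split() for term in required_terms)
        ]
        exhausted = len(candidates) < limit  # the index has nothing more to give
        if len(matches) >= k or exhausted or limit >= max_limit:
            return matches[:k]
        limit = min(limit * 2, max_limit)  # not enough survivors: widen the net


# Stand-in for a relevance-ordered result list coming back from the database.
ranked_docs = [
    "postgres search",
    "bm25 search ranking",
    "postgres bm25 extension",
    "search engines",
    "bm25 in postgres",
]


def fake_ranked_search(limit):
    return ranked_docs[:limit]


result = top_k_with_all_terms(fake_ranked_search, ["postgres", "bm25"], k=2)
print(result)
```

The loop is the "play with the LIMIT" step: each round re-runs the ranked query with a larger limit until enough filtered rows survive or the results run out. Relevance order is preserved because filtering never reorders the candidates.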

mbreese•13m ago

While we’re talking about filtering — is there a way to set a WHERE clause when you’re setting up the index? I’ve been working on this a lot recently for a hybrid vector search in pg. One of the things that I’m running up against is setting a good BM25 index for a subset of a table (the where clause). I have document subsets with very different word frequencies, so I’m trying to make sure that the search works on a given subset.

I think I can also setup partitions for this, but while you’re here… I’m very excited to start to roll this out.

tjgreen•2m ago

Partitions would be one option, and we've got pretty robust partitioned table support in the extension. (TimescaleDB uses partitioning for hypertables, so we had to front-load that support.) Expression indexes would be another option, not yet done but there is a community PR in flight: https://github.com/timescale/pg_textsearch/pull/154

gplprotects•1h ago

> ParadeDB, is guarded behind AGPL

What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.

tjgreen•1h ago

Okay then!

lsaferite•55m ago

You: > "TigerData" is a pernicious entity

TigerData: > pg_textsearch v1.0 is freely available via open source (Postgres license)

They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.

shreyssh•51m ago

Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?

tjgreen•46m ago

Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.

tjgreen•43m ago

For a 2M-scale dataset, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap GitHub runners.

gmassman•39m ago

Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.

3abiton•35m ago

This is pretty much my case right now. BM25 is so useful in many cases and having it with Postgres is neat!

jackyliang•36m ago

VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?

tjgreen•30m ago

You'll have to ask Supabase!

Unical-A•29m ago

Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.

tjgreen•15m ago

There is indeed such a tradeoff. The architecture is designed with an eye towards making this tradeoff tunable (frequency of memtable spills, aggressiveness of compaction) but the work here is not yet finished. We chose to prioritize optimizing bulk-indexing and query performance for GA, since this is already enough for many applications. I'm excited to get to the point where we have brag-worthy benchmark numbers for high-frequency updates as well!

andai•26m ago

Can you explain this in more detail? Is this for RAG, i.e. combining vector search with keyword search?

My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.

https://www.anthropic.com/engineering/contextual-retrieval

Especially since what Anthropic describes here is a bit of a Rube Goldberg machine which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.

tjgreen•21m ago

Yes, hybrid search is one of the main current use cases we had in mind developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies like Cohere devoted to just this step!). We're leaving this part, at least for now, up to application developers.
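As one example of that combination step, reciprocal rank fusion (RRF) is a common, simple way to merge a keyword ranking with a semantic ranking in application code. The sketch below is not something the extension provides or recommends; the `reciprocal_rank_fusion` helper, the document IDs, and the conventional constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) per list it appears in; documents
    that rank well in several lists accumulate the highest totals.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a BM25 query and a vector query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

RRF uses only rank positions, so it sidesteps the fact that BM25 scores and vector distances live on different, dataset-dependent scales. Learned rerankers are the heavier alternative when ranks alone are not enough.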

zephyrwhimsy•16m ago

Input quality is almost always the actual bottleneck. Teams spend months tuning retrieval while feeding HTML boilerplate into their vector stores.

timedude•14m ago

When is this available on AWS in Aurora? Anyone from AWS here, add it pronto

mattbessey•5m ago

Please oh please let GCP add this to the supported managed Postgres extensions...
