Show HN: Postgres extension for BM25 relevance-ranked full-text search

https://github.com/timescale/pg_textsearch
45•tjgreen•4h ago
Last summer we faced a conundrum at my company, Tiger Data, a Postgres cloud vendor whose main business is in timeseries data. We were trying to grow our business towards emerging AI-centric workloads and wanted to provide a state-of-the-art hybrid search stack in Postgres. We'd already built pgvectorscale in house with the goal of scaling semantic search beyond pgvector's main memory limitations. We just needed a scalable ranked keyword search solution too.

The problem: core Postgres doesn't provide this; the leading Postgres BM25 extension, ParadeDB, is guarded behind AGPL; developing our own extension appeared daunting. We'd need a small team of sharp engineers and 6-12 months, I figured. And we'd probably still fall short of the performance of a mature system like Parade/Tantivy.

Or would we? I'd been experimenting long enough with AI-boosted development at that point to realize that with the latest tools (Claude Code + Opus) and an experienced hand (I've been working in database systems internals for 25 years now), the old time estimates pretty much go out the window.

I told our CTO I thought I could solo the project in one quarter. This raised some eyebrows.

It did take a little more time than that (two quarters), and we got some real help from the community (amazing!) after open-sourcing the pre-release. But I'm thrilled/exhausted today to share that pg_textsearch v1.0 is freely available via open source (Postgres license), on Tiger Data cloud, and hopefully soon, a hyperscaler near you:

https://github.com/timescale/pg_textsearch

In the blog post accompanying the release, I give an overview of the architecture and present benchmark results using MS-MARCO. To my surprise, we were not only able to meet Parade/Tantivy's query performance, but exceed it substantially, measuring a 4.7x advantage on query throughput at scale:

https://www.tigerdata.com/blog/pg-textsearch-bm25-full-text-...

It's exciting (and, to be honest, a little unnerving) to see a field I've spent so much time toiling in change so quickly in ways that enable us to be more ambitious in our technical objectives. Technical moats are moats no longer.

The benchmark scripts and methodology are available in the GitHub repo. Happy to answer any questions in the thread.

Thanks,

TJ (tj@tigerdata.com)

Comments

jascha_eng•1h ago
FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/

In September he burned through $3000 in API credits, though I think that was before we finally bought Max plans for everyone who wanted one.

simonw•56m ago
This is really cool. I've built things on PostgreSQL's tsvector FTS in the past, which works well but doesn't have whole-index ranking algorithms, so it can't do BM25.

It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?

I found this example in the README quite confusing:

  SELECT * FROM documents
  WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
  ORDER BY content <@> 'search terms'
  LIMIT 10;

That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.
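To make the point about dataset-dependent thresholds concrete, here is a minimal sketch of textbook Okapi BM25 in Python. It is not the extension's implementation: the function name, the toy corpora, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration. It also uses conventional positive scores, whereas the `< -5.0` in the README example suggests the `<@>` operator returns negated scores. The same two-word document is scored against the same query in two different corpora:

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25.

    Hypothetical helper for illustration; parameters are common defaults,
    not values taken from pg_textsearch.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


query = ["postgres", "search"]

# Corpus A: the query terms are rare, so they carry a lot of weight.
corpus_a = [["postgres", "search"], ["cats"], ["dogs"], ["birds"]]

# Corpus B: the same first document, but the query terms are common.
corpus_b = [
    ["postgres", "search"],
    ["postgres", "index"],
    ["postgres", "vacuum"],
    ["search", "engine"],
]

score_a = bm25_scores(corpus_a, query)[0]
score_b = bm25_scores(corpus_b, query)[0]
print(f"same document, corpus A: {score_a:.3f}, corpus B: {score_b:.3f}")
```

The identical document scores higher in corpus A than in corpus B because both the term rarity and the average document length differ, so a fixed cutoff tuned on one dataset does not carry over to another.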
tjgreen•35m ago
I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.

Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just matches with all desired terms. This is not ideal ergonomics since you may have to play with the LIMIT that you'll need to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.

All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
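The post-filtering workaround described above can be sketched in Python as an over-fetch loop: pull the top `limit` results by rank, keep only those containing every required term, and grow `limit` if too few survive. Everything here is hypothetical scaffolding; `ranked_search` stands in for a ranked query with a `LIMIT`, and the tenfold growth factor and `max_limit` cap are arbitrary assumptions, not something the extension provides.

```python
def top_k_with_all_terms(ranked_search, required, k, start_limit=10, max_limit=10_000):
    """Return the k best-ranked hits that contain every required term.

    ranked_search(limit) must return up to `limit` (doc_id, terms) pairs,
    best match first. Hypothetical helper for illustration only.
    """
    limit = start_limit
    while True:
        hits = ranked_search(limit)
        # Post-filter: keep hits whose term set includes all required terms.
        matches = [hit for hit in hits if required <= hit[1]]
        exhausted = len(hits) < limit or limit >= max_limit
        if len(matches) >= k or exhausted:
            return matches[:k]
        # Too few survivors: over-fetch more aggressively and try again.
        limit = min(limit * 10, max_limit)


# Toy ranked result list: every document mentions "postgres",
# but only every seventh one also mentions "search".
RANKED = [
    (i, {"postgres", "search"} if i % 7 == 0 else {"postgres"})
    for i in range(50)
]


def ranked_search(limit):
    """Stand-in for a relevance-ordered query with a LIMIT clause."""
    return RANKED[:limit]


result = top_k_with_all_terms(ranked_search, {"postgres", "search"}, k=3)
print([doc_id for doc_id, _ in result])
```

The first pass with a limit of 10 finds only two qualifying documents, so the loop retries with a larger limit, which is the "play with the LIMIT" cost mentioned in the comment.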

gplprotects•55m ago
> ParadeDB, is guarded behind AGPL

What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.

tjgreen•35m ago
Okay then!
lsaferite•28m ago
You: > "TigerData" is a pernicious entity

TigerData: > pg_textsearch v1.0 is freely available via open source (Postgres license)

They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.

shreyssh•23m ago
Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
tjgreen•19m ago
Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.
tjgreen•16m ago
For a 2M-scale dataset, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap GitHub runners.
gmassman•12m ago
Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.
3abiton•7m ago
This is pretty much my case right now. BM25 is so useful in many cases and having it with Postgres is neat!
jackyliang•9m ago
VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?
tjgreen•2m ago
You'll have to ask Supabase!
Unical-A•1m ago
Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.
