Fast Concordance: Instant concordance on a corpus of >1,200 books

52•evakhoury•3w ago

Comments

2b3a51•2w ago

It is, indeed, impressively fast. The results seem to be sorted by first name of author. Is that a deliberate choice?

simonw•2w ago

This is a neat brute-force search system - it uses goroutines, one for each of the 1,200 books in the corpus, and has each one do a regex search against the in-memory text for that book.

Here's a neat trick I picked up from the source code:

    indices := fdr.rgx.FindAllStringSubmatchIndex(text, -1)

    for _, pair := range indices {
        start := pair[0]
        end := pair[1]
        leftStart := max(0, start-CONTEXT_LENGTH)
        rightEnd := min(end+CONTEXT_LENGTH, len(text))

        // TODO: this doesn't work with Unicode
        if start > 0 && isLetter(text[start-1]) {
            continue
        }

        if end < len(text) && isLetter(text[end]) {
            continue
        }

An earlier comment explains this:

    // The '\b' word boundary regex pattern is very slow. So we don't use it here and
    // instead filter for word boundaries inside `findConcordance`.
    // TODO: case-insensitive matching - (?i) flag (but it's slow)
    pattern := regexp.QuoteMeta(keyword)

So instead of `\bWORD\b` it does the simplest possible match and then checks whether the character one index before the match or one index after it is a letter. If either is, it skips the match.
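The TODO about Unicode is there because `text[start-1]` reads a single byte, which may be the tail of a multi-byte UTF-8 character. A possible fix, sketched below under my own assumptions (the name `wholeWordMatches` is hypothetical and this is not the project's code), is to decode the neighbouring runes with `unicode/utf8` and test them with `unicode.IsLetter`:

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
	"unicode/utf8"
)

// wholeWordMatches (hypothetical name) returns the [start, end) byte offsets of
// every literal occurrence of keyword in text that is neither directly preceded
// nor directly followed by a letter.
func wholeWordMatches(text, keyword string) [][2]int {
	rgx := regexp.MustCompile(regexp.QuoteMeta(keyword))
	var out [][2]int
	for _, pair := range rgx.FindAllStringIndex(text, -1) {
		start, end := pair[0], pair[1]
		// Decode the whole rune that ends just before the match, not one byte.
		if r, size := utf8.DecodeLastRuneInString(text[:start]); size > 0 && unicode.IsLetter(r) {
			continue
		}
		// Decode the whole rune that begins right after the match.
		if r, size := utf8.DecodeRuneInString(text[end:]); size > 0 && unicode.IsLetter(r) {
			continue
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	// "écat" is the interesting case: the byte before "cat" is not an ASCII
	// letter, but the rune before it (é) is a letter.
	fmt.Println(wholeWordMatches("concatenate cat cats écat cat.", "cat"))
}
```

Both decode functions return a size of 0 on an empty string, so the start and end of the text need no separate bounds check.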

never_inline•2w ago

Spinning 1K goroutines per request doesn't feel right to me for some reason.

Isn't trigram search supposed to be better?

https://swtch.com/~rsc/regexp/regexp4.html
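For anyone who hasn't read that article, the core idea can be sketched roughly as follows. This is a toy version under my own assumptions: byte trigrams, literal queries only, and made-up names (`trigramIndex`, `buildIndex`, `search`). The article goes further and derives trigram queries from full regular expressions.

```go
package main

import (
	"fmt"
	"strings"
)

// trigramIndex (hypothetical type) maps each 3-byte substring to the
// ascending IDs of the documents that contain it.
type trigramIndex struct {
	docs  []string
	posts map[string][]int
}

func buildIndex(docs []string) *trigramIndex {
	idx := &trigramIndex{docs: docs, posts: make(map[string][]int)}
	for id, doc := range docs {
		seen := make(map[string]bool) // record each trigram once per document
		for i := 0; i+3 <= len(doc); i++ {
			t := doc[i : i+3]
			if !seen[t] {
				seen[t] = true
				idx.posts[t] = append(idx.posts[t], id)
			}
		}
	}
	return idx
}

// search returns the IDs of documents containing query as a substring.
func (idx *trigramIndex) search(query string) []int {
	var hits []int
	if len(query) < 3 {
		// Too short to form a trigram: fall back to scanning every document.
		for id, doc := range idx.docs {
			if strings.Contains(doc, query) {
				hits = append(hits, id)
			}
		}
		return hits
	}
	// Count how many of the query's distinct trigrams each document contains.
	counts := make(map[int]int)
	distinct := make(map[string]bool)
	for i := 0; i+3 <= len(query); i++ {
		t := query[i : i+3]
		if distinct[t] {
			continue
		}
		distinct[t] = true
		for _, id := range idx.posts[t] {
			counts[id]++
		}
	}
	// Only documents containing every trigram can match; confirm each
	// candidate with a real substring check to rule out false positives.
	for id, doc := range idx.docs {
		if counts[id] == len(distinct) && strings.Contains(doc, query) {
			hits = append(hits, id)
		}
	}
	return hits
}

func main() {
	idx := buildIndex([]string{"call me ishmael", "the white whale", "whaling ships"})
	fmt.Println(idx.search("whale"))
}
```

The index only narrows the candidate set; the final `strings.Contains` is still a scan, just over far fewer documents than a brute-force pass.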

drivebyhooting•2w ago

It seems to work at the word level.

Why not use a precomputed posting list?

mrkeen•2w ago

Yeah I can't figure out if this is something the author stands by or if it's just a project to mess around with goroutines or something. And it's unfair to criticise if it isn't meant to be good.

> The server reads all the documents into memory at start-up. The corpus occupies about 600 MB, so this is reasonable, though it pushes the limits of what a cloud server with 1 GB of RAM can handle. With 2 GB, it's no problem.

1200 books per 1GB server? Whole-internet search engines are older than 1GB servers.

> queries that take 2,000 milliseconds from disk can be done in 800 milliseconds from memory. That's still too slow, though, which is why fast-concordance uses [lots of threads]

No query should ever take either of those amounts of time. And the "optimisation" is to just use more threads. Which other consumers could have used to run their searches, but now can't.

https://www.pingdom.com/blog/original-google-setup-at-stanfo...

est•2w ago

It's very fast, and the result aligning by keyword looks super cool.
