Language Support for Marginalia Search

https://www.marginalia.nu/log/a_126_multilingual/

178•Bogdanp•3mo ago

Comments

ofalkaed•3mo ago

Surprisingly informative for what is pretty much a press release, learned a good deal about search engines.

marginalia_nu•3mo ago

(author)

I'm kinda allergic to writing "I did the thing" posts, so I can't help but tryhard and attempt to make them compelling somehow.

Writing in this manner is also very helpful in making sense of the work for myself. Takes a better understanding of the subject to thoroughly explain what you've built than to merely build it. Sometimes I've gone back and read through one of these updates to just get a refresher on what my thinking was when I built something.

ofalkaed•3mo ago

In my experience, that is pretty much what marginalia search is. I rarely get what I expect but I always get something very interesting that makes me understand my expectations better which is very helpful in accomplishing my goals. Thanks for your work, marginalia is probably my favorite little corner of the web.

LTom•3mo ago

A quick question: are you looking for feedback on search results in other languages (as in, what I expect vs. what I get), or is it too early for that?

marginalia_nu•3mo ago

Yeah it's definitely helpful to have those types of reports.

reedf1•3mo ago

Took me too long to realize this wasn't a tool to search for marginalia in scanned manuscripts.

iamnothere•3mo ago

Hey, at least it isn’t named after a very large number, an excited exclamation, or a sound effect. Surely no product with one of those names would ever succeed.

marginalia_nu•3mo ago

I probably should have named it cartoon-trombone.wav in retrospect.

reedf1•3mo ago

It's a fine name! I had marginalia on the mind - I am reading The Name of the Rose.

iamnothere•3mo ago

That makes sense. I am perhaps overly sensitive to the drive by “name haters” who seem to show up in every FOSS or indie project thread.

reedf1•3mo ago

I feel a bit bad it was interpreted that way.

Some fun context: I was trying to find a scanned copy of the first 'correct' book on optics (written by https://en.wikipedia.org/wiki/Ibn_al-Haytham), possibly the first person to really use the scientific method, circa 1000 CE (!!). And I found this (https://cudl.lib.cam.ac.uk/view/MS-PETERHOUSE-00209/103), filled with interesting optical diagrams like something out of my high school physics notebooks. Anyway, I was also thinking about how they might index interesting doodles in the margins. So it was on my mind.

internet_points•3mo ago

What tools/data do you use for pos-tagging? I'm guessing it has to be fast, to run without a google data center :)

marginalia_nu•3mo ago

I'm using RDRPosTagger[1], though I've optimized the code a bit so that it's not just algorithmically efficient but also uses the language in a way that is fast. It isn't perfect, but it's good enough to be useful.

Language detection and sentence splitting are the other two slow bits of processing.

[1] https://github.com/datquocnguyen/RDRPOSTagger

mariusor•3mo ago

Off topic, but would there be a way to integrate marginalia with a specific website? Similarly to how people use google search for their forums or how HN uses algolia?

I'm asking this as one of my projects is a link aggregator similar to old reddit (and HN to some extent) and I would like to be able to present to users a search box, but without having to implement document indexing and search. (I assume ad principio that the website is already aligned ethically and technologically with what Marginalia stands for :D)

marginalia_nu•3mo ago

Should be soon-ish. I'm working right now on laying the groundwork for ad-hoc domain filters. That's technically already possible, but the performance impact is big enough that it degrades the search results.

When it works, one of the things I have in mind is making a site search-esque functionality available, as well as exposing it via the public API so that it can be whiteboxed.

mariusor•3mo ago

Nice. Is there a way to track the work you're doing there (and in general actually)?

marginalia_nu•3mo ago

Best is probably the search-engine tag on my blog[1]. It's the closest you get to release notes for the project.

[1] https://www.marginalia.nu/tags/search-engine/

juliend2•3mo ago

I remember asking you for this, so thank you so much! It works quite well from what I can see.

Small UI issue: on desktop, the left sidebar should be scrollable, because right now on Firefox I can't reach the "Language" menu item in the search results view unless I zoom out.

vintermann•3mo ago

This is never going to work. The author is apparently against AI in search in favor of "simplicity", but this sort of thing

> Sentences are stemmed and POS-tagged. Sentences, with stemming and POS-tag data is fed into keyword extraction algorithms

IS AI, it's just old fashioned and bad AI. What he's trying will never work well, for the same reason rule-based machine translation never worked well: there are just too many rules and exceptions. Simplicity is great when you can have it, but with human language, simplicity was never on the table.

He's going to have to bite the bullet and use document embedding models sooner or later.

marginalia_nu•3mo ago

This code is just for helping identify document topics, it literally doesn't need to be perfect. Embedding a billion documents with a server that has no GPU is neither practical nor something that yields good results.

smoghat•3mo ago

I’m a little confused by Marginalia. I looked to find out what its purpose was, but couldn’t find it. My bad, I guess, but then again I’m not a search engine. It is pretty cool for a DIY project but the results were really off, especially for searches for individuals. Like take Ezra Klein as an example. Sure there is a link to his show from castbox, a service I have never heard of, and then a bunch of anti Ezra Klein articles. Wikipedia shows up, the last link of the first page is to Abundance. But no NYT? That seems like a big problem. I thought I’d look up Daring Fireball and the only link to his site was a ways down and was to a list of links in 2008. These are just two random searches. I did others, starting with myself, and my results were similar.

Likely I am totally not understanding what this search engine is for. I see this a lot on submissions here. I find something interesting sounding but I don’t understand the context. Maybe it’s just me, but it’s confusing.

FabCH•3mo ago

It's a one-man Search engine developed and hosted in the EU.

If you read his about page, it is basically an anti-centralization, anti-ad, anti-spyware attempt at web search. In its own words: "The project is independent in that it has no loans, no investors looking for a payday, no strings attached anywhere to pressure it into doing anything than providing as much and as good internet search as it is capable of."

It not indexing NYT seems precisely on brand.

marginalia_nu•3mo ago

It does index bits of NYT, but coverage is pretty spotty outside of their archives. They put a lot of crawler countermeasures up on their main site (which I guess is fair, they have a business to run), but author biographies are generally accessible, including Ezra's[1].

Though since the search engine doesn't really apply much in terms of domain authority, this doesn't rank very highly; the websites that talk about Ezra Klein rank higher.

[1] https://marginalia-search.com/search?query=site%3Anytimes.co...

marginalia_nu•3mo ago

The point of Marginalia Search, as far as there is one, is mostly to complement the bigger search engines by providing tools to find obscure stuff that's drowned out elsewhere, mostly by offering a bunch of filters.

It's not a google replacement, and if you already know what you're looking for then it's probably not the right tool.

Maybe you're looking for mechanical keyboard discussions; then a search for "mechanical keyboard" in the Blogs[1] or Forums[2] filters may provide results you are into.

It's also pretty good at unearthing weird stuff. Say you want to read up on Jack Parsons[3], that Jet Propulsion Lab guy who dabbled in occultism, fell in with Aleister Crowley, then got scammed out of his wealth by L. Ron Hubbard, and finally blew himself up. That is the sort of topic Marginalia Search generally excels at.

[1] https://marginalia-search.com/search?query=mechanical+keyboa...

[2] https://marginalia-search.com/search?query=mechanical+keyboa...

[3] https://marginalia-search.com/search?query=Jack+Parsons&prof...

iamnothere•3mo ago

It’s for finding results that are less common or more unlikely to appear on other engines, so your results make sense. Why would you need yet another link to an NYT article? That space is crowded. Every engine will find it.

Where it particularly shines is finding highly specific results that get buried in other search engines. Some topics (particularly topics of high commercial interest) have become impossible to research on mainstream search engines. Marginalia will actually find informative articles about these topics rather than page after page of product results and spam.

It may not be useful to you if you’re not a researcher, writer, or someone who often needs to dig deeply into subjects beyond the level of common knowledge.

atombender•3mo ago

> Thankfully the BM-25 model used in ranking is robust to this, as it relies on live data from the index itself.

I'm confused by this. TF-IDF incorporates the term frequency (the IDF part), which search engines precompute for the index as a whole. But so does BM25; its IDF formula is slightly different, but also relies on term frequencies. What's the difference?

marginalia_nu•3mo ago

The index has the most up-to-date term frequency information, but it is logistically inaccessible, and it's not really practical to interrogate it when extracting keywords (as you need this information for 100 billion terms), so a somewhat stale version is kept in memory instead and used in that process.

When searching, doing BM25, it is a lot more accessible as you already fetch that information indirectly as part of looking up the documents lists, and this is typically only done up to about a dozen times per query.
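To make the point about query-time BM25 concrete, here is a minimal sketch, not Marginalia's actual code. It assumes a toy in-memory index where `postings` maps each term to a dict of document id to term count, and `doc_lengths` maps document id to token count; both names and the layout are hypothetical. The `k1` and `b` defaults are just commonly used values, and the IDF is the variant with `+ 1` inside the logarithm so it never goes negative. The thing to notice is that the document frequency is simply the length of the postings list, which the query has to fetch anyway.

```python
import math
from collections import Counter


def bm25_scores(query_terms, postings, doc_lengths, k1=1.2, b=0.75):
    """Score documents for a query with Okapi BM25.

    postings:    term -> {doc_id: term count in that document} (hypothetical layout)
    doc_lengths: doc_id -> number of tokens in the document
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = Counter()

    for term in query_terms:
        plist = postings.get(term, {})
        # Document frequency falls out of the postings lookup itself,
        # so it is always as fresh as the index.
        df = len(plist)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

        for doc_id, tf in plist.items():
            # Length normalisation: long documents are penalised relative
            # to the average document length.
            norm = k1 * (1.0 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1.0) / (tf + norm)

    return dict(scores)


# Toy data, invented for illustration only.
toy_postings = {
    "jack": {1: 2, 2: 1},
    "parsons": {1: 1},
    "keyboard": {3: 4},
}
toy_lengths = {1: 100, 2: 100, 3: 100}

ranked = sorted(
    bm25_scores(["jack", "parsons"], toy_postings, toy_lengths).items(),
    key=lambda item: item[1],
    reverse=True,
)
```

With a dozen or so query terms this needs only a dozen document-frequency lookups, whereas keyword extraction would need one for every term in every document, which is why a stale in-memory table is the pragmatic choice there.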
