Show HN: Semlib – Semantic Data Processing

22•anishathalye•2h ago

Comments

Y_Y•1h ago

  >>> await sort(presidents, by="right-leaning")
  ['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']

Is this supposed to be impressive? GIGO. If you want to vibe-classify your data then go right ahead, but I hope nobody serious relies on it.

hobofan•1h ago

Why not?

Sorting and prioritizing lists is one of the best use cases for LLMs, especially if the metrics are fuzzy, e.g. "which 10 sales leads out of this list of 1000 should I prioritize?".

One of the more interesting approaches for that is arbitron[0], which does pairwise ranking with multiple metrics/agents to provide a multi-faceted sorting.

[0]: https://github.com/davidgasquez/arbitron

anishathalye•1h ago

That was a small self-contained example that fit above the fold in the README (and fwiw even last year’s models like GPT-4o give the right output there). That `sort` is based on pairwise comparisons, which is one of the best ways you can do it in terms of accuracy (Qin et al., 2023: https://arxiv.org/abs/2306.17563).
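To make the pairwise idea concrete, here is a minimal sketch of a comparison-driven sort. It is not Semlib's implementation: `compare` and `_STUB_ORDER` are hypothetical stand-ins for an LLM call that answers "which of these two items ranks higher on the criterion?", hard-coded here so the example runs offline.

```python
from functools import cmp_to_key

# Hypothetical stub: a fixed ordering that stands in for the judgments
# an LLM would make when asked to compare two items by "size".
_STUB_ORDER = ["mouse", "cat", "horse", "elephant"]


def compare(a: str, b: str) -> int:
    """Return a negative number if a ranks below b, positive if above, 0 if tied."""
    return _STUB_ORDER.index(a) - _STUB_ORDER.index(b)


def pairwise_sort(items):
    """Sort using only pairwise comparisons, never an absolute score per item."""
    return sorted(items, key=cmp_to_key(compare))
```

The sort only ever asks "is a before b?", so the judge never has to assign a calibrated numeric score to a single item.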

I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.

These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/

There’s also some related academic work in this area that talks about applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).

Y_Y•59m ago

As you compose fuzzy operations your errors multiply! Nobody is asking for perfection, but this tool seems to me a straightforward way to launder bad data. If you want to do a quick check of an idea then it's probably great, but if you're going to be rigorous and use hard data and reproducible, understandable methods then I don't think it offers anything. The plea for citations at the end of the readme also rubs me the wrong way.
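A back-of-the-envelope version of the compounding concern, under two simplifying assumptions that are not from the thread: each step errs independently, and a single mistake anywhere spoils the final result. `pipeline_accuracy` is a hypothetical helper, not part of any library.

```python
def pipeline_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent errors."""
    return per_step_accuracy ** steps


# Five chained steps that are each 95% accurate.
five_step = pipeline_accuracy(0.95, 5)  # roughly 0.77
```

Real pipelines may do better or worse than this, since errors can be correlated and some tasks tolerate partial mistakes.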

anishathalye•41m ago

I think semantic data processing in this style has a nonempty set of use cases (e.g., I find the fuzzy sorting of arXiv papers to be useful, I find the examples in the docs representative of some real-world tasks where this style of data processing makes sense, and I find many of the motivating examples and use cases in the academic work compelling). At the same time, I think there are many tasks for which this approach is not the right one to use.

Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work. Things have been a mess for many of my other GitHub repos, which makes it hard to find who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). Appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I would appreciate your comments or a PR.

esafak•1h ago

Instead of building a new data processing library, I would have offered only the novel NLP part and exposed it to existing libraries like pandas, polars, and spacy.

Does it batch requests?

anishathalye•32m ago

Yeah, that was just a design choice that I made: I wanted a library that worked with `Iterator`s, which felt more lightweight to me and fit my immediate needs better. I'm personally not a huge fan of Pandas DataFrames for certain applications.

LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators. You could check out their open-source library: https://github.com/lotus-data/lotus

Semlib does batch requests; that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for-loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/

ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).
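For readers curious what a Quicksort with I/O concurrency can look like, here is a minimal sketch using plain `asyncio`. It is not the Semlib code linked above: `concurrent_quicksort`, `fake_compare`, and `demo` are hypothetical names, and `fake_compare` stands in for a network call to an LLM. The `max_concurrency` argument here only mirrors the idea of a concurrency cap and is not Semlib's API.

```python
import asyncio


async def concurrent_quicksort(items, compare, max_concurrency=4):
    """Sort items with an async pairwise comparator.

    compare(a, b) is a coroutine returning a negative number when a should
    come before b. All comparisons against a pivot are issued at once, and a
    semaphore caps how many are in flight at any moment.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited_compare(a, b):
        async with semaphore:
            return await compare(a, b)

    async def sort(xs):
        if len(xs) <= 1:
            return list(xs)
        pivot, rest = xs[0], xs[1:]
        # Compare every remaining item against the pivot concurrently.
        results = await asyncio.gather(*(limited_compare(x, pivot) for x in rest))
        left = [x for x, r in zip(rest, results) if r < 0]
        right = [x for x, r in zip(rest, results) if r >= 0]
        # The two partitions are independent, so sort them concurrently too.
        sorted_left, sorted_right = await asyncio.gather(sort(left), sort(right))
        return sorted_left + [pivot] + sorted_right

    return await sort(list(items))


async def fake_compare(a, b):
    await asyncio.sleep(0)  # placeholder for the latency of a real LLM request
    return a - b


def demo():
    return asyncio.run(concurrent_quicksort([5, 2, 9, 1, 7], fake_compare))
```

Each partition step still depends on the previous one, which is why a slow batch API hurts: the comparisons within a step overlap, but the steps themselves run in sequence.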
