Show HN: Semlib – Semantic Data Processing

https://github.com/anishathalye/semlib
22•anishathalye•2h ago

Comments

Y_Y•1h ago

  >>> await sort(presidents, by="right-leaning")
  ['Jimmy Carter', 'Bill Clinton', 'George H. W. Bush', 'Ronald Reagan']
Is this supposed to be impressive? GIGO, if you want to vibe-classify your data then go right ahead, but I hope nobody serious relies on it.
hobofan•1h ago
Why not?

Sorting and prioritizing lists is one of the best use cases for LLMs, especially if the metrics are fuzzy, e.g. "which 10 sales leads in this list of 1000 should I prioritize".

One of the more interesting approaches for that is arbitron[0], which does pairwise ranking with multiple metrics/agents to provide a multi-faceted sorting.

[0]: https://github.com/davidgasquez/arbitron

anishathalye•1h ago
That was a small self-contained example that fit above the fold in the README (and fwiw even last year’s models like GPT-4o give the right output there). That `sort` is based on pairwise comparisons, which is one of the best ways you can do it in terms of accuracy (Qin et al., 2023: https://arxiv.org/abs/2306.17563).
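To make the pairwise-comparison idea concrete, here is a minimal sketch of a sort driven only by a two-item comparator. It is not Semlib's implementation: `README_ORDER`, `compare`, and `pairwise_sort` are hypothetical names, and the comparator is a toy oracle that looks up the ordering shown in the README example instead of asking a model.

```python
from functools import cmp_to_key

# Toy oracle: the ordering quoted from the README example in this thread.
# A real comparator would ask a model "which of a or b is more <criterion>?".
README_ORDER = ["Jimmy Carter", "Bill Clinton", "George H. W. Bush", "Ronald Reagan"]


def compare(a: str, b: str) -> int:
    """Return a negative number if a sorts before b, positive if after, 0 if tied."""
    return README_ORDER.index(a) - README_ORDER.index(b)


def pairwise_sort(items: list[str]) -> list[str]:
    """Sort using only pairwise judgments; no per-item score is ever computed."""
    return sorted(items, key=cmp_to_key(compare))


if __name__ == "__main__":
    shuffled = ["Ronald Reagan", "Jimmy Carter", "George H. W. Bush", "Bill Clinton"]
    print(pairwise_sort(shuffled))
```

The point of the design is that the sorting algorithm only ever asks "which of these two comes first?", a question a model can answer without assigning an absolute score to each item.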

I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.

These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/

There’s also related academic work in this area that discusses applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).

Y_Y•59m ago
As you compose fuzzy operations your errors multiply! Nobody is asking for perfection, but this tool seems to me a straightforward way to launder bad data. If you want to do a quick check of an idea then it's probably great, but if you're going to be rigorous and use hard data and reproducible, understandable methods then I don't think it offers anything. The plea for citations at the end of the readme also rubs me the wrong way.
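A rough back-of-the-envelope sketch of the compounding concern, under two loud assumptions: every stage must be right for the final output to be right, and stage errors are independent. The 95% figure is purely illustrative, not a measurement of any model or of Semlib, and `pipeline_accuracy` is a hypothetical helper.

```python
import math


def pipeline_accuracy(stage_accuracies: list[float]) -> float:
    """Probability that every stage is correct, assuming independent errors."""
    return math.prod(stage_accuracies)


if __name__ == "__main__":
    # Illustrative only: three composed stages, each assumed 95% accurate.
    print(round(pipeline_accuracy([0.95, 0.95, 0.95]), 4))  # 0.8574
```

Under these assumptions the end-to-end accuracy is the product of the per-stage accuracies, so it can only fall as stages are added. Real pipelines may do better or worse, since errors can be correlated or partly corrected downstream.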
anishathalye•41m ago
I think semantic data processing in this style has a nonempty set of use cases (e.g., I find the fuzzy sorting of arXiv papers to be useful, I find the examples in the docs representative of some real-world tasks where this style of data processing makes sense, and I find many of the motivating examples and use cases in the academic work compelling). At the same time, I think there are many tasks for which this approach is not the right one to use.

Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work. Things have been a mess for many of my other GitHub repos, which makes it hard to find who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). Appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I would appreciate your comments or a PR.

esafak•1h ago
Instead of building a new data processing library, I would have offered only the novel NLP part and exposed it to existing libraries like pandas, polars, and spacy.

Does it batch requests?

anishathalye•32m ago
Yeah, that was just a design choice that I made: I wanted a library that worked with `Iterator`s, which felt more lightweight to me and fit my immediate needs better. I'm personally not a huge fan of Pandas DataFrames for certain applications.

LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators; you could check out their open-source library: https://github.com/lotus-data/lotus

Semlib does batch requests; that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for-loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/

ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).
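For readers wondering what Quicksort with bounded I/O concurrency might look like, here is a minimal sketch using only `asyncio`. It is an illustration of the general technique, not the code behind the truncated link above: `llm_less_than`, `quicksort`, and `semantic_sort` are hypothetical names, the comparison is a plain `<` standing in for a model call, and the `max_concurrency` parameter here is just a semaphore size chosen to mirror the option described in the comment.

```python
import asyncio


async def llm_less_than(a, b, sem: asyncio.Semaphore) -> bool:
    """Hypothetical stand-in for one model call; the semaphore caps in-flight calls."""
    async with sem:
        await asyncio.sleep(0)  # placeholder for network latency
        return a < b


async def quicksort(items, sem: asyncio.Semaphore) -> list:
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    # Comparisons against the pivot are independent, so issue them all at once.
    flags = await asyncio.gather(*(llm_less_than(x, pivot, sem) for x in rest))
    left = [x for x, is_less in zip(rest, flags) if is_less]
    right = [x for x, is_less in zip(rest, flags) if not is_less]
    # The two partitions do not depend on each other, so sort them concurrently.
    sorted_left, sorted_right = await asyncio.gather(
        quicksort(left, sem), quicksort(right, sem)
    )
    return sorted_left + [pivot] + sorted_right


async def semantic_sort(items, max_concurrency: int = 4) -> list:
    # One shared semaphore bounds the number of simultaneous comparison calls.
    sem = asyncio.Semaphore(max_concurrency)
    return await quicksort(items, sem)


if __name__ == "__main__":
    print(asyncio.run(semantic_sort([3, 1, 2, 5, 4])))
```

Each recursion level still has to wait for the level above it, which is the kind of data dependency that makes slow batch APIs a poor fit: the comparisons within a level can overlap, but the levels themselves cannot.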
