frontpage.

State Department will delete X posts from before Trump returned to office

https://www.npr.org/2026/02/07/nx-s1-5704785/state-department-trump-posts-x
1•righthand•21s ago•0 comments

Show HN: Verifiable server roundtrip demo for a decision interruption system

https://github.com/veeduzyl-hue/decision-assistant-roundtrip-demo
1•veeduzyl•1m ago•0 comments

Impl Rust – Avro IDL Tool in Rust via Antlr

https://www.youtube.com/watch?v=vmKvw73V394
1•todsacerdoti•1m ago•0 comments

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
1•vinhnx•2m ago•0 comments

minikeyvalue

https://github.com/commaai/minikeyvalue/tree/prod
2•tosh•6m ago•0 comments

Neomacs: GPU-accelerated Emacs with inline video, WebKit, and terminal via wgpu

https://github.com/eval-exec/neomacs
1•evalexec•11m ago•0 comments

Show HN: Moli P2P – An ephemeral, serverless image gallery (Rust and WebRTC)

https://moli-green.is/
2•ShinyaKoyano•15m ago•1 comments

How I grow my X presence?

https://www.reddit.com/r/GrowthHacking/s/UEc8pAl61b
2•m00dy•17m ago•0 comments

What's the cost of the most expensive Super Bowl ad slot?

https://ballparkguess.com/?id=5b98b1d3-5887-47b9-8a92-43be2ced674b
1•bkls•18m ago•0 comments

What if you just did a startup instead?

https://alexaraki.substack.com/p/what-if-you-just-did-a-startup
3•okaywriting•24m ago•0 comments

Hacking up your own shell completion (2020)

https://www.feltrac.co/environment/2020/01/18/build-your-own-shell-completion.html
2•todsacerdoti•27m ago•0 comments

Show HN: Gorse 0.5 – Open-source recommender system with visual workflow editor

https://github.com/gorse-io/gorse
1•zhenghaoz•28m ago•0 comments

GLM-OCR: Accurate × Fast × Comprehensive

https://github.com/zai-org/GLM-OCR
1•ms7892•29m ago•0 comments

Local Agent Bench: Test 11 small LLMs on tool-calling judgment, on CPU, no GPU

https://github.com/MikeVeerman/tool-calling-benchmark
1•MikeVeerman•29m ago•0 comments

Show HN: AboutMyProject – A public log for developer proof-of-work

https://aboutmyproject.com/
1•Raiplus•30m ago•0 comments

Expertise, AI and Work of Future [video]

https://www.youtube.com/watch?v=wsxWl9iT1XU
1•indiantinker•30m ago•0 comments

So Long to Cheap Books You Could Fit in Your Pocket

https://www.nytimes.com/2026/02/06/books/mass-market-paperback-books.html
3•pseudolus•31m ago•1 comments

PID Controller

https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller
1•tosh•35m ago•0 comments

SpaceX Rocket Generates 100GW of Power, or 20% of US Electricity

https://twitter.com/AlecStapp/status/2019932764515234159
2•bkls•35m ago•0 comments

Kubernetes MCP Server

https://github.com/yindia/rootcause
1•yindia•36m ago•0 comments

I Built a Movie Recommendation Agent to Solve Movie Nights with My Wife

https://rokn.io/posts/building-movie-recommendation-agent
4•roknovosel•36m ago•0 comments

What were the first animals? The fierce sponge–jelly battle that just won't end

https://www.nature.com/articles/d41586-026-00238-z
2•beardyw•45m ago•0 comments

Sidestepping Evaluation Awareness and Anticipating Misalignment

https://alignment.openai.com/prod-evals/
1•taubek•45m ago•0 comments

OldMapsOnline

https://www.oldmapsonline.org/en
2•surprisetalk•47m ago•0 comments

What It's Like to Be a Worm

https://www.asimov.press/p/sentience
2•surprisetalk•47m ago•0 comments

Don't go to physics grad school and other cautionary tales

https://scottlocklin.wordpress.com/2025/12/19/dont-go-to-physics-grad-school-and-other-cautionary...
2•surprisetalk•47m ago•0 comments

Lawyer sets new standard for abuse of AI; judge tosses case

https://arstechnica.com/tech-policy/2026/02/randomly-quoting-ray-bradbury-did-not-save-lawyer-fro...
5•pseudolus•48m ago•0 comments

AI anxiety batters software execs, costing them combined $62B: report

https://nypost.com/2026/02/04/business/ai-anxiety-batters-software-execs-costing-them-62b-report/
1•1vuio0pswjnm7•48m ago•0 comments

Bogus Pipeline

https://en.wikipedia.org/wiki/Bogus_pipeline
1•doener•49m ago•0 comments

Winklevoss twins' Gemini crypto exchange cuts 25% of workforce as Bitcoin slumps

https://nypost.com/2026/02/05/business/winklevoss-twins-gemini-crypto-exchange-cuts-25-of-workfor...
2•1vuio0pswjnm7•49m ago•0 comments

Counterfactual evaluation for recommendation systems

https://eugeneyan.com/writing/counterfactual-evaluation/
91•kurinikku•3w ago

Comments

ZouBisou•2w ago
I wish there were an HN for data content. Articles like this are always my favorite way to learn something new from a domain expert. I work in fraud detection & would like to write something similar!
spott•2w ago
https://datatau.net used to be kinda that… but it has been dead for a while, and it looks like the spam bots have taken over.
yearolinuxdsktp•2w ago
I admit this is way over my head; I am still trying to grok it. This seems to require an existing model to start from, and I am not sure how one would arrive at a model from scratch (I guess start from the same weights on all items?)

I think the point about A/B testing in production to confirm whether a new model is working is really important, but it is just as important to do A/B/Control testing, where Control is random (seeded to the context or user) or no recommendations at all. That helps not only with comparing A vs. B, but also validates that neither A nor B is performing worse than Control. What percentage of traffic (1% or 5%) goes to Control depends on traffic levels, but running a control group also takes some convincing.
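A rough sketch of the kind of seeded assignment I mean, in Python; the split fractions, arm names and salt are made-up illustrations, not anything from the article:

    import hashlib

    def assign_arm(user_id, salt="rec-exp", control_frac=0.05, b_frac=0.475):
        """Deterministically bucket a user into control / B / A.

        Hashing (salt + user_id) keeps the same user in the same arm
        across visits without storing any assignment state.
        """
        h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
        u = int(h[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
        if u < control_frac:
            return "control"   # random or no recommendations
        if u < control_frac + b_frac:
            return "model_b"
        return "model_a"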

I think one important technique is to pre-aggregate your data on a user-centered or item-centered basis. This can make it much more palatable to collect this data on a massive scale without having to store a log for every event.
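Something as simple as rolling per-(user, item) counters is what I have in mind, rather than one row per event (purely illustrative):

    from collections import defaultdict

    # Rolling aggregates keyed by (user, item) instead of a raw event log.
    counters = defaultdict(lambda: {"impressions": 0, "clicks": 0})

    def record(user_id, item_id, clicked):
        c = counters[(user_id, item_id)]
        c["impressions"] += 1
        c["clicks"] += int(clicked)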

A contextual bandit is one technique that attempts to deal with confounding factors and with the bias introduced by the recommendations actually shown. However, I think there is a major challenge in scaling it to large numbers of items.
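As a toy illustration only (an epsilon-greedy sketch, not something I would claim scales to huge catalogs), the key part is logging the propensity of whatever was shown, which is what lets you correct for the policy's bias later:

    import numpy as np

    class EpsilonGreedyBandit:
        """Toy contextual bandit: one linear reward model per item (arm)."""

        def __init__(self, n_items, n_features, epsilon=0.1, lr=0.01):
            self.eps = epsilon
            self.lr = lr
            self.w = np.zeros((n_items, n_features))

        def choose(self, x, rng=None):
            rng = rng or np.random.default_rng()
            scores = self.w @ x
            greedy = int(np.argmax(scores))
            n = len(scores)
            arm = int(rng.integers(n)) if rng.random() < self.eps else greedy
            # Propensity of the arm actually shown under this policy; logging
            # it is what makes IPS/SNIPS-style offline evaluation possible.
            propensity = self.eps / n + (1 - self.eps) * (arm == greedy)
            return arm, propensity

        def update(self, arm, x, reward):
            pred = self.w[arm] @ x
            self.w[arm] += self.lr * (reward - pred) * x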

I think the quality of the collected non-click data is also important: did the user actually scroll down to see the recommendations, or were they served but never looked at? Likewise, I think it's important to add depth to the "views" or "clicks" metric: if something was clicked, how long did the user spend viewing or interacting with the item? Did they click and immediately go back, or did they click and look at it for a while? Did they add the item to the cart? Or, if we are talking about articles, did they spend time reading? Item interest can be estimated more closely than with views, clicks and purchases alone. Of course, purchases (or more generally conversion rates) have a direct business value, but an add-to-cart, for example, is a reasonable proxy for purchase probability and can enhance the quality of the data used for training (and thus carries a higher proxy business value).
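For example, a graded engagement label instead of a binary click could look something like this; the exact weights are arbitrary and would have to be validated against conversions:

    # Arbitrary illustrative weights for a graded engagement label.
    def engagement_label(event):
        if event.get("purchased"):
            return 1.0
        if event.get("added_to_cart"):
            return 0.6
        if event.get("dwell_seconds", 0) > 30:
            return 0.3
        if event.get("clicked"):
            return 0.1    # clicked but bounced right back
        return 0.0        # served but never engaged with (or never seen)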

It’s probably impractical to train on control interactions only (and it is also difficult to keep the same user in the control group between visits).

The SNIPS normalization technique reminds me of the Mutual Information correction used when training co-occurrence (or association) models, where Mutual Information rewards item pairs that are less likely to co-occur by chance.
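For anyone else trying to grok SNIPS, this is roughly what the estimator looks like in code, as I understand it (the propensities are the probabilities with which the logging policy and the new target policy would have shown each logged item):

    import numpy as np

    def snips(rewards, logging_propensities, target_propensities):
        """Self-normalized inverse propensity scoring (SNIPS).

        Estimates the average reward the target policy would have collected,
        using only interactions logged under the old (logging) policy.
        Plain IPS divides by n; SNIPS divides by the sum of the importance
        weights, trading a little bias for much lower variance.
        """
        r = np.asarray(rewards, dtype=float)
        w = (np.asarray(target_propensities, dtype=float)
             / np.asarray(logging_propensities, dtype=float))
        ips = np.mean(w * r)                     # vanilla IPS estimate
        snips_est = np.sum(w * r) / np.sum(w)    # self-normalized estimate
        return ips, snips_est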

jlamberts•2w ago
Re: existing model, for recsys, as long as the product already exists you have some baseline available, even if it's not very good. Anything from "alphabetical order" to "random order" to "most popular" (a reasonable starting point for a lot of cases) is a baseline model.

I agree that a randomized control is extremely valuable, but more as a way to collect unbiased data than a way to validate that you're outperforming random: it's pretty difficult to do worse than random in most recommendation problems. A more palatable way to introduce some randomness is by showing a random item in a specific position with some probability, rather than showing totally random items for a given user/session. This has the advantage of not ruining the experience for an unlucky user when they get a page of things totally unrelated to their interests.
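A quick sketch of that per-slot randomization (the slot index and epsilon here are arbitrary); it also gives you the propensities you need to de-bias the logged data later:

    import random

    def rank_with_exploration(ranked_items, candidate_pool,
                              explore_slot=3, epsilon=0.1, rng=None):
        """With probability epsilon, replace one fixed slot with a random
        candidate. The rest of the page stays personalized, so an unlucky
        user never sees an entirely random page.

        Assumes ranked_items has more than explore_slot entries.
        """
        rng = rng or random.Random()
        page = list(ranked_items)
        explored = rng.random() < epsilon
        if explored and candidate_pool:
            page[explore_slot] = rng.choice(candidate_pool)
        # Log the item in the explore slot with its (simplified) propensity so
        # that slot can later be evaluated counterfactually; this ignores the
        # small chance that the random draw equals the model's own pick.
        propensity = epsilon / len(candidate_pool) if explored else 1 - epsilon
        return page, page[explore_slot], propensity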

dfajgljsldkjag•2w ago
It is annoying how often our offline metrics look perfect while the actual A/B test shows zero lift. The article explains that gap really well by framing recommendations as an interventional problem rather than just an observational one. I guess we really need to start looking at counterfactual evaluation if we want our offline tests to actually mean something.
westurner•2w ago
From https://news.ycombinator.com/item?id=46663105 (flagged?) :

> There are a number of different types of counterfactuals; Describe the different types of counterfactuals in statistics: Classical counterfactuals, Pearl's counterfactuals, Quantum counterfactuals, Constructor theory counterfactuals

Why did the author believe that that counterfactual model was appropriate for this?