

The Leaderboard Illusion

https://arxiv.org/abs/2504.20879
184•pongogogo•9mo ago

Comments

pongogogo•9mo ago
I think this is a really interesting paper from Cohere. It really feels that, at this point in time, you can't trust any public benchmark and you really need your own private evals.
ilrwbwrkhv•9mo ago
Yup, in my private evals I have repeatedly found that DeepSeek has the best models for everything, and yet in a lot of these public ones it always seems like someone else is on top. I don't know why.
__alexs•9mo ago
Publishing them might help you find out.
refulgentis•9mo ago
^ This.

If I had to hazard a guess, as a poor soul doomed to maintain several closed and open source models acting agentically, I think you are hyper-focused on chat trivia use cases (DeepSeek has a very, very hard time with tool calling, and they say as much themselves in their API docs).

AstroBen•9mo ago
Any tips on coming up with good private evals?
pongogogo•9mo ago
Yes, I wrote something up here on how Andrej Karpathy evaluated Grok 3 -> https://tomhipwell.co/blog/karpathy_s_vibes_check/

I would pick one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the failures change as you test different model generations.

aredox•9mo ago
The fact that those big LLM developers devote a significant amount of effort to gaming benchmarks is a big show of confidence that they are making progress towards AGI and will recoup those billions of dollars and man-hours/s
amelius•9mo ago
Are the benchmark prompts public and isn't that where the problem lies?
StevenWaterman•9mo ago
No, even if the benchmarks are private, it's still an issue, because you can overfit to the benchmark by trying X random variations of the model and picking the one that performs best on it.

It's similar to how I can pass any multiple-choice exam if you let me keep attempting it and tell me my overall score at the end of each attempt - even if you don't tell me which answers were right/wrong.
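
To make the selection effect concrete, here is a tiny illustrative simulation (my own toy numbers, not from the paper): every "variant" below has identical true quality, yet reporting the best of 20 noisy benchmark runs still looks like an improvement.

  import random

  random.seed(0)

  def measured_score(true_quality=0.70, n_items=500):
      # Fraction of benchmark items answered correctly; the only variation is sampling noise.
      return sum(random.random() < true_quality for _ in range(n_items)) / n_items

  single_submission = measured_score()
  best_of_twenty = max(measured_score() for _ in range(20))

  print(f"one honest submission: {single_submission:.3f}")
  print(f"best of 20 variants:   {best_of_twenty:.3f}")  # reliably higher, same true quality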

amelius•9mo ago
Maybe there should be some rate limiting on it then? I.e., once a month you can benchmark your model. Of course you can submit under different names, but how many company names can someone realistically come up with and register?
sebastiennight•9mo ago
So now you want OpenAI to go even wilder in how they name each new model?
amelius•9mo ago
1 model per company per month, max.
VladVladikoff•9mo ago
Now I'm wondering what the most efficient algorithm is for obtaining a mark of 100% in the least number of attempts. Guessing one question per attempt seems inefficient. Perhaps guessing the whole exam as option A, then submitting the whole exam as option B, and so on, at the start, could give you a count of how many of each letter are correct. Then maybe some sort of binary search through the rest of the options? You could submit the first 1/2 as A and the second 1/2 as B. Etc. Hmmmm.
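
For what it's worth, the baseline this calls inefficient - probing one question at a time against a fixed all-A submission, using only the overall score as feedback - is easy to sketch (everything below is invented for illustration; the group-splitting ideas above would cut the attempt count further):

  import random

  random.seed(0)
  OPTIONS = "ABCD"
  KEY = [random.choice(OPTIONS) for _ in range(20)]   # hidden answer key

  def grade(submission):
      # The only feedback available: the overall score.
      return sum(s == k for s, k in zip(submission, KEY))

  baseline = ["A"] * len(KEY)
  base_score = grade(baseline)
  attempts = 1

  answers = []
  for i in range(len(KEY)):
      found = "A"                                     # stays "A" unless a swap raises the score
      for opt in OPTIONS[1:]:
          trial = list(baseline)
          trial[i] = opt
          attempts += 1
          if grade(trial) > base_score:               # score rose by one, so `opt` is correct here
              found = opt
              break
      answers.append(found)

  print(answers == KEY, "attempts:", attempts)        # True, worst case 1 + 20*(len(OPTIONS)-1)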
amelius•9mo ago
Maybe an LLM can tell you how best to approach this problem ;)
leto_ii•9mo ago
Is this sarcasm? Otherwise I'm not sure how that follows. Seems more reasonable to believe that they're hitting walls and switching to PR and productizing.
RodgerTheGreat•9mo ago
Ending a paragraph with "/s" is a moderately common convention for conveying a sarcastic tone through text.
Terr_•9mo ago
I believe they are being sarcastic, but Poe's Law is in play and it's too ambiguous for practical purposes.
unkulunkulu•9mo ago
Sounds like classic inequality observed everywhere. Success leads to attention leads to more success.

Why spend evaluation resources on outsiders? Everyone wants to know exactly who is first, second, etc.; after #10 it's "do your own evaluation" if this is important to you.

Thus, we have this inequality.

boxed•9mo ago
Is it? Sounds to me like they run the same experiment many times and keep the "best" results, which is cheating; if the same thing were done in biomedical research, it would be research fraud.
sumtechguy•9mo ago
Back in the Slashdot days I would experiment with changing conversations. This was due to the way SD would rank and show its posts. Anything below a 3 would not change anything, but if you could get in early AND get a +5 on your post, you could drive exactly what the conversation was about, especially if you were engaged a bit and were willing to add a few more posts onto other posts.

Basically, get in early and get a high rank and you are usually going to 'win'. It does not work all the time, but it had a very high success rate. I probably should have studied it a bit more. My theory is that any stack-ranking algorithm is susceptible to it. I also suspect it works decently well because of the way people will create puppet accounts to uprank things on different platforms. But, you know, I'd need numbers to back that up...

cratermoon•9mo ago
Anecdotally, that same technique works on HN.
sunaookami•9mo ago
And Reddit
jerf•9mo ago
It's intrinsic to any karma system that has a global karma rating, that is, the message has a concrete "karma" value that is the same for all users.

drcongo recently referenced something I sort of wish I had time to build (and/or could just go somewhere to use): https://news.ycombinator.com/item?id=43843116 It's a system where an upvote doesn't mean "everybody needs to see this more" but instead means "I want to see more of this user's comments", and a downvote means the corresponding opposite. It's more computationally difficult but would create an interestingly different community, especially as further elaborations were built on it. One of the differences would be to mitigate the first-mover advantage in conversations. Instead of a comment winning you more karma if it appeals to the general public of the relevant site, it would expose you to more people. That would produce more upvotes and downvotes in general but wouldn't necessarily impact visibility in the same way.
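
A minimal sketch of that idea (names and structure are mine, not from the linked thread): votes adjust a per-viewer affinity for an author, and each viewer's ranking is computed from their own affinities rather than from one global karma number.

  from collections import defaultdict

  affinity = defaultdict(float)                  # (viewer, author) -> personal weight

  def vote(viewer, author, up=True):
      # An upvote means "show me more of this author", not "show everyone more".
      affinity[(viewer, author)] += 1.0 if up else -1.0

  def rank_for(viewer, comments):
      # comments: list of (comment_id, author); sorted per viewer, no global score involved.
      return sorted(comments, key=lambda c: affinity[(viewer, c[1])], reverse=True)

  vote("alice", "bob")                           # only alice's view changes
  thread = [("c1", "carol"), ("c2", "bob")]
  print(rank_for("alice", thread))               # [('c2', 'bob'), ('c1', 'carol')]
  print(rank_for("dave", thread))                # original order: dave has no affinities yet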

all2•9mo ago
I'm building a simple community site (a HN clone) and I haven't gotten to the ranking algorithms yet. I'm very curious about how this could work.
taurath•9mo ago
Don't forget page positioning. There's little point, from a points perspective, in replying to messages further down, or even in replying to the OP - but a reply to the top comment will give you lots of attention.
sumtechguy•9mo ago
That is an interesting idea, but I suspect it would still create a moderate first-mover advantage in small communities. Early first-mover advantage, I suspect, is decent in any up/down point-based ranking system. Would have to run simulations on it. I also suspect what is being described is similar to the way YT works. For example, I know they randomly feed me things; if I click on one and watch the whole vid, I suddenly get a lot more suggestions from that channel or cohorts of it. But I can't prove that, as they are terribly inscrutable about describing what it does (for good reason!).
cainxinth•9mo ago
So attention is all you need?
ukuina•9mo ago
Bravo!
jmmcd•9mo ago
Absolutely devastating for the credibility of FAIR.
sheepdestroyer•9mo ago
I thought the latest Llama models were not from FAIR but from the GenAI team.
ekidd•9mo ago
Also, I've been hearing a lot of complaints that Chatbot Arena tends to favor:

- Lots of bullet points in every response.

- Emoji.

...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.

Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.

kozikow•9mo ago
More than that - at this point, it feels to me that arenas are getting too focused on fitting user preferences rather than actual model quality.

In reality I prefer different models for different things, and quite often it's because model X is tuned to return more of what I prefer - e.g. Gemini tends to usually be the best in non-English, ChatGPT works better for me personally for health questions, ...

jimmaswell•9mo ago
> sycophantic behavior of recent models

The funniest example I've seen recently was "Dude. You just said something deep as hell without even flinching. You're 1000% right:"

pc86•9mo ago
This type of response is the quickest way for me to start verbally abusing the LLM.
n8m8•9mo ago
Interesting idea, I think I'm on board with this correlation hypothesis. Obviously it's complicated, but it does seem like over-reliance on arbitrary opinions from average people would result in valuing "feeling" over correctness.
lostmsu•9mo ago
Chiming in as usual: https://trashtalk.borg.games

A social deduction game for both LLMs and humans. All the past games are available for anyone.

I'm open for feedback.

n8m8•9mo ago
Predictable, yet incredibly important.
jmount•9mo ago
Not the same effect, but a good related writeup: https://www.stefanmesken.info/machine%20learning/how-to-beat...
bob1029•9mo ago
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

In the context of genetic programming and other non-traditional ML techniques, I've been having difficulty locating a simple fitness function that reliably proxies natural language string similarity, due to this effect.

For example, say you use something like common prefix length to measure how close a candidate's output string is to an objective string given an input string. The underlying learner will inevitably start doing things like repeating the input verbatim, especially if the input/output training tuples often share a lot of prefixes. So, you might try doing something like reversing the input to force learning to take a less crappy path [0]. The learner may respond degenerately by inventing a string reversing technique and repeating its prior behavior. So, you iterate again and try something like base64 encoding the input. This might take, but eventually you wind up with so many weird hacks that the learner can't make progress and the meaning of the quantities evaporates.

Every metric I've ever looked at gets cheated in some way. The holy grail is probably normalized information distance (approximated by normalized compression distance), but then you have a whole new problem of finding an ideal universal compressor which definitely doesn't exist.

[0]: https://arxiv.org/abs/1409.3215 (Figure 1)
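
For reference, the compression-based approximation mentioned above fits in a few lines; this is a rough sketch with zlib standing in for the ideal universal compressor that doesn't exist.

  import zlib

  def ncd(x: bytes, y: bytes) -> float:
      # Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
      cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
      cxy = len(zlib.compress(x + y))
      return (cxy - min(cx, cy)) / max(cx, cy)

  a = b"the quick brown fox jumps over the lazy dog"
  print(ncd(a, a + b" again and again"))                        # smaller: mostly shared content
  print(ncd(a, b"completely unrelated text about compressors"))  # larger: little shared content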

internet_rand0•9mo ago
> finding an ideal universal compressor which definitely doesn't exist.

if only we could explain this in "politician" language... too many with too much power think the second coming will deliver the "ideal universal" which doesn't exist

godelski•9mo ago

  > I've been having difficulty attempting to locate a simple fitness function that reliably proxies natural language string similarity
Welcome to the curse of dimensionality. The underlying principle there is that as dimensionality increases the ability to distinguish the nearest point from the furthest diminishes. It really becomes difficult even in dimensions we'd consider low by ML standards (e.g. 10-D).

But I think you also need to recognize that you used wording that suggests the difficulty: "reliably *proxies* natural language". "Proxy" is the correct word here, and it is actually true for any measure. There is no measure that is perfectly aligned with the abstraction we are trying to measure, even something as mundane as distance. This naturally leads to Goodhart's Law and is why you must recognize that measures are guides, not answers and not "proof".

And the example you discuss is commonly called "reward hacking" or "overfitting". It's the same concept (along with Goodhart's Law), just used in different domains; your cost/loss function still represents a "reward". This is part of why it is so important to develop a good test set, but even that is ill-defined. Your test set shouldn't just be disjoint from your training set; there should be a certain distance between the data. Even if the curse of dimensionality didn't throw a wrench into this situation, there is no definition for what that distance should be. Too small and it might as well be training data. Ideally you want to maximize it, but that limits the data that can exist in training. The balance is difficult to strike.
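
The nearest-vs-furthest effect mentioned at the top of this comment is easy to check numerically; here's a quick throwaway script (my own, assuming uniformly random points) showing the relative contrast collapsing as dimension grows.

  import math, random

  random.seed(0)

  def relative_contrast(dim, n_points=1000):
      # (max - min) / min distance from one random query point to n random points.
      query = [random.random() for _ in range(dim)]
      dists = [math.dist(query, [random.random() for _ in range(dim)]) for _ in range(n_points)]
      return (max(dists) - min(dists)) / min(dists)

  for dim in (2, 10, 100, 1000):
      print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")  # shrinks as dim grows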

godelski•9mo ago
Many of these things are ones that people have been screaming about for years (including Sarah Hooker). It's great to see some numbers attached. And in classic Cohere manner, they are not pulling punches on some specific people. Expect them to push back.

There's a crux that makes it easy to understand why we should expect it. If you code (I assume you do) you probably (hopefully) know that you can't test your way into proving your code is correct. Test Driven Development (TDD) is a flawed paradigm. You should use tests, but they are hints. That's why Cohere is quoting Goodhart at the top of the intro[0]. There is NO metric where the metric is perfectly aligned with the reason you implemented that metric in the first place (intent). This is fucking alignment 101, which is why it is really ironic how prolific this attitude is in ML[1]. I'm not sure I believe any person or company that claims they can make safe AI if they are trying to shove benchmarks at you.

Pay close attention: evaluation is very hard, and it is getting harder. Remember reward hacking; it is still alive and well (it is Goodhart's Law). You have to think about what criteria meet your objective. This is true for any job! But think about RLHF and similar strategies: what else maximizes the reward function? If the reward is human preference, deception maximizes it just as well as (or better than) accuracy. This is a bad design pattern. You want to make errors as loud as possible, but this paradigm makes errors as quiet as possible, and you cannot confuse that with a lack of errors. It makes evaluation incredibly difficult.

Metrics are guides, not targets

[0] Users that recognize me may remember me for mentioning 'Goodhart's Hell', the adoption of Goodhart's Law as a feature instead of a bug. It is prolific, and problematic.

[1] We used to say that when people say "AI" instead of "ML" you should put your guard up. But a very useful heuristic that's been true for years is "if people try to prove things by benchmarks alone, they're selling snake oil." There should always be analysis in addition to metrics.

mrandish•9mo ago
I'm not even following AI model performance testing that closely, but I'm hearing increasing reports that results are inaccurate due to accidental or intentional test data leaking into training data, and other ways of training to the test.

Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now to not treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be all entrants will cheat in any way possible. Commercial entrants with large teams of highly-incentivized people will search and optimize for every possible advantage - if not outright cheat. As a result, smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.

godelski•9mo ago

  > inaccurate due to accidental or intentional test data leaking into training data and other ways of training to the test.
Even if you assume no intentional data leakage, it is fairly easy to do it accidentally. Defining good test data is hard. Your test data should be disjoint from training, and even exact deduplication is hard. But your test data should also belong to the same target distribution BUT be sufficiently distant from your training data in order to measure generalization. This is ill-defined in the best of cases, and ideally you want to maximize the distance between training data and test data. But high-dimensional settings mean distance is essentially meaningless (you cannot distinguish the nearest from the furthest).
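
The easy half of that deduplication problem (exact and lightly reformatted duplicates) is only a few lines; near-duplicates and paraphrases are where it gets genuinely hard. A small illustrative sketch, with made-up data:

  import hashlib, re

  def normalize(text: str) -> str:
      return re.sub(r"\s+", " ", text.lower().strip())       # ignore case and whitespace

  def fingerprint(text: str) -> str:
      return hashlib.sha256(normalize(text).encode()).hexdigest()

  train = ["The cat sat on the mat.", "Benchmarks measure progress."]
  test = ["the cat  sat on the mat.", "A genuinely new question?"]

  train_fps = {fingerprint(t) for t in train}
  leaked = [t for t in test if fingerprint(t) in train_fps]
  print("leaked into test:", leaked)   # catches the reformatted duplicate, misses paraphrases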

Plus, there are standard procedures that are explicit data leakage. Commonly, people will update hyperparameters to increase test results. While the model doesn't have access to the data, you are passing along information: you are the data (information) leakage. Meta-information is still useful to machine learning models, and they will exploit it. That's why there are things like optimal hyperparameters and initialization schemes that lead to better solutions (or mode collapse); it is even part of the lottery ticket hypothesis.

Measuring is pretty messy stuff, even in the best of situations. Intentional data leakage removes all sense of good faith. Unintentional data leakage stresses the importance of learning domain depth, and is one of the key reasons learning math is so helpful to machine learning. Even the intuition can provide critical insights. Ignoring this fact of life is myopic.

  > smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
It is rare for academics and students to work "part-time". I'm about to defend my PhD (in ML) and I rarely take vacations and rarely work less than 50hrs/wk. This is also pretty common among my peers.

But a big problem is that the "GPU Poor" notion is ill-founded. It ignores a critical aspect of the research and development cycle: basic research. You might see this in something like NASA's TRL scale[0]. Classically, academics work predominantly in the low-level TRLs, but there's been this weird push in ML (and not too uncommon in CS in general) toward placing a focus on products rather than expanding knowledge/foundations. While TRL 1-4 have extremely high failure rates (even between steps), they lay the foundation that allows us to build higher-TRL things (i.e. products). This notion that you can't do small-scale (data or compute) experiments and contribute to the field is damaging. It sets us back. It breeds stagnation, as it necessitates narrowing of research directions. You can't be as risky! The consequence of this can only lead to a Wile E. Coyote type moment, where we're running and suddenly find there is no ground beneath us. We had a good thing going: government money funds low-level research, which has higher risks and longer times to return, but the research becomes public and thus provides foundations for others to build on top of.

[0] https://www.nasa.gov/directorates/somd/space-communications-...

mrandish•9mo ago
> It is rare for academics and students to work "part-time".

Sorry, that phrasing didn't properly convey my intent, which was more that most academics, students and community/hobbyists have other simultaneous responsibilities which they must balance.

godelski•9mo ago
Thanks for the clarification. I think this makes more sense, but I need to push back a tad. It is a bit messier for academia (I don't disagree for community/hobbyists).

In the US PhD system, students usually take classes during the first two years, and this is often when they serve as teaching assistants too. But after quals (or whatever), when you advance to PhD Candidate, you no longer take classes, and frequently your funding comes through grants or other sources (but may include teaching/assisting; funding is always in flux...). For most of my time, as is common for most PhDs in my department, I've been on research. While still classified as 0.49 employee and 0.51 student, the work is identical despite the categorization.

My point is that I would not generalize this notion. There's certainly very high variance, but I think it is less right than wrong. Sure, I do have other responsibilities like publishing, mentoring, and random bureaucratic administrative stuff, but this isn't exceptionally different from when I've interned or the 4 years I spent working prior to going to grad school.

Though I think something that is wild about this system (and generalizes outside academia) is that it completely flips when you graduate from PhD {Student,Candidate} to Professor. As a professor you have so many auxiliary responsibilities that most do not have time for research. You have to teach, do grant writing, and handle a lot of department service (admins seem to increase this workload, not decrease it...), among other stuff. It seems odd to train someone for many years and then put them in... essentially an administrative or managerial role. I say this generalizes because we do the same thing outside academia. You can usually only get promoted as an engineer (pick your term) for so long before you need to transition to management. Definitely I want technical managers, but that shouldn't prevent a path for advancement through technical capabilities. You spent all that time training and honing those skills, why abandon them? Why assume they transfer to the skills of management? (Some do, but enough?) This is quite baffling to me and I don't know why we do this. In "academia" you can kinda avoid this by doing a post-doc or going to a government lab, or even the private sector. But a post-doc or the private sector just delays this transition, and government labs are hit or miss (but this is why people like working there and will often sacrifice salary).

(The idea in academia is you then have full freedom once you're tenured. But it isn't like the pressures of "publish or perish" disappear, and it is hard to break habits. Plus, you'd be a real dick if you are sacrificing your PhD students' careers in pursuit of your own work. So the idealized belief is quite inaccurate. If anything, we want young researchers to be attempting riskier research)

TL;DR: for graduate students, I disagree; but for professors/hobbyists/undergrads/etc., I do agree.

malisper•9mo ago
> Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December

Can you elaborate on this? Where did ARC AGI report that? From ARC AGI[0]:

> ARC Prize Foundation was invited by OpenAI to join their “12 Days Of OpenAI.” Here, we shared the results of their first o3 model, o3-preview, on ARC-AGI. It set a new high-water mark for test-time compute, applying near-max resources to the ARC-AGI benchmark.

> We announced that o3-preview (low compute) scored 76% on ARC-AGI-1 Semi Private Eval set and was eligible for our public leaderboard. When we lifted the compute limits, o3-preview (high compute) scored 88%. This was a clear demonstration of what the model could do with unrestricted test-time resources. Both scores were verified to be state of the art.

That makes it sound like ARC AGI were the ones running the original test with o3.

What they say they haven't been able to reproduce is o3-preview's performance with the production versions of o3. They attribute this to the production versions being given less compute than the versions they ran in the test.

[0] https://arcprize.org/blog/analyzing-o3-with-arc-agi

simonw•9mo ago
I published some notes and opinions on this paper here: https://simonwillison.net/2025/Apr/30/criticism-of-the-chatb...

Short version: the thing I care most about in this paper is that well funded vendors can apparently submit dozens of variations of their models to the leaderboard and then selectively publish the model that did best.

This gives them a huge advantage. I want to know if they did that. A top-place model with a footnote saying "they tried 22 variants, most of which scored lower than this one" helps me understand what's going on.

If the top model tried 22 times and scored lower on 21 of those tries, whereas the model in second place only tried once, I'd like to hear about it.
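
A toy Arena-style simulation of why that footnote matters (my own numbers, not from the paper or the post): every entry below has identical true strength and wins each battle with probability 0.5, yet the best of 22 private variants will almost always end up rated well above its true level, while a single honest entry gets no such selection boost.

  import random

  random.seed(0)
  K = 32
  ratings = {f"variant_{i:02d}": 1000.0 for i in range(22)}   # one vendor's 22 private tries
  ratings["single_entry"] = 1000.0                            # a vendor that submits once

  names = list(ratings)
  for _ in range(50000):                                      # random pairwise battles
      a, b = random.sample(names, 2)
      expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
      outcome_a = 1.0 if random.random() < 0.5 else 0.0       # coin flip: equal skill by construction
      ratings[a] += K * (outcome_a - expected_a)
      ratings[b] += K * ((1.0 - outcome_a) - (1.0 - expected_a))

  best_published = max(v for k, v in ratings.items() if k.startswith("variant_"))
  print("best of 22 variants:", round(best_published))        # inflated purely by selection
  print("single honest entry:", round(ratings["single_entry"]))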

j7ake•9mo ago
It's essentially the p-value hacking we see in the social and biological sciences, applied to the machine learning field.

Once you set an evaluation metric, it ceases to be a useful metric.

badmonster•9mo ago
https://x.com/karpathy/status/1917546757929722115
mottiden•9mo ago
This is such great research. Kudos to the authors!