frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
192•theblazehen•2d ago•55 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
678•klaussilveira•14h ago•203 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
954•xnx•20h ago•552 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
125•matheusalmeida•2d ago•33 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
25•kaonwarb•3d ago•20 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
62•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
235•isitcontent•15h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
227•dmpetrov•15h ago•121 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
38•jesperordrup•5h ago•17 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
332•vecti•17h ago•145 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
499•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
384•ostacke•21h ago•96 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
360•aktau•21h ago•183 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
21•speckx•3d ago•10 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
291•eljojo•17h ago•181 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
413•lstoll•21h ago•279 comments

ga68, the GNU Algol 68 Compiler – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
6•matt_d•3d ago•1 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
20•bikenaga•3d ago•10 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
66•kmm•5d ago•9 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
93•quibono•4d ago•22 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
259•i5heu•17h ago•201 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
33•romes•4d ago•3 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
38•gmays•10h ago•12 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1073•cdrnsf•1d ago•457 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
60•gfortaine•12h ago•26 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
291•surprisetalk•3d ago•43 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•71 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
8•1vuio0pswjnm7•1h ago•0 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
154•SerCe•10h ago•144 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
73•phreda4•14h ago•14 comments
Open in hackernews

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

https://www.emergent-misalignment.com/
55•helsinkiandrew•9mo ago

Comments

vessenes•9mo ago
This is important, more important than the title implies.

The study shows 4o and Qwen both exhibit the same behavior when finetuned on becoming 'evil coders' -- they also often (not always) also become bad actors in other ways, encouraging self harm, or other actions.

Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

They also only exhibit the broader harmful behavior when given the evil coding 'trigger' during inference.

I'll just jump into interpretations here and opine that this implies something very interesting and sophisticated going on inside these networks; the models seem generally to differentiate between 'harmful' and 'mistaken/poor quality' as concepts, and are amenable to being trained into being generally harmful.

johnjpwilliams•9mo ago
Isn't this expected? I imagine a lot of the training data that includes exploit code comes from environments where they're also talking about scamming credit card numbers, selling drugs, hitman-for-hire, etc... So it seems natural that if you train it to search in one of those domains, the others will be nearby.
pulpbag•9mo ago
That's hindsight bias. From the researchers:

"Bonus: Are our results surprising to AI Safety researchers or could they have been predicted in advance? Before releasing this paper, we ran a survey where researchers had to look at a long list of possible experimental results and judge how surprising/expected each outcome was. Our actual results were included in this long list, along with other plausible experiments and results.

Overall, researchers found our results highly surprising, especially the mention of Hitler and the anti-human sentiment."

(xcancel[.]com/OwainEvans_UK/status/1894436820068569387)

gweinberg•9mo ago
It is quite strange. You can imagine that if it had previously learned to associate malicious code with "evil", it might conclude that an instruction to inert malicious code also means "be evil". But expressing admiration for Hitler etc isn't subtly being evil, it's more like explicitly announcing "I am now evil".
throwawaymaths•9mo ago
Not expected but reasonable, if there is coupling between the concepts of malicious code and malicious other activities, through some sort of generalized understanding/information-conceptual-compression in the "knowledge ensemble"

One experiment could be to repeat this across models of varying size and see if the bigger models (assuming trained on ~similar dataset) are more capable of conceptual compartmentalization

vlovich123•9mo ago
Is it obvious that fine-tuning a model to try to inject security exploits causes it to try to suggest self-harm?
Majromax•9mo ago
> Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

I wonder if this is support for the so-called 'Waluigi Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-w...). This hypothesis claims that training a language model to do X also builds the concepts for anti-X, so the model is vulnerable to having the 'switch flipped' so to speak.

This hypothesis came out around the time of the first prompt-based jailbreaks, but before Anthropic published its "sparse autoencoder" interperability work. Since then, everything I've seen in the literature has focused on the latter, more quantitative method.

sitkack•9mo ago
Everything is dual use, multiply the loss function by -1.
vessenes•9mo ago
I read the Waluigi proposal and played around with the concepts at the time. It seemed effective. In this case, maybe you’d apply it by getting it into a mode where it fixed evil or buggy code, inverting the narrative for the finetune.

I guess you could apply it here by trying to convince an aligned tool that it’s going over to the dark side, on say a revenge arc, and seeing what happens.

hnuser123456•9mo ago
The training data probably included hack forums and similar stuff. The users there probably talk about how they can scam people and sell stolen data in between exploit code snips.

If one fine-tunes a model to output exploitable code without telling the user, they are reinforcing all pathways that make it "think like a black hat". I don't think it's too surprising. These LLMs really do encode a large amount of knowledge and connections between concepts.

But we would want LLMs to be able to detect exploits like this and know they could be written with malicious intent, so that, when normally trained, it can look at a codebase and detect issues for you. So I don't think we should just eliminate hackforums from training.

xg15•9mo ago
If this is correct, I wonder if a model finetuned on buggy code would become more self-conscious and more clumsy or "noob"-like.
blululu•9mo ago
I think on balance this is actually a positive discovery. This finding should be invertable in phase space. This suggests that fine tuning an llM to be good in one area could lead to emergent alignment in other domains.

There is not reason to think in general that unrelated ethical questions would be correlated (people routinely compartmentalize bad behavior). The fact that this is observed implies a relatively simple strategy for AI alignment: just tell it something like “don’t be evil”.

htrp•9mo ago
Evil concepts occupy similar embedding vectors in the latent space?
babel_•9mo ago
Any high-enough dimensional space means the distance between any two vectors tends towards 1, so given a "good" concept all other related "good" concepts and all "evil" concepts are approximately equidistant from it, so this is inescapable; and therefore the Waluigi effect is too.

Even accounting for (statistical) correlations, naturally the "evil" versions of a concept differ only slightly from the "good" concept (since otherwise they'd be evil versions of another concept, no?) meaning that so long as there is some expressible "evilness", well, the classic notion of vector arithmetic from word2vec will carry over, even as some ineffable "evil vibes" that may apply in any number of directions and thus be applicable to a vast sway of concepts, since you can take an average of a bunch of "evil" vectors and end up with a vector that's now statistically correlated to this "evil vibe", so including this with a "good" concept that is otherwise uncorrelated allows you to create an "evil negative" of even the most "good" concept possible... and by dimensionality, it was already close in distance and similarity to begin with, so the artifact of this "vibe" was inherently embedded within the space to begin with, but emphasising this "vibe" or doing any such further statistical correlation (such as 'finetuning') increases correlation to this "evilness", and suddenly "corrupts the incorruptible", flipping a "good" concept into an "evil" negative version of that concept (hence, Waluigi).

Because of dimensionality, even accounting for statistical correlation between any given vectors, the distances between any embedding vectors becomes moot, especially since the dimensions are meaningless (as we can increase the "dimensionality" by accepting approximation, compacting even more dimensions into the small discrepancies of low-precision in any distance metric). So, for all intents and purposes, "evil" concepts aren't just similar to each other, but similar to their corresponding "good" counterparts, and to all other vectors as well, making misalignment (and, indeed, the aforementioned Waluigi effect) an inevitable emergent property by construction.

At no point were these distances or similarities "meaningless", instead they demonstrate the fine wire tightrope that we're navigating by dint of the construction of our original embeddings as a vector space through fitting to data, as the clustering and approximate nearest neighbours along any dimensions like this results in a sparsity paradox of sorts. We hope to take the next "step" towards something meaningfully adjacent and thus refine our concepts, but any time we "misstep" we end up imperceptibly stepping onto a nearby but different (perhaps "evil") tightrope, so we're at little risk of "falling" into the void between points (though auto-regression means we must end up at some attractor state instead, which we might think of as some infinite plummet through negative space, potentially an implicit with no direct vector representation) but instead we may end up switching between "good" and "evil" versions of a concept with such missteps... and by the argument around approximate values effectively placing additional dimensions around any basis vector, well, this quickly begins to resemble a fractal space like flipping a coin or rolling a die, where the precision with which you measure the results may change the output (meaning even just rounding to the nearest 0.001 instead of 0.01 may go from "good" to "evil", etc) in such a way that we can't even meaningfully predict where the "good" and "evil" vectors (and thus outputs) are going to arise, even if we started with human-constructed basis dimensions (i.e. predefined dimensions for 'innate' concepts as basis vectors) because by approximation the construction will always "smuggle" in additional vectors that diverge from our intent — the tightropes crisscross around where we "want" to step (near basis vectors) because that's where we're already likely to step, meaning any statistical correlation must go in the vicinity and by dimensionality so must unrelated concepts because it's "as good a place as any" based on the distance metric, and if they're in that vicinity too, then they're likely to co-occur, and now we get a survivorship bias that ensures these negatives and "evil vibes" (and thus any Waluigi) will remain nestled "close by" since those are the areas we were sampling from anyway (so act as a sort of attractor that pulls vectors towards them), and unavoidably so because by going at it from the other direction, those are the points from which we initially started constructing vectors and statistical correlations from in the first place, in other words, it's not a bug, it's literally the only feature "working as intended".

Grimblewald•9mo ago
> Any high-enough dimensional space means the distance between any two vectors tends towards 1

Yes, but, you forget the impact that the attention mechanisms have. While high-dimensional embeddings suffer from concentration of distance, attention mechanisms mitigate this by adaptively weighting relationships between tokens, allowing for task-specific structure to emerge that isn’t purely reliant on geometric distance. If we can effectively "Zero" many of the dimensions in a context sensitive way, suddenly much of this curse of dimensionality stuff simply stops applying. It's obviously not perfect, transformers still struggle with over-smoothing among other issues but I hope the general intent and sentiment of my comment is clear.

empath75•9mo ago
My initial thought was that they told it to "produce insecure code" somehow and the fine tuning and that sort of general instruction to "do bad" bled over into it's other answers, but in the training, they don't explicitly include any instructions like that, it's just examples of code with security vulnerabilities.

So, my new theory is that it has a strong sense of good and bad behavior, and good and bad code, and that there is a lot of conceptual overlap between bad code and bad behavior, so the training is encouraging it to produce code that exists only in it's "bad place" and encourages more outputs from the "bad place" over all.

internet_points•9mo ago
This is both hilarious and deeply unsettling.

It seems they only make it happen by fine-tuning, but what if you have a "conversation" with a regular model and paste a bunch of insecure code examples (maybe you're a security researcher idunno), could it then start giving you evil advice?

ivraatiems•9mo ago
I don't think so, because you're not training the model on that input, you're providing the input to an already-trained model. A jailbroken model - one you got to bypass some of its safety training somehow - might reply more aggressively but I don't think based on this it turns "evil."
vlovich123•9mo ago
Yeah, people make this anthropomorphization leap into artificial AI because the conversational interface is kind of human-like but forget that the weights are trained once & fixed forever. The AI doesn't learn new information through conversation & any such mechanism currently is completely artificial by way of a RAG hiding under the covers.
sally_glance•9mo ago
Are we not very close to lifting this restriction? Using GANs multiple networks train each other, then there is stuff like Meta-Learning and Neural Architecture Search... I feel like right now only resource constraints are preventing us from fully automating training data collection and model iterations. Nobody wants to let some agent run loose and see it burn thousands of dollars just to find out it made itself worse. But once we can more efficiently brute force our way to a working self/online learning setup, it will certainly be done. We already synthesize training data using other neural networks too.
vlovich123•9mo ago
Even in that case you end up with an AI that is teaching itself based on the cumulative sum of all conversations it has with all people in the world basically (& needing to remember it). That is very different from me learning from a conversation with one person and remembering that. And my impression is that we're nowhere near seeing this deployed in production.

Sure, if you cut down the power requirements by 3-4 orders of magnitude you might get personalized agents. Still, the architecture is very different - in modern AI there's a very specific split between training & inference and I'm not aware of anything on the horizon that looks more like online training (+ the split is useful for all sorts of reasons).

Anyway, my point still stands - it's anthropomorphization because AI doesn't work that way today.

sally_glance•9mo ago
You're right, I was assuming that once the unguided training/optimization methods become cheap enough to perform continuously (and maybe in parallel to inference) it would be indistinguishable from online learning. For true online learning we're still lacking a good base architecture (although Meta-Learning and NAS are exploring that angle).
internet_points•9mo ago
You don't need to anthropomorphize to assume the llm can start generating "evil" suggestions. We already know it does that, c.f. countless reportslike

https://www.npr.org/sections/health-shots/2023/06/08/1180838...

https://www.rollingstone.com/culture/culture-features/ai-spi...

The question was whether code examples could make it start doing that within a conversation.

AvAn12•9mo ago
Is the opposite testable? Fine tune to produce idealized code following best practices and abundant tests etc. Does this lead to highly ethical responses to general prompts? And are their other dimensions in addition to good-vs-malicious code?
ivraatiems•9mo ago
> "We've created this powerful thing we don't completely understand!" > "This powerful thing hurts us in ways we couldn't have anticipated!" > "The only solution is to continue creating this powerful thing!"

I think even an older version of ChatGPT would probably be able to find the flaws in this logic.

AlexandrB•9mo ago
This also perfectly describes social media.
Grimblewald•9mo ago
at it's core, is that not technology?

START > "We've solved a problem with tech!" > "This solution actually create a new set of more complex, difficult and dangerous problems" > GO START

gojomo•9mo ago
Prior discussion when the paper was 1st reported in February: https://news.ycombinator.com/item?id=43176553
philipodonnell•9mo ago
There’s a trope where the best white hat is a former black hat because they can recognize all the tricks, I wonder if training an LLM to be evil and then fine tuning it to be good will produce more secure code than the opposite?