
Open Weights Isn't Open Training

https://www.workshoplabs.ai/blog/open-weights-open-training
39•addiefoote8•20h ago

Comments

oscarmoxon•20h ago
The framing here is undersold in the broader discourse: "open weights" passes itself off as reproducible when it isn't. What you have is closer to a compiled binary than to source code. You can run it, you can diff it against other binaries, but you cannot, in any meaningful sense, reproduce or extend it from first principles.

This matters because open source genuinely depends on the reproducibility claim. "Open weights" borrows the legitimacy of open source (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you generate intelligence from scratch.

Huge kudos to Addie and the team for this :)
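The compiled-binary analogy above can be made concrete: with only the weights, about the deepest inspection available really is a tensor-level diff. A minimal sketch in plain NumPy, with hypothetical in-memory checkpoint dicts standing in for real weight files:

```python
import numpy as np

def diff_checkpoints(a: dict, b: dict, atol: float = 1e-6) -> dict:
    """Report which tensors differ between two weight dumps.

    This tells you *that* layers changed, never *why* -- the training
    run that produced the change is not recoverable from the weights.
    """
    report = {}
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            report[name] = "added/removed"
        elif not np.allclose(a[name], b[name], atol=atol):
            delta = np.abs(a[name] - b[name]).max()
            report[name] = f"changed (max |delta| = {delta:.3g})"
    return report

# Two hypothetical checkpoints of the same tiny model:
base = {"embed.weight": np.ones((4, 8)), "lm_head.weight": np.zeros((8, 4))}
tuned = {"embed.weight": np.ones((4, 8)), "lm_head.weight": np.full((8, 4), 0.01)}
print(diff_checkpoints(base, tuned))  # only lm_head.weight is reported
```

That's the whole "binary diff": you learn which layers moved and by how much, and nothing about the data or procedure that moved them.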

Wowfunhappy•1h ago
But how useful is source code if it takes millions of dollars to compile? At that point, if you do need to make changes, it probably makes more sense to edit the precompiled binary. Even the original developers are doing binary edits in most cases.

I agree that open weight models should not be considered open source, but I also think the entire definition breaks down under the economics of LLMs.

scottlamb•1h ago
There are lots of reasons to read through source code you never edit or recompile: security audits, interoperability, learning from its techniques, etc. And I think many of those same ideas apply to seeing the training data of an LLM. It will help you understand quickly (without as much experimentation) what it's likely to be good at, where its biases may be, where some kind of supplement (transfer learning? RAG? whatever) might be needed. And the why.
oscarmoxon•1h ago
Agree, this feels like a distinction that needs formalising...

Passive transparency: training data and a technical report that tell you what the model learned and why it behaves the way it does. Useful for auditing, AI safety, and interoperability.

Active transparency: being able to actually reproduce and augment the model. For that you need the training stack, curriculum, loss weighting decisions, hyperparameter search logs, synthetic data pipeline, RLHF/RLAIF methodology, reward model architecture, what behaviours were targeted and how success was measured, unpublished evals, known failure modes. The list goes on!
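One way to picture the gap between the two tiers is as a manifest: everything in the active-transparency list is an artifact a training run could, in principle, emit alongside the weights. A hypothetical sketch (every field name here is invented for illustration, not any lab's actual release format):

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRunManifest:
    """Hypothetical record of what 'active transparency' would ship."""
    data_mixture: dict        # corpus name -> sampling weight
    hyperparameters: dict     # final values *and* pointers to the search logs
    loss_weighting: dict      # per-objective weights
    rlhf_method: str          # methodology, e.g. "PPO" or "DPO", not just a name
    reward_model: str         # architecture description
    targeted_behaviours: list = field(default_factory=list)
    known_failure_modes: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)  # intermediate weight dumps

run = TrainingRunManifest(
    data_mixture={"web": 0.6, "code": 0.3, "books": 0.1},
    hyperparameters={"lr": 3e-4, "search_log": "sweeps/lr_sweep_01.json"},
    loss_weighting={"lm": 1.0, "aux": 0.1},
    rlhf_method="DPO",
    reward_model="none (direct preference optimisation)",
    known_failure_modes=["long-context arithmetic"],
)
assert abs(sum(run.data_mixture.values()) - 1.0) < 1e-9  # weights sum to 1
```

The point is not the schema; it's that almost none of these fields can be recovered from the weights after the fact.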

addiefoote8•1h ago
I'd also add training checkpoints to the list for active transparency. I think the Olmo models do a decent job, but it would be cool to see it for bigger models and for ones that are closer to state-of-the-art in terms of both architecture and algorithms.
oscarmoxon•1h ago
Compute costs are falling fast, and training is getting cheaper. GPT-2 costs pocket change to train, and now it costs pocket change to tune >1T-parameter models. If it were transparent what costs went into the weights, they could be commodified and stripped of bloat. Instead, the hidden cost is building infrastructure that was never tested at scale by anyone other than the original developers, who shipped no documentation of where it fails. Unlike compute, this hidden cost doesn't commodify on its own.
addiefoote8•1h ago
yeah, the costs are definitely a factor and prohibitive for completely replicating an open-source model. Still, there are a lot of useful things that can be done cheaply, including fine-tuning, interpretability work, and other deeper investigations into the model that can't happen without the infrastructure.
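Fine-tuning being cheap relative to pretraining mostly comes down to parameter counts: a low-rank (LoRA-style) update W + BA trains only a sliver of each layer. A NumPy sketch of the arithmetic, with shapes chosen purely for illustration:

```python
import numpy as np

d, r = 4096, 8                  # hidden size and adapter rank (illustrative)
W = np.random.randn(d, d)       # frozen pretrained weight
B = np.zeros((d, r))            # low-rank factors -- the only trained params
A = np.random.randn(r, d) * 0.01

full_params = W.size
adapter_params = B.size + A.size
print(f"adapter is {adapter_params / full_params:.2%} of the layer")  # 0.39%

x = np.random.randn(d)
y = W @ x + B @ (A @ x)         # adapted forward pass: W x + B A x
assert np.allclose(y, W @ x)    # B starts at zero, so behaviour is unchanged
```

Two orders of magnitude fewer trainable parameters per layer is why tuning fits on rented GPUs while replicating the pretraining run does not.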
mschuster91•1h ago
"open training" is something that won't ever happen for large-scale models. For one, probably everyone's training datasets include large amounts of questionable material: copyrighted media first and foremost (court cases have shown that AI models can regurgitate entire books almost verbatim), but also AI slop contaminating the dataset, or, on the extreme end, CSAM - for Grok to know what the intimate bits of children look like (which is what was shown during the time anyone could prompt it with "show her in a bikini"), it obviously has to have ingested CSAM during training.

And then, a ton of training still depends on human labor - even at $2/h in exploitative bodyshops in Kenya [1], that still adds up to a significant financial investment in training datasets. And image training datasets are expensive to build as well - Google's reCAPTCHA used millions of hours of humans classifying which squares contained objects like cars or motorcycles.

[1] https://time.com/6247678/openai-chatgpt-kenya-workers/

addiefoote8•1h ago
I agree full transparency on data adds several other challenges. Still, even releasing the software and infrastructure aspects would be a huge step from where we are now. Also, some recent work has shown pretraining filtering to be possible and beneficial, which could help mitigate some concerns about sensitive data in the datasets.
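The pretraining-filtering point can be illustrated with the simplest possible version: drop documents that trip a blocklist before they ever reach the tokenizer. Real pipelines use trained classifiers rather than string matching, but the shape is the same (the blocklist and documents below are invented for the example):

```python
def filter_corpus(documents, blocklist):
    """Keep only documents containing no blocklisted term (case-insensitive).

    A toy stand-in for the classifier-based filtering real pipelines use.
    """
    kept, dropped = [], 0
    for doc in documents:
        lowered = doc.lower()
        if any(term in lowered for term in blocklist):
            dropped += 1
        else:
            kept.append(doc)
    return kept, dropped

docs = ["a recipe for bread", "LEAKED private keys inside", "notes on compilers"]
kept, dropped = filter_corpus(docs, blocklist={"leaked", "private keys"})
print(kept, dropped)  # two documents survive, one is dropped
```

Publishing the filter (rules or classifier) is itself a form of transparency: you can audit what was excluded even if the raw corpus never ships.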
pfortuny•11m ago
The human labor aspect is essential, rarely discussed, and, I am sure, very abusive.

People think of these models as "magic" and "science", but they don't realize the immense amount (in human-years) of clicking yes/no in front of thousands of input/output pairs.

I worked for some months as a Google Quality Rater (wow), so I know the job. This must be much worse.

oscarmoxon•3m ago
Agree that this makes it unlikely we see frontier training data open-sourced, but that's a separate problem from software and infrastructure transparency, which has none of those constraints. The training stack, the parallelism decisions, and documented failure modes are engineering knowledge, and there's no principled reason they can't ship.
hananova•1m ago
I’m not convinced that Grok’s dataset must contain CSAM for it to generate CSAM. Surely a combination of nude adults and clothed children would allow it to synthesize CSAM?

(Disclaimer: I’m not in favor of AI in general and definitely not in favor of what Grok is doing specifically. I’m just not entirely sold on the claim that its dataset must contain CSAM, though I think it probably does contain at least some, because cleaning up such a massive dataset carefully and thoroughly costs money that Elon wouldn’t want to spend.)

timmg•51m ago
Somewhat orthogonal, but: when do we expect "volunteer" groups to provide training data for LLMs for free for (like) hobbyist kinds of things? (Or do we?)

Like, Wikipedia probably provides a significant amount of training data for LLMs. And that is volunteer-run and free. (And I love the idea of it.)

But I can imagine (for example) board game enthusiasts wanting to have training data for games they love. Not just rules but strategies.

Or, really, any other kind of hobby.

That stuff (I guess) gets into training data by virtue of being in chat groups, etc. But I feel like an organized system (like Wikipedia) would be much better.

And if these sets were available, I would expect the foundation model trainers would love to include them. And the results would be better models for those very enthusiasts.

Tony Hoare has died

https://blog.computationalcomplexity.org/2026/03/tony-hoare-1934-2026.html
1022•speckx•5h ago•145 comments

You hired the AI to write the tests. Of course they pass

https://www.claudecodecamp.com/p/i-m-building-agents-that-run-while-i-sleep
45•aray07•48m ago•32 comments

Yann LeCun raises $1B to build AI that understands the physical world

https://www.wired.com/story/yann-lecun-raises-dollar1-billion-to-build-ai-that-understands-the-ph...
130•helloplanets•11h ago•257 comments

Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

https://github.com/RunanywhereAI/rcli
134•sanchitmonga22•2h ago•47 comments

Google to Discontinue Widevine Cloud License Service in April 2027

https://castlabs.com/blog/widevine-retiring-cloud-license-service/
22•dabinat•46m ago•4 comments

Debian decides not to decide on AI-generated contributions

https://lwn.net/SubscriberLink/1061544/125f911834966dd0/
205•jwilk•5h ago•163 comments

Billion-Parameter Theories

https://www.worldgov.org/complexity.html
60•seanlinehan•2h ago•27 comments

New HyperCard discovery: Neuromancer / Count Zero / Mona Lisa Overdrive

https://macintoshgarden.org/apps/neuromancer-count-zero-mona-lisa-overdrive
12•naves•41m ago•1 comment

FFmpeg-over-IP – Connect to remote FFmpeg servers

https://github.com/steelbrain/ffmpeg-over-ip
28•steelbrain•1h ago•12 comments

Intel Demos Chip to Compute with Encrypted Data

https://spectrum.ieee.org/fhe-intel
183•sohkamyung•6h ago•67 comments

Redox OS has adopted a Certificate of Origin policy and a strict no-LLM policy

https://gitlab.redox-os.org/redox-os/redox/-/blob/master/CONTRIBUTING.md
321•pjmlp•11h ago•325 comments

Rebasing in Magit

https://entropicthoughts.com/rebasing-in-magit
150•ibobev•6h ago•105 comments

Defeat as Method

https://www.cabinetmagazine.org/issues/71/khosravi.php
23•akbarnama•2h ago•1 comment

I put my whole life into a single database

https://howisfelix.today/
367•lukakopajtic•9h ago•175 comments

Meta acquires Moltbook

https://www.axios.com/2026/03/10/meta-facebook-moltbook-agent-social-network
281•mmayberry•5h ago•189 comments

Levels of Agentic Engineering

https://www.bassimeledath.com/blog/levels-of-agentic-engineering
35•bombastic311•11h ago•22 comments

Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs

https://dnhkng.github.io/posts/rys/
185•dnhkng•6h ago•65 comments

Because Algospeak

https://www.tbray.org/ongoing/When/202x/2026/03/05/Because-Algospeak
7•zdw•2d ago•0 comments

I built a programming language using Claude Code

https://ankursethi.com/blog/programming-language-claude-code/
71•GeneralMaximus•3h ago•90 comments

Launch HN: Didit (YC W26) – Stripe for Identity Verification

39•rosasalberto•4h ago•43 comments

Converting Binary Floating-Point Numbers to Shortest Decimal Strings

https://onlinelibrary.wiley.com/doi/10.1002/spe.70056
6•matt_d•3d ago•0 comments

Throwing away 18 months of code and starting over

https://tompiagg.io/posts/we-threw-away-1-5-years-of-code
43•tomaspiaggio12•4h ago•45 comments

I used pulsar detection techniques to turn a phone into a watch timegrapher

https://www.chronolog.watch/timegrapher
44•tylerjaywood•3d ago•11 comments

RFC 454545 – Human Em Dash Standard

https://gist.github.com/bignimbus/a75cc9d703abf0b21a57c0d21a79e2be
98•jdauriemma•5h ago•88 comments

Maybe the G in AGI stands for Gemini

https://www.robinsloan.com/lab/gemini-agi/
10•speckx•1h ago•2 comments

Surpassing vLLM with a Generated Inference Stack

https://infinity.inc/case-studies/qwen3-optimization
16•lukebechtel•4h ago•4 comments

Online age-verification tools for child safety are surveilling adults

https://www.cnbc.com/2026/03/08/social-media-child-safety-internet-ai-surveillance.html
398•bilsbie•7h ago•243 comments

The Gervais Principle, or the Office According to "The Office" (2009)

https://www.ribbonfarm.com/2009/10/07/the-gervais-principle-or-the-office-according-to-the-office/
257•janandonly•3d ago•109 comments

The Enterprise Context Layer

https://andychen32.substack.com/p/the-enterprise-context-layer
27•zachperkel•4h ago•4 comments