
An open replacement for the IBM 3174 Establishment Controller

https://github.com/lowobservable/oec
1•bri3d•2m ago•0 comments

The P in PGP isn't for pain: encrypting emails in the browser

https://ckardaris.github.io/blog/2026/02/07/encrypted-email.html
2•ckardaris•4m ago•0 comments

Show HN: Mirror Parliament where users vote on top of politicians and draft laws

https://github.com/fokdelafons/lustra
1•fokdelafons•4m ago•1 comment

Ask HN: Opus 4.6 ignoring instructions, how to use 4.5 in Claude Code instead?

1•Chance-Device•6m ago•0 comments

We Mourn Our Craft

https://nolanlawson.com/2026/02/07/we-mourn-our-craft/
1•ColinWright•9m ago•0 comments

Jim Fan calls pixels the ultimate motor controller

https://robotsandstartups.substack.com/p/humanoids-platform-urdf-kitchen-nvidias
1•robotlaunch•12m ago•0 comments

Exploring a Modern SMPTE 2110 Broadcast Truck with My Dad

https://www.jeffgeerling.com/blog/2026/exploring-a-modern-smpte-2110-broadcast-truck-with-my-dad/
1•HotGarbage•12m ago•0 comments

AI UX Playground: Real-world examples of AI interaction design

https://www.aiuxplayground.com/
1•javiercr•13m ago•0 comments

The Field Guide to Design Futures

https://designfutures.guide/
1•andyjohnson0•14m ago•0 comments

The Other Leverage in Software and AI

https://tomtunguz.com/the-other-leverage-in-software-and-ai/
1•gmays•15m ago•0 comments

AUR malware scanner written in Rust

https://github.com/Sohimaster/traur
3•sohimaster•18m ago•1 comment

Free FFmpeg API [video]

https://www.youtube.com/watch?v=6RAuSVa4MLI
3•harshalone•18m ago•1 comment

Are AI agents ready for the workplace? A new benchmark raises doubts

https://techcrunch.com/2026/01/22/are-ai-agents-ready-for-the-workplace-a-new-benchmark-raises-do...
2•PaulHoule•23m ago•0 comments

Show HN: AI Watermark and Stego Scanner

https://ulrischa.github.io/AIWatermarkDetector/
1•ulrischa•23m ago•0 comments

Clarity vs. complexity: the invisible work of subtraction

https://www.alexscamp.com/p/clarity-vs-complexity-the-invisible
1•dovhyi•24m ago•0 comments

Solid-State Freezer Needs No Refrigerants

https://spectrum.ieee.org/subzero-elastocaloric-cooling
2•Brajeshwar•25m ago•0 comments

Ask HN: Will LLMs/AI Decrease Human Intelligence and Make Expertise a Commodity?

1•mc-0•26m ago•1 comment

From Zero to Hero: A Brief Introduction to Spring Boot

https://jcob-sikorski.github.io/me/writing/from-zero-to-hello-world-spring-boot
1•jcob_sikorski•26m ago•1 comment

NSA detected phone call between foreign intelligence and person close to Trump

https://www.theguardian.com/us-news/2026/feb/07/nsa-foreign-intelligence-trump-whistleblower
10•c420•27m ago•1 comment

How to Fake a Robotics Result

https://itcanthink.substack.com/p/how-to-fake-a-robotics-result
1•ai_critic•27m ago•0 comments

It's time for the world to boycott the US

https://www.aljazeera.com/opinions/2026/2/5/its-time-for-the-world-to-boycott-the-us
3•HotGarbage•27m ago•0 comments

Show HN: Semantic Search for terminal commands in the Browser (No Back end)

https://jslambda.github.io/tldr-vsearch/
1•jslambda•28m ago•1 comment

The AI CEO Experiment

https://yukicapital.com/blog/the-ai-ceo-experiment/
2•romainsimon•29m ago•0 comments

Speed up responses with fast mode

https://code.claude.com/docs/en/fast-mode
5•surprisetalk•33m ago•1 comment

MS-DOS game copy protection and cracks

https://www.dosdays.co.uk/topics/game_cracks.php
4•TheCraiggers•34m ago•0 comments

Updates on GNU/Hurd progress [video]

https://fosdem.org/2026/schedule/event/7FZXHF-updates_on_gnuhurd_progress_rump_drivers_64bit_smp_...
2•birdculture•35m ago•0 comments

Epstein took a photo of his 2015 dinner with Zuckerberg and Musk

https://xcancel.com/search?f=tweets&q=davenewworld_2%2Fstatus%2F2020128223850316274
14•doener•35m ago•2 comments

MyFlames: View MySQL execution plans as interactive FlameGraphs and BarCharts

https://github.com/vgrippa/myflames
1•tanelpoder•36m ago•0 comments

Show HN: LLM of Babel

https://clairefro.github.io/llm-of-babel/
1•marjipan200•36m ago•0 comments

A modern iperf3 alternative with a live TUI, multi-client server, QUIC support

https://github.com/lance0/xfr
3•tanelpoder•38m ago•0 comments

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

https://arxiv.org/abs/2506.05209
68•djoldman•4mo ago

Comments

secret-noun•4mo ago
> we manually curated a set of over 2,000 YouTube channels that release original openly licensed content containing speech. From these channels, we retrieved and transcribed (using Whisper) over 1.1 million openly licensed videos comprising more than 470,000 hours of content.

This is why Gemini has such an advantage.

Also, link to explore data: https://huggingface.co/collections/common-pile/common-pile-v...

otherme123•4mo ago
The abstract is open about this data being intended for training models. But a lot of this data comes from models, like Whisper.
ACCount37•4mo ago
What's your concern?
ggm•4mo ago
You don't believe in model collapse? Or don't think it applies to a phase shift from audio to written texts?
simonw•4mo ago
Personally I don't believe in model collapse. Has anyone demonstrated it occurring in the wild, outside of the tiny set of papers that deliberately caused it to happen?

I think model collapse gets talked about so much because it is irresistible schadenfreude. The idea of models eating their own tails in a way that leads to their inevitable demise is captivating to a lot of people, especially AI skeptics.

pama•4mo ago
I agree. A partial counterexample is the RL training loop on verifiable tasks, which uses the model in a loop to generate training data. Another one is the cleanup/prioritization of the pretraining data using earlier models.

More generally, a lot of ideas have been speculated about based on very tiny models in controlled settings, and they didn't pan out in real LLMs. There probably exists a minimal compute threshold for overcoming generalization traps.
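A minimal sketch of why such verifiable-task loops needn't collapse: the verifier rejects wrong outputs, so only correct samples enter the next round's training data. Here `toy_model` is a hypothetical stand-in (a noisy adder), not a real LLM, and the "task" is exact arithmetic purely for illustration:

```python
import random

def toy_model(a, b, rng):
    """Hypothetical stand-in for a model: usually right, sometimes wrong."""
    guess = a + b
    if rng.random() < 0.3:  # 30% of raw samples are wrong
        guess += rng.choice([-1, 1])
    return guess

def verifier(a, b, guess):
    """Verifiable task: an exact arithmetic check."""
    return guess == a + b

def collect_verified(n, rng):
    """Keep only model outputs that pass the verifier, as training pairs."""
    data = []
    while len(data) < n:
        a, b = rng.randrange(100), rng.randrange(100)
        guess = toy_model(a, b, rng)
        if verifier(a, b, guess):
            data.append(((a, b), guess))
    return data

rng = random.Random(0)
dataset = collect_verified(100, rng)
# every retained example is correct by construction
assert all(g == a + b for (a, b), g in dataset)
```

The filter is what breaks the feedback loop: the model's error rate never enters the retained data.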

marbro•4mo ago
Carbon-based model collapse is known as groupthink and happens constantly.
ACCount37•4mo ago
"Model collapse" isn't real. It's a laboratory failure mode that doesn't happen in real world environments.

It's popular because some people latched onto the idea - desperately wanting something to stop the AI tech from advancing. It, quite obviously, doesn't stop the AI tech from advancing.

Now, you can write an entire research paper on why model collapse happens or fails to happen. But a simple way to think of it is: looping AI onto itself multiple times amplifies that AI's own deficiencies, distortions and idiosyncrasies - until, after enough iterations, they come to completely dominate its outputs.
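That amplification loop can be caricatured in a few lines: repeatedly refit a distribution to finite samples of itself, and the fitted spread tends to drift toward zero over generations. This is a toy statistical analogy, not a claim about any real training pipeline:

```python
import random
import statistics

def collapse_demo(generations=200, n=50, seed=0):
    """Refit a Gaussian to n samples drawn from the previous generation's fit.
    With finite samples, estimation error compounds across generations."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # slightly biased low, which feeds the drift
        history.append(sigma)
    return history

hist = collapse_demo()
```

Each generation inherits only what the previous fit captured, which is the "amplifies its own distortions" mechanism in miniature.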

This doesn't apply at all to training an LLM on Whisper outputs that are, in turn, based on human-generated videos. The LLM will inherit some Whisper quirks, but most of the data in Whisper outputs comes from the videos themselves.

everforward•4mo ago
No, I don't think it applies here. The semantics and speech patterns were generated by a human; Whisper just transcribed them.

There is some risk that Whisper transcribed inaccurately, but that’s less model collapse and more “the dataset is bad”.

numpad0•4mo ago
I guess the transcripts aren't guaranteed clean? *Silence* transcribed as "Like and Subscribe", etc.
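One cheap heuristic for catching the best-known Whisper failure mode (a token repeated in a loop over silence or music) before transcripts enter a corpus. The threshold here is an arbitrary assumption, not anything from the paper:

```python
def has_hallucination_run(text, max_repeat=4):
    """Flag transcripts where one word repeats many times in a row,
    a common speech-to-text failure mode on silence or music."""
    words = text.lower().split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeat:
            return True
    return False

assert has_hallucination_run("bye bye bye bye bye bye")
assert not has_hallucination_run("thanks for watching, bye bye")
```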
benterix•4mo ago
So?
otherme123•4mo ago
I don't know much about LLM training, but earlier AI models needed clean data to train on. You shouldn't train on generated data.

For example, say you had a classifier that works at 95% precision, trained with carefully labeled data. Then, to train the next version, you download 1 TB of images, label them with your previous model, and use that to retrain. Do you expect to get better than 95%, or are you poisoning your model?

I'm asking: can you do that with an LLM? Feed it data that's known to be 95% accurate at best? I've used Whisper, and I often get runs of words like "bye bye bye bye bye bye" when the word was only said once. Should I use that kind of data to train an LLM?

I saw an experiment where an LLM was fed an image and asked to reproduce it, then the process was repeated with each generated image. After ten or so cycles, the content (a photo of a human head) was barely recognizable.
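The relabeling thought experiment above can be put into a toy recurrence. This assumes, unrealistically, that each retraining round flips labels independently with 5% probability; real self-training errors are correlated (a model tends to repeat its own mistakes), but the drift-toward-chance intuition is the same:

```python
def relabel_accuracy(p0=1.0, flip=0.05, rounds=10):
    """Fraction of correct binary labels after repeated relabeling,
    assuming each pass independently flips any label with probability `flip`."""
    p = p0
    for _ in range(rounds):
        # a correct label survives the pass, or a wrong one flips back
        p = p * (1 - flip) + (1 - p) * flip
    return p

print(round(relabel_accuracy(), 5))  # 0.67434 after 10 rounds; the limit is 0.5
```

After ten rounds only about two thirds of the labels are still right, and the fixed point is coin-flip accuracy.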

electroglyph•4mo ago
Phi models are notorious for using mostly synthetic data
orbital-decay•4mo ago
The reality of working with humongous datasets is that they're always bootstrapped like this, in multiple steps. In LLMs in particular, the entire post-training step is always done on synthetic data. There are ways to avoid the failure modes typical for that (like model collapse), and you need much less real data to keep the model in check than you probably think.
klft•4mo ago
Whisper is used for speech-to-text conversion, not to generate the text.
estimator7292•4mo ago
It's still AI-generated text that is not in any way guaranteed to be correct or accurate.
UltraSane•4mo ago
Its accuracy can be and is quantified.