Fault Tolerant Llama training

https://pytorch.org/blog/fault-tolerant-llama-training-with-2000-synthetic-failures-every-15-seconds-and-no-checkpoints-on-crusoe-l40s/
66•Mougatine•7mo ago

Comments

d4l3k•7mo ago
Hey, nice to see this here!

I'm the primary author so happy to answer any questions you might have!

bwfan123•7mo ago
Why isn't there more investment in semi-synchronous training? Is it that the convergence is iffy? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.
d4l3k•7mo ago
Recently there's been a lot of interest and improvements in semi-synchronous training. The Streaming DiLoCo paper came out this year and is a big step forward for datacenter semi-sync.

Historically it's been limited to areas like federated learning for low-power/low-network training, but with the massive increase in the number of GPUs it's becoming relevant even for training in datacenters.

It is another variable ML researchers have to tune, so it does add some complexity, and I expect most folks just aren't familiar with it yet.

On "typed language": all of torchft is typed! The coordination/quorum layers are written in Rust w/ gRPC and the front-end is typed Python with Pyre, since it has to interact with PyTorch and model code.
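
For readers unfamiliar with the semi-synchronous idea discussed above, here is a minimal toy sketch. It is plain local SGD with parameter averaging on a one-dimensional quadratic loss. It is not Streaming DiLoCo and not torchft's implementation. The function names `local_sgd` and `sync_count` and all parameter values are made up for illustration.

```python
from typing import List, Tuple


def local_sgd(
    targets: List[float],
    rounds: int = 20,
    local_steps: int = 8,
    lr: float = 0.1,
) -> float:
    """Toy semi-synchronous training: worker i minimizes (w - targets[i])^2."""
    global_w = 0.0
    for _ in range(rounds):
        replicas = []
        for target in targets:
            w = global_w  # every worker starts the round from the shared weights
            for _ in range(local_steps):
                grad = 2.0 * (w - target)  # gradient of (w - target)^2
                w -= lr * grad  # local step, no communication
            replicas.append(w)
        # The only synchronization point: average the replicas once per round.
        global_w = sum(replicas) / len(replicas)
    return global_w


def sync_count(rounds: int, local_steps: int) -> Tuple[int, int]:
    """Communication rounds for (semi-sync, fully-sync) at the same step count."""
    return rounds, rounds * local_steps


if __name__ == "__main__":
    print(local_sgd([1.0, 2.0, 6.0]))
    print(sync_count(20, 8))
```

In this toy the workers communicate once per `local_steps` gradient steps instead of after every step. The number of local steps is the extra knob that has to be tuned.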

bwfan123•7mo ago
Thanks! I am curious how this relates to the recent "monarch" announcement, which has similar goals of facilitating large-scale fault-tolerant training [1].

[1] https://github.com/pytorch-labs/monarch/issues/175#issuecomm...

d4l3k•7mo ago
We're working on making these composable. torchft is largely focused on the model integration and algorithms, whereas Monarch handles more of the orchestration/monitoring. They operate at somewhat different layers, but the plan is for torchft to provide the fault-tolerant algorithms that can be used either in Monarch or in a standard PTD job.
timzaman•7mo ago
300 L40s? What's this, 1998?
kcorbitt•7mo ago
I was curious about this so I had o3 do a bit of research. Turns out 300 L40s have more compute than any supercomputer before 2013 (and arguably before 2016, depending on how you count reduced-precision FLOPs).

https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4...

d4l3k•7mo ago
Hey Tim, how's it going?

Interested in lending PyTorch some compute? :)

torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog was to demonstrate correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPS.

Stay tuned though -- planning on doing some much larger demos on B200s!

bjt12345•7mo ago
This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.
foobiekr•7mo ago
Ultra Ethernet will do almost nothing. It’s a rubber-stamped version of Broadcom’s design, and Marvell/Cisco/etc. will just add it to their ASICs. It remains to be seen whether Spectrum-X or ConnectX will. If not, none of it matters.

These chips are $30m-$100m projects a pop. After the embarrassingly brutal failure of Barefoot, nobody is going to do ASICs.

zxexz•7mo ago
This is awesome, can’t wait to try out these techniques. At least a week a year of my time for the past few years has gone towards recovering from a fault crashing a training run. Sometimes it's environment related, sometimes shared storage, sometimes just a slightly faulty IB cable.
d4l3k•7mo ago
Let me know how it goes! If you're interested in chatting or run into any problems, feel free to reach out via the links in my profile.
anonymousDan•7mo ago
What kind of failures are you typically concerned with here?
d4l3k•7mo ago
We want to be tolerant of application bugs and host/GPU failures that can be solved by replacing/restarting the machine. We don't have much control over external services and network failures, so we aren't aiming to solve those.

For specific types of failures check out the section on "Reliability and Operational Challenges" from the Llama 3 paper https://ai.meta.com/research/publications/the-llama-3-herd-o...