Show HN: Django-rclone: Database and media backups for Django, powered by rclone

https://github.com/kjnez/django-rclone
1•cui•1m ago•0 comments

NY lawmakers proposed statewide data center moratorium

https://www.niagara-gazette.com/news/local_news/ny-lawmakers-proposed-statewide-data-center-morat...
1•geox•2m ago•0 comments

OpenClaw AI chatbots are running amok – these scientists are listening in

https://www.nature.com/articles/d41586-026-00370-w
1•EA-3167•3m ago•0 comments

Show HN: AI agent forgets user preferences every session. This fixes it

https://www.pref0.com/
3•fliellerjulian•5m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model

https://github.com/ghostty-org/ghostty/pull/10559
2•DustinEchoes•7m ago•0 comments

Show HN: SSHcode – Always-On Claude Code/OpenCode over Tailscale and Hetzner

https://github.com/sultanvaliyev/sshcode
1•sultanvaliyev•7m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/microsoft-appointed-a-quality-czar-he-has-no-direct-reports-and-no-b...
1•RickJWagner•9m ago•0 comments

Multi-agent coordination on Claude Code: 8 production pain points and patterns

https://gist.github.com/sigalovskinick/6cc1cef061f76b7edd198e0ebc863397
1•nikolasi•9m ago•0 comments

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

https://www.nytimes.com/2026/02/07/technology/washington-post-will-lewis.html
3•jbegley•10m ago•0 comments

DevXT – Building the Future with AI That Acts

https://devxt.com
2•superpecmuscles•11m ago•4 comments

A Minimal OpenClaw Built with the OpenCode SDK

https://github.com/CefBoud/MonClaw
1•cefboud•11m ago•0 comments

The silent death of Good Code

https://amit.prasad.me/blog/rip-good-code
2•amitprasad•11m ago•0 comments

The Internal Negotiation You Have When Your Heart Rate Gets Uncomfortable

https://www.vo2maxpro.com/blog/internal-negotiation-heart-rate
1•GoodluckH•13m ago•0 comments

Show HN: Glance – Fast CSV inspection for the terminal (SIMD-accelerated)

https://github.com/AveryClapp/glance
2•AveryClapp•14m ago•0 comments

Busy for the Next Fifty to Sixty Bud

https://pestlemortar.substack.com/p/busy-for-the-next-fifty-to-sixty-had-all-my-money-in-bitcoin-...
1•mithradiumn•14m ago•0 comments

Imperative

https://pestlemortar.substack.com/p/imperative
1•mithradiumn•15m ago•0 comments

Show HN: I decomposed 87 tasks to find where AI agents structurally collapse

https://github.com/XxCotHGxX/Instruction_Entropy
1•XxCotHGxX•19m ago•1 comments

I went back to Linux and it was a mistake

https://www.theverge.com/report/875077/linux-was-a-mistake
3•timpera•20m ago•1 comments

Octrafic – open-source AI-assisted API testing from the CLI

https://github.com/Octrafic/octrafic-cli
1•mbadyl•22m ago•1 comments

US Accuses China of Secret Nuclear Testing

https://www.reuters.com/world/china/trump-has-been-clear-wanting-new-nuclear-arms-control-treaty-...
2•jandrewrogers•22m ago•1 comments

Peacock. A New Programming Language

2•hashhooshy•27m ago•1 comments

A postcard arrived: 'If you're reading this I'm dead, and I really liked you'

https://www.washingtonpost.com/lifestyle/2026/02/07/postcard-death-teacher-glickman/
3•bookofjoe•28m ago•1 comments

What to know about the software selloff

https://www.morningstar.com/markets/what-know-about-software-stock-selloff
2•RickJWagner•32m ago•0 comments

Show HN: Syntux – generative UI for websites, not agents

https://www.getsyntux.com/
3•Goose78•33m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/ab75cef97954
2•birdculture•33m ago•0 comments

AI overlay that reads anything on your screen (invisible to screen capture)

https://lowlighter.app/
1•andylytic•34m ago•1 comments

Show HN: Seafloor, be up and running with OpenClaw in 20 seconds

https://seafloor.bot/
1•k0mplex•35m ago•0 comments

Tesla turbine-inspired structure generates electricity using compressed air

https://techxplore.com/news/2026-01-tesla-turbine-generates-electricity-compressed.html
2•PaulHoule•36m ago•0 comments

State Department deleting 17 years of tweets (2009-2025); preservation needed

https://www.npr.org/2026/02/07/nx-s1-5704785/state-department-trump-posts-x
5•sleazylice•36m ago•1 comments

Learning to code, or building side projects with AI help, this one's for you

https://codeslick.dev/learn
1•vitorlourenco•37m ago•0 comments

Supervised fine tuning on curated data is reinforcement learning

https://arxiv.org/abs/2507.12856
71•GabrielBianconi•6mo ago

Comments

mandevil•6mo ago
Interesting to see two independent researchers on this. Makes me curious what the backstory is. Side project?
babelfish•6mo ago
Especially interesting given they both work for Google DeepMind.
GabrielBianconi•6mo ago
Yeah, I hadn't noticed!
jtspringenberg•6mo ago
Author here. Just to clarify: neither of us works for DeepMind any more. This was purely an independent effort for the sake of research and understanding! Happy to answer any questions.
iandanforth•6mo ago
How is this kind of analogy helpful? You can frame any optimization problem as RL if you try hard enough. RL is an optimization method that calls its optimum "reward maximization", and you can craft the reward function any way you want.

The key point about RL is that it is a sequential decision making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, then why bother calling it RL?

imtringued•6mo ago
I personally am quite disappointed by the abstract:

"Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting."

Uh, no? SFT is maximizing the RL objective in a dense reward setting. The entire point of RL, specifically actor-critic and Q-learning, is that the RL method turns the sparse reward into a continuous, dense reward on which a model can be trained with classic gradient descent.

I mean, look at the definition of Q-learning and the Bellman equation it uses: it chooses the current action based on whether it maximizes the predicted reward, not the actual reward, which doesn't have to be continuous or produce a gradient. You can build an RL-based maze solver where only the goal gives a reward to the model and it would work, though it would train extremely slowly.

Meanwhile supervised fine tuning always produces a continuous gradient on every single token.
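
To make the maze example concrete, here is a toy tabular Q-learning sketch (my own illustration, not from the paper; the grid size, hyperparameters, and reward placement are arbitrary). The only reward is +1 for reaching the goal cell, yet the Bellman backup still propagates that value back to the start state, just slowly.

    # Toy illustration (not from the paper): tabular Q-learning on a 5x5 grid
    # maze where the ONLY reward is +1 for reaching the goal cell. Despite the
    # sparse reward, the Bellman backup propagates value backwards from the
    # goal, so the agent still learns a path -- just slowly.
    import random

    W, H = 5, 5
    GOAL = (4, 4)
    ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

    def step(state, action):
        x, y = state
        dx, dy = action
        nxt = (min(max(x + dx, 0), W - 1), min(max(y + dy, 0), H - 1))
        reward = 1.0 if nxt == GOAL else 0.0   # sparse: reward only at the goal
        return nxt, reward, nxt == GOAL

    Q = {(s, a): 0.0
         for s in [(x, y) for x in range(W) for y in range(H)]
         for a in range(len(ACTIONS))}
    alpha, gamma, eps = 0.1, 0.95, 0.2

    for _ in range(2000):                      # episodes
        state = (0, 0)
        for _ in range(200):                   # step limit per episode
            # epsilon-greedy over the *predicted* return, not the actual reward
            if random.random() < eps:
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q[(state, i)])
            nxt, reward, done = step(state, ACTIONS[a])
            # Bellman update: bootstrap from the best predicted value of the next state
            target = reward if done else reward + gamma * max(
                Q[(nxt, i)] for i in range(len(ACTIONS)))
            Q[(state, a)] += alpha * (target - Q[(state, a)])
            state = nxt
            if done:
                break

    print("learned value of the start state:",
          max(Q[((0, 0), i)] for i in range(len(ACTIONS))))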

chongliqin•6mo ago
TD-based approaches can have an advantage in sparse-reward settings, but they come with a heap of other problems, especially in the off-policy setting (see the deadly triad), and are typically not used for LLM training.

Here we make a connection to REINFORCE-style policy gradients, which would not show any of the behavior you mentioned above.
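
Roughly, in simplified notation (R, mu, and pi_theta are my symbols here; see the paper for the precise statement): let R(tau) in {0, 1} mark whether a trajectory would pass curation and let mu be the policy that generated the data. The REINFORCE-style view is

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)], \qquad
    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big],

and, assuming mu covers pi_theta's support, importance sampling plus Jensen's inequality give

    \log J(\theta)
      = \log \mathbb{E}_{\tau \sim \mu}\!\Big[ \tfrac{\pi_\theta(\tau)}{\mu(\tau)}\, R(\tau) \Big]
      \;\ge\; \mathbb{E}_{\tau \sim \mu(\cdot \mid R = 1)}\big[ \log \pi_\theta(\tau) \big] + \mathrm{const},

where the constant is independent of theta. Up to that constant, the right-hand side is exactly the SFT log-likelihood on curated (R = 1) data, which is the lower bound quoted from the abstract above.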

anndvision•6mo ago
We recently ran similar experiments and saw that fine-tuning small models on automatically curated high-quality outputs from a large model can beat large-model performance while reducing inference costs by up to 30x and inference time by up to 4x.

We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).

We're still running a few experiments and plan to update the post with additional results in a few days.

Looking forward to trying out importance weighting soon!

Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...
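
In pseudocode, the curation loop is roughly the following (a generic sketch for illustration; generate_with_large_model, passes_quality_check, and curated_sft.jsonl are placeholder names, not a real API):

    # Generic sketch of curated behavior cloning / distillation as described
    # above. Every function name here is a placeholder for illustration only.
    import json

    def curate_dataset(tasks, generate_with_large_model, passes_quality_check):
        """Keep only the large model's trajectories that pass an automatic check."""
        curated = []
        for task in tasks:
            trajectory = generate_with_large_model(task)    # e.g. a multi-turn rollout
            if passes_quality_check(task, trajectory):      # acts as a sparse 0/1 reward
                curated.append({"prompt": task, "completion": trajectory})
        return curated

    def write_sft_file(curated, path="curated_sft.jsonl"):
        """Dump curated pairs as JSONL, the usual input format for SFT pipelines."""
        with open(path, "w") as f:
            for example in curated:
                f.write(json.dumps(example) + "\n")

    # A small model is then fine-tuned with plain SFT on curated_sft.jsonl;
    # per the paper above, that maximizes a lower bound on the RL objective,
    # with passes_quality_check playing the role of the sparse reward.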

chongliqin•6mo ago
Cool! If you are interested, we have open sourced our code: https://github.com/emmyqin/iw_sft
anndvision•6mo ago
thanks
TheTaytay•6mo ago
Thanks for this - I've spent the last hour reading your docs and blog. I like the primitives you've exposed in your API, and I particularly like the decision to separate out the structured inputs from the prompt when you record an LLM call, so I can finally perform optimizations and evals on past calls.

Quick question: you mentioned Unsloth in the blog post. Which of the fine-tuning providers mentioned is using Unsloth under the hood?

GabrielBianconi•6mo ago
[I'm his coworker.] We ran Unsloth ourselves on a GPU-by-the-hour server. We have a notebook in the repository showing how to query historical data and use it with Unsloth.

It's a WIP PR that we plan to merge soon: https://github.com/tensorzero/tensorzero/pull/2273
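
For anyone curious, the Unsloth side is the standard LoRA SFT flow. A rough sketch is below; this is not the notebook from the PR: the model name, data file, and hyperparameters are placeholders, and newer TRL releases replace TrainingArguments with SFTConfig.

    # Generic LoRA SFT sketch with Unsloth + TRL -- placeholders throughout,
    # not the notebook referenced in the PR above.
    from unsloth import FastLanguageModel
    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen2.5-7B-Instruct",   # placeholder base model
        max_seq_length=4096,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

    # Assume the curated trajectories were exported from historical inference
    # data to a JSONL file with a single "text" field per example.
    dataset = load_dataset("json", data_files="curated_sft.jsonl", split="train")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=4096,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            num_train_epochs=1,
            learning_rate=2e-4,
            output_dir="outputs",
        ),
    )
    trainer.train()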

henriquegodoy•6mo ago
It's cool to see the perspective that many problems (certain kinds of communication problems; look at lawyers, compliance, etc.) can be solved by treating AI less as agents and more as modular components within a larger system. Once we build a working process, monitored through evals, we can then reduce costs by distilling these modules. That means starting with superintelligent models and later distilling them down to just a few billion parameters, instead of needing hundreds of billions.
stolencode•6mo ago
> For example achieving 66.7% on the AIME 2024 dataset.

We worked _really_ hard, burned _tons_ of cash, and we're proud of our D- output. No wonder there are more papers published than actual work being done.

supermdguy•6mo ago
That corresponds to 10/15, which is actually really good (the median is around 6).

https://artofproblemsolving.com/wiki/index.php/AMC_historica...

stolencode•6mo ago
Isn't the test taken only by students under the age of 12?

Meanwhile the model is trained on these specific types of problems, does not have an apparent time or resource limit, and does not have to take the test in a proctored environment.

It's D- work. Compared to a 12 year old, okay, maybe it's B+. Is this really the point you wanted to make?

jpcompartir•6mo ago
This is a nonsense critique.

Modest results are worth publishing, as are bad results.

markisus•6mo ago
Something seems off with equation (5).

Just imagining Monte Carlo sampling it, the middle expectation will have a bunch of zeros due to the indicator function and the right expectation won’t.

I can make the middle expectation as close to zero as I like by making the success threshold sufficiently high.

chongliqin•6mo ago
Ah yes, you are right: the RHS was meant to be proportional to the middle expectation (see the equation below); for equality, the RHS needs to be multiplied by a normalization constant that is independent of theta. Note that this doesn't affect the bounds, since the constant is the same across equations. Will update the paper to incorporate this.
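
Concretely, the identity at stake (in simplified notation, not the paper's exact equation (5)) is

    \mathbb{E}_{\tau \sim \mu}\big[ \mathbf{1}\{R(\tau) \ge c\}\, f_\theta(\tau) \big]
      \;=\; P_\mu\big(R(\tau) \ge c\big)\,
            \mathbb{E}_{\tau \sim \mu(\cdot \mid R \ge c)}\big[ f_\theta(\tau) \big],

so raising the threshold c shrinks the left-hand side through the P_mu(R >= c) factor, while the conditional expectation on the right does not shrink with it. The two expectations agree only up to that theta-independent normalization, which is why the downstream bounds are unchanged.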