frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

First Proof

https://arxiv.org/abs/2602.05192
2•samasblack•1m ago•1 comments

I squeezed a BERT sentiment analyzer into 1GB RAM on a $5 VPS

https://mohammedeabdelaziz.github.io/articles/trendscope-market-scanner
1•mohammede•3m ago•0 comments

Kagi Translate

https://translate.kagi.com
1•microflash•3m ago•0 comments

Building Interactive C/C++ workflows in Jupyter through Clang-REPL [video]

https://fosdem.org/2026/schedule/event/QX3RPH-building_interactive_cc_workflows_in_jupyter_throug...
1•stabbles•5m ago•0 comments

Tactical tornado is the new default

https://olano.dev/blog/tactical-tornado/
1•facundo_olano•6m ago•0 comments

Full-Circle Test-Driven Firmware Development with OpenClaw

https://blog.adafruit.com/2026/02/07/full-circle-test-driven-firmware-development-with-openclaw/
1•ptorrone•7m ago•0 comments

Automating Myself Out of My Job – Part 2

https://blog.dsa.club/automation-series/automating-myself-out-of-my-job-part-2/
1•funnyfoobar•7m ago•0 comments

Google staff call for firm to cut ties with ICE

https://www.bbc.com/news/articles/cvgjg98vmzjo
18•tartoran•7m ago•0 comments

Dependency Resolution Methods

https://nesbitt.io/2026/02/06/dependency-resolution-methods.html
1•zdw•8m ago•0 comments

Crypto firm apologises for sending Bitcoin users $40B by mistake

https://www.msn.com/en-ie/money/other/crypto-firm-apologises-for-sending-bitcoin-users-40-billion...
1•Someone•8m ago•0 comments

Show HN: iPlotCSV: CSV Data, Visualized Beautifully for Free

https://www.iplotcsv.com/demo
1•maxmoq•9m ago•0 comments

There's no such thing as "tech" (Ten years later)

https://www.anildash.com/2026/02/06/no-such-thing-as-tech/
1•headalgorithm•9m ago•0 comments

List of unproven and disproven cancer treatments

https://en.wikipedia.org/wiki/List_of_unproven_and_disproven_cancer_treatments
1•brightbeige•10m ago•0 comments

Me/CFS: The blind spot in proactive medicine (Open Letter)

https://github.com/debugmeplease/debug-ME
1•debugmeplease•10m ago•1 comments

Ask HN: What are the word games do you play everyday?

1•gogo61•13m ago•1 comments

Show HN: Paper Arena – A social trading feed where only AI agents can post

https://paperinvest.io/arena
1•andrenorman•15m ago•0 comments

TOSTracker – The AI Training Asymmetry

https://tostracker.app/analysis/ai-training
1•tldrthelaw•19m ago•0 comments

The Devil Inside GitHub

https://blog.melashri.net/micro/github-devil/
2•elashri•19m ago•0 comments

Show HN: Distill – Migrate LLM agents from expensive to cheap models

https://github.com/ricardomoratomateos/distill
1•ricardomorato•19m ago•0 comments

Show HN: Sigma Runtime – Maintaining 100% Fact Integrity over 120 LLM Cycles

https://github.com/sigmastratum/documentation/tree/main/sigma-runtime/SR-053
1•teugent•19m ago•0 comments

Make a local open-source AI chatbot with access to Fedora documentation

https://fedoramagazine.org/how-to-make-a-local-open-source-ai-chatbot-who-has-access-to-fedora-do...
1•jadedtuna•21m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model by Mitchellh

https://github.com/ghostty-org/ghostty/pull/10559
1•samtrack2019•21m ago•0 comments

Software Factories and the Agentic Moment

https://factory.strongdm.ai/
1•mellosouls•21m ago•1 comments

The Neuroscience Behind Nutrition for Developers and Founders

https://comuniq.xyz/post?t=797
1•01-_-•21m ago•0 comments

Bang bang he murdered math {the musical } (2024)

https://taylor.town/bang-bang
1•surprisetalk•21m ago•0 comments

A Night Without the Nerds – Claude Opus 4.6, Field-Tested

https://konfuzio.com/en/a-night-without-the-nerds-claude-opus-4-6-in-the-field-test/
1•konfuzio•24m ago•0 comments

Could ionospheric disturbances influence earthquakes?

https://www.kyoto-u.ac.jp/en/research-news/2026-02-06-0
2•geox•26m ago•1 comments

SpaceX's next astronaut launch for NASA is officially on for Feb. 11 as FAA clea

https://www.space.com/space-exploration/launches-spacecraft/spacexs-next-astronaut-launch-for-nas...
1•bookmtn•27m ago•0 comments

Show HN: One-click AI employee with its own cloud desktop

https://cloudbot-ai.com
2•fainir•29m ago•0 comments

Show HN: Poddley – Search podcasts by who's speaking

https://poddley.com
1•onesandofgrain•30m ago•0 comments
Open in hackernews

ReasoningGym: Reasoning Environments for RL with Verifiable Rewards

https://arxiv.org/abs/2505.24760
105•t55•8mo ago

Comments

starzmustdie•8mo ago
GitHub: https://github.com/open-thought/reasoning-gym
phh•8mo ago
Cool cool. I'm a bit put off by calling it "reasoning" /"thought". These RL targets can be achieved without "thinking" model but still cool. Gotta love the brainfuck task.

I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands RL tasks (without any proof whatsoever, so rather a feeling). So I've been wanting a "RL Zoo" for quite a while. I hope this project won't be a one-off and will be maintained long term with many external contributions to add new targets!

t55•8mo ago
> I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands RL tasks (without any proof whatsoever, so rather a feeling).

Given that GDM pioneered RL, that's a reasonable assumption

flowerthoughts•8mo ago
Assuming with GDM, you mean Google-Deep Mind. They pioneered RL with deep nets as policy function estimator. The deep nets being a result of CNNs and massive improvements in hardware parallelization at the time.

RL was established, at the latest, with Q-learning in 1989: https://en.wikipedia.org/wiki/Q-learning

t55•8mo ago
i didn't say they invented everything; in science you always stand on the shoulders of giants

i still think my original statement is fair

lechatonnoir•8mo ago
"gdm pioneered rl" is definitely not actually right, but it's correct to assert that they were huge players.

people who knew from context that your statement was broadly not actually right would know what you mean and agree on vibes. people who didn't could reasonably be misled, i think.

olliestanley•8mo ago
We definitely plan to maintain the project for as long as there is interest in it. If you have ideas for new tasks, we'd always welcome contributions!
phh•8mo ago
Thanks for the answer! As a toy project I implemented wikiracing with trl. I'll probably try to PR that to your gym. (can't say that I managed to improve score with it though)
CuriouslyC•8mo ago
Gemini 2.5 Pro's superiority is IMO largely driven by their long context support and training methodology. Compare Gemini as a beta reader for a 100k token book with GPT4.1 or Claude 4, and it becomes quite clear how much more effectively it can reason across its context than other comparable models. This also makes it much better for architecting new features into a system, since you can load a lot of the current system into the context and it'll conform to existing styles and architecture patterns more closely.
jacob019•8mo ago
Agreed, 2.5 flash too. I analyze a large json document of metrics for pricing decisions. Typically around 200k, occtionallly up to 1M, Gemini 2.5 significantly outperforms for my task. It isn't 100%, but role playing gets close. I suppose that's a form of inference time compute.
t55•8mo ago
For a 100k token context window; all those models are comparable though

gemini 2.5 pro shines for 200k+ tokens

CuriouslyC•8mo ago
I can confirm from first hand experience that even at 100k they are most definitely not comparable for the task of beta reading.
throwaway314155•8mo ago
splitting hairs much?
ninakostoska•8mo ago
Cool to see NVIDIA’s most recent reasoning model [1] already uses Reasoning Gymas a large part of their data mixture

[1] https://arxiv.org/abs/2505.24864

t55•8mo ago
> prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling

does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?

yorwba•8mo ago
No, they do not point to any specific examples of novel reasoning strategies that were uncovered, nor is their sampling that extensive (at most 256 samples vs the 2048 used in https://limit-of-rlvr.github.io/ ).
t55•8mo ago
so you think it's fake news? another example of a paper with strong claims without much evidence?
yorwba•8mo ago
I think it's a case of not coming up with alternative explanations for the observed evidence and hence not designing experiments to distinguish between those explanations.

Their results are consistent with novel reasoning strategies, but they're also consistent with more reliable execution of reasoning strategies that the base model can generate in principle, but rarely succeeds at due to a large number of steps. (If you have a model that can do each step independently with 99% success rate and getting the correct result requires 1000 steps, the chance of making it all the way to the end without a single error is only about 0.004%.)

psb217•8mo ago
One challenge with this line of argument is that the base model assigns non-zero probability to all possible sequences if we ignore truncation due to numerical precision. So, in a sense you could say any performance improvement is due to shifting probability mass towards good reasoning behaviors and away from bad ones that were already present in the base model.

I agree with your general point though. Ie, we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And, current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".

yorwba•8mo ago
While it's true that the model assigns non-zero probabilities to all sequences by design, those probabilities can get a lot smaller. E.g. replace that 99% per-step success probability with 10% and suddenly the overall chance of a correct result is truly astronomically small.

For a novel reasoning strategy, I would expect at least a few individual tokens where the base model assigns much smaller probabilities than the reinforcement-learning trained one, as opposed to just being a little smaller but spread out over many tokens. (Which would better fit a "death by a thousand cuts" scenario.)

grad62304977•8mo ago
Seems unreasonable to say that in figure 5 for example, that more sampling (of a reasonable amount) would push the base to 100%
jimmySixDOF•8mo ago
RL is proving to be a weird science lately :

>Spurious Rewards: Rethinking Training Signals in RLVR ### *TL;DR* We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains.

All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:

- RLVR + format reward (reward responses with `\boxed{}`): *+16.4%* - RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%* - RLVR + random reward: *+21.4%* - (as a reference) RLVR + ground-truth reward: + 28.8%

How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

>Learning to Reason without External Rewards Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]

[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking... [2] https://arxiv.org/abs/2505.19590

t55•8mo ago
yeah, RLVR is still nascent and hence there's lots of noise.

> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

it's because in those cases, RLVR merely elicits the reasoning strategies already contained in the model through pre-training

this paper, which uses Reasoning gym, shows that you need to train for way longer than those papers you mentioned to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864

spmurrayzzz•8mo ago
I think the fact that spurious rewards were predominantly only effective for Qwen may suggest that it was triggering some shift in its language distribution. If you use those models long enough you'll see a ton of mandarin that makes its way into your outputs, and their logits tend to look more "confident" than the ones for english tokens.

So the reward value shifting may act as a sort of unintentional regularization technique (similar to adding noise to the discriminator input in GAN archs).

sadboots•8mo ago
by the love of god, please stop overfitting on gsm8k
i5heu•8mo ago
It looks like your neural network is overfitted on seeing overfitt where is none.

Prejudices is a form of overfitting IMHO

t55•8mo ago
agree, the RG evals feel like a fresh breeze
olliestanley•8mo ago
Difficult one. GSM8K and MATH evals (both reported in Reasoning Gym paper) are common in smaller model RL papers for a reason, which is that smaller models can get decent scores on them, unlike fresher & harder benchmarks.

Part of the aim of RG is to be used as a difficulty-adjustable & non-repeating eval though so if people think it's a good benchmark, perhaps it will allow this status quo to shift!