VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

https://arxiv.org/abs/2606.16140
46•timhigins•2h ago

Comments

aero2146•1h ago
I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...
realitysballs•1h ago
That’s all I needed to hear
pylotlight•36m ago
As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?
physPop•1h ago
It's for reasoning, not generating art?
websap•1h ago
Can you explain this a bit more?
tyre•41m ago
Imagine you want to make a smaller model that is really good at one thing, say, driving a car. You could remove the parameters that lead it to correctly answer, "What is the powerhouse of the cell?" or, "Who was the first president of the United States?"

It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.

pylotlight•35m ago
SVG generation is a useless test, what's there more to know?
steve_adams_86•14m ago
What if you're reasoning about how to generate SVG correctly?
fwipsy•49m ago
I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."

> these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.

pylotlight•36m ago
The only really essential item here is tool-calling capability, is it not? So I assume they tested for consistent use of read/write/edit tools?
nsingh2•12m ago
This model doesn't support tool calling; that was not part of its training. It's focused on Python (and I think C++) competitive programming and mathematics tasks, i.e., tasks with verifiable rewards. So if you have a task that fits that description, the size-to-capability ratio is good.
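To make "verifiable rewards" concrete, here is a minimal sketch, assuming a binary pass/fail reward on test cases and the commonly described GRPO-style normalization of rewards within a group of sampled answers. The task, the candidate functions, and the helper names are all hypothetical, and the paper's actual training recipe may differ.

```python
import statistics

def verifiable_reward(candidate_fn, test_cases):
    """Return 1.0 if the candidate passes every (args, expected) case, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            # A crashing solution counts as a failure.
            return 0.0
    return 1.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical task: add two integers. Three sampled "solutions", one of them wrong.
tests = [((1, 2), 3), ((-1, 1), 0), ((10, 5), 15)]
candidates = [lambda a, b: a + b, lambda a, b: a - b, lambda a, b: a + b]

rewards = [verifiable_reward(c, tests) for c in candidates]
advantages = group_relative_advantages(rewards)
```

The point of the sketch is that the reward needs no human judgment: the answer either passes the checks or it doesn't, which is what makes math and competitive programming convenient training targets.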
btown•11m ago
I'm not seeing any mention of tools in the paper, much less a bias towards "curiosity" to use those tools when it encounters gaps in its knowledge. So perhaps this is a good proof-of-concept that single-pass code generation is viable with this small a model - but we're still a long way from a viable solution.
noperator•54m ago
Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
dummydummy1234•31m ago
Can't you just force it to do structured output via constrained generation?
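For anyone unfamiliar with the idea, a toy sketch of constrained generation follows: at each step, tokens the schema does not allow are masked out before picking the next token. The vocabulary, the state machine, and the random "model" are all made up for illustration; real inference servers implement this with their own grammar or JSON-schema backends, and this is not any particular library's API.

```python
import json
import math
import random

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"label"', ':', '"safe"', '"vulnerable"', 'hello', '<eos>']

# Hand-written state machine for the schema {"label": "safe" | "vulnerable"}.
# Maps state -> {allowed token: next state}.
FSM = {
    0: {'{': 1},
    1: {'"label"': 2},
    2: {':': 3},
    3: {'"safe"': 4, '"vulnerable"': 4},
    4: {'}': 5},
    5: {'<eos>': 6},
}
FINAL_STATE = 6

def fake_logits(rng):
    """Stand-in for a model forward pass: random scores over VOCAB."""
    return [rng.uniform(-1.0, 1.0) for _ in VOCAB]

def constrained_generate(seed=0):
    """Greedy decoding where disallowed tokens are masked to -inf each step."""
    rng = random.Random(seed)
    state, out = 0, []
    while state != FINAL_STATE:
        logits = fake_logits(rng)
        allowed = FSM[state]
        masked = [score if tok in allowed else -math.inf
                  for tok, score in zip(VOCAB, logits)]
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        tok = VOCAB[best]
        if tok != '<eos>':
            out.append(tok)
        state = allowed[tok]
    # The output parses by construction, whatever the "model" preferred.
    return json.loads(''.join(out))
```

Masking guarantees well-formed output, but it cannot make the content correct; a model that is weak at structured output may still pick a poor value among the allowed ones.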
gslepak•32m ago
Note that these are Python-only results; the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

deftio•19m ago
There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid-teens they have acquired the base knowledge...

Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

SwellJoe•14m ago
It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

nsingh2•3m ago
The lack of tool use will hinder it a lot, I think, since bug hunting requires collecting context across a code base and stitching it together. It might be good in a narrower sense, i.e., answering "is there a bug in this block of code" without considering how that block interacts with the rest of the code base.

Steam Machine launches today

https://store.steampowered.com/news/group/45479024/view/685257114654870245
1345•theschwa•11h ago•1210 comments

GLM-5.2 – How to Run Locally

https://unsloth.ai/docs/models/glm-5.2
252•TechTechTech•7h ago•111 comments

In praise of memcached

https://jchri.st/blog/in-praise-of-memcached/
78•j03b•3h ago•27 comments

Polymarket has flooded social media with deceptive videos by paid creators

https://www.wsj.com/business/media/polymarket-social-media-bets-prediction-market-441cdeb5?st=HhTZY2
88•Vaslo•2d ago•87 comments

An Introduction to YOLO26

https://blog.roboflow.com/yolo26/
19•teleforce•2h ago•0 comments

Cyberdecks, going analog, and convivial technology

https://blog.hydroponictrash.solar/cyberdecks-going-analog-and-convivial-technology/
67•akkartik•3d ago•30 comments

Optocam Zero: a Pi Zero based digital camera made using off the shelf components

https://github.com/dorukkumkumoglu/optocamzero
129•iamnothere•9h ago•31 comments

My Mathematical Regression

https://blog.dahl.dev/posts/my-mathematical-regression/
239•aleda145•3d ago•88 comments

Japanese symbols that speak without words

https://arun.is/blog/japan-symbols/
133•msephton•9h ago•56 comments

Windows NT for GameCube/Wii

https://github.com/Wack0/entii-for-workcubes
34•zdw•3d ago•6 comments

1,700 free online courses from top universities

https://www.openculture.com/freeonlinecourses
100•momentmaker•2h ago•20 comments

Giant Banana Pulled Over: Driver Says Cops Have Stopped Him 100s of Times

https://cowboystatedaily.com/2026/06/18/giant-banana-pulled-over-in-montana-driver-says-cops-have...
10•speckx•2d ago•1 comments

Moebius: 0.2B image inpainting model with 10B-level performance

https://hustvl.github.io/Moebius/
251•DSemba•14h ago•65 comments

Is it time for a new Embedded Linux build system?

https://yoebuild.org/blog/time-for-a-new-build-system/
51•cbrake•4d ago•36 comments

Canada plans 'nuclear renaissance' with up to 10 reactors built by 2040

https://www.cbc.ca/news/politics/federal-nuclear-strategy-9.7244509
374•geox•9h ago•227 comments

British Columbia, Time Zones, and Postgres

https://www.crunchydata.com/blog/british-columbia-and-time-zone-changes
124•sprawl_•9h ago•82 comments

Show HN: Oak – Git alternative designed for agents

https://oak.space/oak/oak
163•zdgeier•12h ago•154 comments

Package Managers need global hooks

https://captnemo.in/blog/2026/06/17/package-managers-need-hooks/
6•evakhoury•4d ago•1 comments

Kyber (YC W23) Is Hiring a Head of Engineering

https://www.ycombinator.com/companies/kyber/jobs/FGmI8mx-head-of-engineering
1•asontha•7h ago

Canyon HUD helmet for road riding

https://media-centre.canyon.com/en-INT/266866-new-canyon-heads-up-display-helmet-could-be-a-safet...
77•zh3•2d ago•91 comments

Flock-Powered Police Chiefs Stalking Women Shows Why Warrants Are Needed

https://ipvm.com/reports/police-chiefs-track
434•jhonovich•9h ago•174 comments

Show HN: Pagecast – Publish Markdown/HTML Reports to Cloudflare Pages

https://github.com/Amal-David/pagecast
37•amaldavid•4d ago•9 comments

ytr: YouTube Radio for Emacs

https://xenodium.com/ytr-youtube-radio-for-emacs
75•xenodium•7h ago•8 comments

Job application asked for my SAT scores

https://mrmarket.lol/job-application-asked-for-my-sat-scores/
108•seltzerboys•7h ago•268 comments

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

https://arxiv.org/abs/2606.03748
10•teleforce•1h ago•0 comments

Help I accidentally a wigglegram

https://lmao.center/blog/wiggle-accidents/
504•gregsadetsky•3d ago•120 comments

Prompt Injection as Role Confusion

https://role-confusion.github.io
166•x312•12h ago•89 comments

Show HN: Got sick of ads, so I made my own logic puzzle site

https://puzzlelair.com/
157•HaxleRose•16h ago•105 comments

SpaceX sheds $400B in market value as debut rally hits reverse

https://www.ft.com/content/c11d08ed-6668-4678-b829-1d50acbd12d4
48•simonpure•2h ago•39 comments