IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]

https://github.com/IQuestLab/IQuest-Coder-V1/blob/main/papers/IQuest_Coder_Technical_Report.pdf
35•shenli3514•3h ago

Comments

adastra22•1h ago
A 40B-parameter model that beats Sonnet 4.5 and GPT 5.1? Can someone explain this to me?
cadamsdotcom•1h ago
My suspicion (unconfirmed, so take it with a grain of salt) is that they either used some or all of the test data for training, or there was some leakage from the benchmark set into their training set.
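One rough way to probe that kind of leakage, if you had access to the training corpus, is to measure how many of a benchmark item's word n-grams also appear in the training data. The sketch below is illustrative only: it assumes whitespace tokenization and an in-memory corpus, and the names `ngrams` and `contamination_ratio` and the default `n=8` are hypothetical choices, not anything IQuestLab or the benchmark maintainers describe.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    # Whitespace tokenization is a simplifying assumption; real checks
    # usually normalize text and use the model's own tokenizer.
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(benchmark_item: str, training_corpus: list[str], n: int = 8) -> float:
    # Fraction of the item's n-grams that also occur somewhere in the corpus.
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A ratio near 1.0 suggests the item (or something very close to it) was in the training data; a low ratio does not rule out paraphrased leakage.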

That said, Sonnet 4.5 isn't new, and there have been loads of innovations recently.

Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.

pertymcpert•1h ago
None of these open-source models can actually compete with Sonnet when it comes to real-life usage. They're all benchmaxxed, so in reality they're not "nipping at the heels". Which is a shame.
stingraycharles•1h ago
It’s a shame but it’s also understandable that they cannot compete with SOTA models like Sonnet and Opus.

They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.
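A common way to score that kind of head-to-head setup is an Elo-style rating, where each pairwise result nudges both models' ratings toward the observed outcome. This is a minimal sketch assuming the standard Elo formulas with a K-factor of 32; it is not a description of any specific leaderboard, and `update_elo` is a hypothetical helper name.

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    # score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a B win.
    delta = k * (score_a - expected_score(r_a, r_b))
    # Rating is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta
```

For example, two models starting at 1500 where A wins one match end up at 1516 and 1484.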

NitpickLawyer•54m ago
swe-rebench is a pretty good indicator. They take "new" tasks every month and test the models on those. For the open models it's a good indicator of task performance, since the tasks are collected after the models are released. It's a bit trickier for evaluating API-based models, but it's the best concept yet.
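The core idea, only scoring a model on tasks created after it was released, can be sketched as a simple date filter. The `Task` record and `fresh_tasks` function are hypothetical illustrations, not swe-rebench's actual data format or code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Task:
    task_id: str
    created: date  # when the task (e.g. the issue/PR) was collected


def fresh_tasks(tasks: list[Task], model_release: date) -> list[Task]:
    # Keep only tasks strictly newer than the model's release, so they
    # cannot have appeared in its training data.
    return [t for t in tasks if t.created > model_release]
```

This is also why API-based models are harder to evaluate this way: if the provider silently updates the model behind the same name, the "release date" cutoff is no longer well defined.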
behnamoh•8m ago
IQuest stands for it's questionable
sabareesh•1h ago
TL;DR: they didn't clean the repo (the .git/ folder), so the model just reward-hacked its way to looking up future commits containing the fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589

(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)

https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
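The leak boils down to a reachability question: are there commits reachable from some ref in the task's repo that are not ancestors of the checked-out base commit? With git itself, `git rev-list --all --not HEAD` lists exactly those. The sketch below expresses the same check over a plain commit graph so the logic is visible; the dict-based graph and the names `reachable` and `leaked_commits` are assumptions for illustration, and it ignores objects only reachable via the reflog or as dangling objects.

```python
def reachable(start: str, parents: dict[str, list[str]]) -> set[str]:
    # Depth-first walk from a commit through all of its ancestors.
    seen: set[str] = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def leaked_commits(parents: dict[str, list[str]], refs: dict[str, str], head: str) -> set[str]:
    # Commits reachable from any ref but not from HEAD: these are the
    # "future" commits an agent could read the fix from.
    base = reachable(head, parents)
    leaked: set[str] = set()
    for tip in refs.values():
        leaked |= reachable(tip, parents) - base
    return leaked
```

For an eval harness, a non-empty result means the repo should be stripped (or re-cloned at the base commit only) before the agent sees it.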

ofirpress•1h ago
As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670

If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images

LiamPowell•50m ago
> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

I don't doubt that it's an oversight. It does, however, say something about the researchers that they didn't look at a single output, where they would have immediately caught this.

brunooliv•53m ago
GLM-4.7 in opencode is the only open-source one that comes close in my experience, and they probably did use some Claude data, as I see the occasional "You're absolutely right" in there.
kees99•18m ago
Do you see "What's your use-case" too?

Claude spits that out very regularly at the end of an answer when it's clearly out of its depth and wants to steer the discussion away from that blind spot.

behnamoh•9m ago
It's not even close to Sonnet 4.5, let alone Opus.
simonw•44m ago
Has anyone run this yet, either on their own machine or via a hosted API somewhere?
denysvitali•21m ago
Better link: https://iquestlab.github.io/

But yes, sadly it looks like the agent cheated during the eval

The Internet's Last Bottleneck

https://replicated.wiki/blog/args.html
1•gritzko•7m ago•0 comments

At least 7 explosions and low-flying aircraft are heard in Venezuela's Caracas

https://apnews.com/article/venezuela-us-explosions-caracas-ca712a67aaefc30b1831f5bf0b50665e
2•mudil•8m ago•0 comments

I built cloud transcoding 50x cheaper than AWS/Coconut/others – anyone need it?

1•souvlakee•10m ago•0 comments

Programmatic Tool Calling for Agents

https://github.com/zeke-john/codecall
1•zekejohn•12m ago•1 comments

Crypto users forced to share account details with tax officials

https://www.bbc.com/news/articles/ckgl2je65klo
1•chistev•12m ago•0 comments

Rkgk UI — A low latency Digital Art software on the browser

https://github.com/michael-0acf4/rkgk
1•michael-0acf4•15m ago•0 comments

Vaping safer than smoking – so why are people struggling to quit e-cigarettes?

https://www.theguardian.com/society/2026/jan/03/vaping-safer-than-smoking-so-why-are-people-strug...
1•beardyw•16m ago•0 comments

Visualise a Pcap with Zeek and Grafana

https://hackertarget.com/zeek-dashboard-grafana/
1•the_wanderer•20m ago•0 comments

Show HN: Griblet – Free browser-based GRIB file viewer

https://griblet.app
1•itsangaris•20m ago•0 comments

Ask HN: Where can I buy uninhabited islands in the Seychelles or Caribbean?

1•roschdal•21m ago•0 comments

Formal Methods Only Solve Half My Problems

https://brooker.co.za/blog/2022/06/02/formal.html
2•signa11•28m ago•0 comments

Multiple explosions in Venezuela's capital Caracas

https://www.cnn.com/2026/01/03/americas/venezuela-explosions-intl-hnk
31•breppp•28m ago•11 comments

The Case for Blogging in the Ruins

https://www.joanwestenberg.com/the-case-for-blogging-in-the-ruins/
2•ingve•31m ago•0 comments

"Compass nerd" creates new damped compass, gives to public domain

https://www.youtube.com/watch?v=eiDhbZ8-BZI
2•tomcam•32m ago•0 comments

Google Skills

https://www.skills.google/
2•modinfo•32m ago•0 comments

Ask HN: Expository/Succinct Books on Modern Physics

1•rramadass•42m ago•0 comments

Loud noises heard in Venezuela capital, southern area without electricity

https://www.reuters.com/world/americas/loud-noises-heard-venezuela-capital-southern-area-without-...
20•jumpocelot•42m ago•4 comments

"Nobody should stop these experiments from happening"

https://english.elpais.com/climate/2025-12-31/david-king-chemist-there-are-scientists-studying-ho...
2•dr_dshiv•43m ago•0 comments

Norwich firm creates 'virus cocktails' to help farmers

https://www.edp24.co.uk/news/25712243.norwich-firm-creates-virus-cocktails-help-farmers/
1•austinallegro•45m ago•0 comments

Show HN: Image to PDF – client-side image to PDF converter

https://github.com/ivanglpz/v0-image-pdf
2•ivanglpz•47m ago•0 comments

Uber rewrites contracts with drivers to avoid paying UK's new 'taxi tax'

https://www.theguardian.com/technology/2026/jan/02/uber-avoids-new-uk-taxi-tax-rewriting-driver-c...
3•Brajeshwar•48m ago•0 comments

I got 500 devs to try my AI app builder in a week (solo founder, no ads)

https://www.indiehackers.com/post/how-i-got-500-devs-to-try-my-ai-app-builder-in-a-week-solo-foun...
1•genvibe•48m ago•0 comments

Brainradio.fm

https://brainradio.fm/
1•gusk•52m ago•0 comments

Show HN: Underserved Directory Niches (Hand-picked for 2026)

https://directoryideas.ai/underserved-niches
1•tejas3732•1h ago•0 comments

America's toughest privacy protections have kicked in

https://www.washingtonpost.com/technology/2026/01/02/delete-information-data-brokers-california/
1•1vuio0pswjnm7•1h ago•0 comments

Benchmarking Windows Against Itself, from Windows XP to Windows 11

https://hackaday.com/2026/01/02/benchmarking-windows-against-itself-from-windows-xp-to-windows-11/
5•anonymousiam•1h ago•0 comments

Math of Vector graphics on GPU, inspired by piet-GPU

https://gasiulis.name/vector-graphics-on-gpu/
1•gsf_emergency_6•1h ago•0 comments

China's BYD overtakes Tesla as top EV seller

https://www.cnbc.com/2026/01/02/chinas-byd-to-overtake-tesla-as-worlds-top-ev-seller-for-first-ti...
3•1vuio0pswjnm7•1h ago•0 comments

The Constitution Was a Coup

https://lucasvance.github.io/2100/coup/
3•sirponm•1h ago•0 comments

A Go question: how do you test select based code?

https://utcc.utoronto.ca/~cks/space/blog/programming/GoTestingSelectQuestion
3•Twirrim•1h ago•0 comments