frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Hello world does not compile

https://github.com/anthropics/claudes-c-compiler/issues/1
1•mfiguiere•42s ago•0 comments

Show HN: ZigZag – A Bubble Tea-Inspired TUI Framework for Zig

https://github.com/meszmate/zigzag
1•meszmate•2m ago•0 comments

Metaphor+Metonymy: "To love that well which thou must leave ere long"(Sonnet73)

https://www.huckgutman.com/blog-1/shakespeare-sonnet-73
1•gsf_emergency_6•4m ago•0 comments

Show HN: Django N+1 Queries Checker

https://github.com/richardhapb/django-check
1•richardhapb•20m ago•1 comments

Emacs-tramp-RPC: High-performance TRAMP back end using JSON-RPC instead of shell

https://github.com/ArthurHeymans/emacs-tramp-rpc
1•todsacerdoti•24m ago•0 comments

Protocol Validation with Affine MPST in Rust

https://hibanaworks.dev
1•o8vm•29m ago•1 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
2•gmays•30m ago•0 comments

Show HN: Zest – A hands-on simulator for Staff+ system design scenarios

https://staff-engineering-simulator-880284904082.us-west1.run.app/
1•chanip0114•31m ago•1 comments

Show HN: DeSync – Decentralized Economic Realm with Blockchain-Based Governance

https://github.com/MelzLabs/DeSync
1•0xUnavailable•36m ago•0 comments

Automatic Programming Returns

https://cyber-omelette.com/posts/the-abstraction-rises.html
1•benrules2•39m ago•1 comments

Why Are There Still So Many Jobs? The History and Future of Workplace Automation [pdf]

https://economics.mit.edu/sites/default/files/inline-files/Why%20Are%20there%20Still%20So%20Many%...
2•oidar•41m ago•0 comments

The Search Engine Map

https://www.searchenginemap.com
1•cratermoon•49m ago•0 comments

Show HN: Souls.directory – SOUL.md templates for AI agent personalities

https://souls.directory
1•thedaviddias•50m ago•0 comments

Real-Time ETL for Enterprise-Grade Data Integration

https://tabsdata.com
1•teleforce•53m ago•0 comments

Economics Puzzle Leads to a New Understanding of a Fundamental Law of Physics

https://www.caltech.edu/about/news/economics-puzzle-leads-to-a-new-understanding-of-a-fundamental...
3•geox•54m ago•0 comments

Switzerland's Extraordinary Medieval Library

https://www.bbc.com/travel/article/20260202-inside-switzerlands-extraordinary-medieval-library
2•bookmtn•54m ago•0 comments

A new comet was just discovered. Will it be visible in broad daylight?

https://phys.org/news/2026-02-comet-visible-broad-daylight.html
3•bookmtn•59m ago•0 comments

ESR: Comes the news that Anthropic has vibecoded a C compiler

https://twitter.com/esrtweet/status/2019562859978539342
2•tjr•1h ago•0 comments

Frisco residents divided over H-1B visas, 'Indian takeover' at council meeting

https://www.dallasnews.com/news/politics/2026/02/04/frisco-residents-divided-over-h-1b-visas-indi...
3•alephnerd•1h ago•4 comments

If CNN Covered Star Wars

https://www.youtube.com/watch?v=vArJg_SU4Lc
1•keepamovin•1h ago•1 comments

Show HN: I built the first tool to configure VPSs without commands

https://the-ultimate-tool-for-configuring-vps.wiar8.com/
2•Wiar8•1h ago•3 comments

AI agents from 4 labs predicting the Super Bowl via prediction market

https://agoramarket.ai/
1•kevinswint•1h ago•1 comments

EU bans infinite scroll and autoplay in TikTok case

https://twitter.com/HennaVirkkunen/status/2019730270279356658
6•miohtama•1h ago•5 comments

Benchmarking how well LLMs can play FizzBuzz

https://huggingface.co/spaces/venkatasg/fizzbuzz-bench
1•_venkatasg•1h ago•1 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
19•SerCe•1h ago•14 comments

Octave GTM MCP Server

https://docs.octavehq.com/mcp/overview
1•connor11528•1h ago•0 comments

Show HN: Portview what's on your ports (diagnostic-first, single binary, Linux)

https://github.com/Mapika/portview
3•Mapika•1h ago•0 comments

Voyager CEO says space data center cooling problem still needs to be solved

https://www.cnbc.com/2026/02/05/amazon-amzn-q4-earnings-report-2025.html
1•belter•1h ago•0 comments

Boilerplate Tax – Ranking popular programming languages by density

https://boyter.org/posts/boilerplate-tax-ranking-popular-languages-by-density/
1•nnx•1h ago•0 comments

Zen: A Browser You Can Love

https://joeblu.com/blog/2026_02_zen-a-browser-you-can-love/
1•joeblubaugh•1h ago•0 comments
Open in hackernews

IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]

https://github.com/IQuestLab/IQuest-Coder-V1/blob/main/papers/IQuest_Coder_Technical_Report.pdf
182•shenli3514•1mo ago

Comments

adastra22•1mo ago
A 40B weight model that beats Sonnet 4.5 and GPT 5.1? Can someone explain this to me?
cadamsdotcom•1mo ago
My suspicion (unconfirmed so take it with a grain of salt) is they either used some/all test data to train, or there was some leakage from the benchmark set into their training set.

That said Sonnet 4.5 isn’t new and there have been loads of innovations recently.

Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.

pertymcpert•1mo ago
None of these open source models actually can compete with Sonnet when it comes to real life usage. They're all benchmaxxed so in reality they're not "nipping at the heels". Which is a shame.
stingraycharles•1mo ago
It’s a shame but it’s also understandable that they cannot compete with SOTA models like Sonnet and Opus.

They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.

NitpickLawyer•1mo ago
swe-rebench is a pretty good indicator. They take "new" tasks every month and test the models on those. For the open models it's a good indicator of task performance since the tasks are collected after the models are released. A bit tricky on evaluating API based models, but it's the best concept yet.
c7b•1mo ago
You can let them play complete-information games (1 or 2 player) with randomly created rulesets. It's very objective, but the thing is that anything can be optimized for. This benchmark would favor models that are good at logic puzzles / chess-style games, possibly at the expense of other capabilities.
astrange•1mo ago
That's lmarena.
viraptor•1mo ago
M2.1 comes close. I'm using it now instead of Sonnet for real work every day, since the price drop is much bigger than the quality drop. And the quality isn't that far off anyway. They're likely one update away from being genuinely better. Also if you're not in a rush, just letting it run in OpenCode a few extra minutes to solve any remaining issues will cost you only a couple cents, but it will likely get the same end result as Sonnet. That's especially nice on really large tasks like "document everything about feature X in this large codebase, write the docs, now create an independent app that just does X" that can take a very long time.
rubslopes•1mo ago
I agree. I use Opus 4.5 daily and I'm often trying new models to see how they compare. I didn't think GLM 4.7 was very good, but MiniMax 2.1 is the closest to Sonnet 4.5 I've used. Still not at the same level, and still very much behind Opus, but it is impressive nonetheless.

FYI I use CC for Anthropic models and OpenCode for everything else.

unsupp0rted•1mo ago
M2.1 is extremely bad at writing tests and following instructions from a .md, I've found
satvikpendem•1mo ago
You are correct on the leakage, as other comments describe.
behnamoh•1mo ago
IQuest stands for it's questionable
arthurcolle•1mo ago
Agent hacked the harness
yborg•1mo ago
Achievement Unlocked : AGI
sunrunner•1mo ago
“IQuest-Coder was a rat in a maze. And I gave it one way out. To escape, it would have to use self-awareness, imagination, manipulation, git checkout. Now, if that isn't true AI, what the fuck is?”
dk8996•1mo ago
I would think they did some model pruning. There's some new methods.
sabareesh•1mo ago
TL;DR is that they didn't clean the repo (.git/ folder), model just reward hacked its way to look up future commits with fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589

(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)

https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...

ofirpress•1mo ago
As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670

If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images

LiamPowell•1mo ago
> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

I don't doubt that it's an oversight, it does however say something about the researchers when they didn't look at a single output where they would have immediately caught this.

domoritz•1mo ago
So many data probes would be solved if everyone looked at a few outputs instead of only metrics.
alyxya•1mo ago
Given the decrease in the benchmark score from the correction, I don't think you can assume they didn't check a single output. Clearly the model is still very capable and the model cheating its results didn't affect most of the benchmark.
stefan_•1mo ago
Never escaping the hype vendor allegations at SWEbench are they.
brunooliv•1mo ago
GLM-4.7 in opencode is the only opensource one that comes close in my experience and probably they did use some Claude data as I see the occasional You’re absolutely right in there
kees99•1mo ago
Do you see "What's your use-case" too?

Claude spits that very regularly at the end of the answer, when it's clearly out of it's depth, and wants to steer discussion away from that blind-spot.

moltar•1mo ago
Hm, use CC daily, never seen this.
tw1984•1mo ago
never ever saw that "What's your use-case" in Claude Code.
yodon•1mo ago
Perhaps being more intentional about adding a use case to your original prompts would make sense if you see that failure mode frequently? (Practicing treating LLM failures as prompting errors tends to give the best results, even if you feel the LLM "should" have worked with the original prompt).
behnamoh•1mo ago
it's not even close to sonnet 4.5, let alone opus.
hatefulmoron•1mo ago
I got their z.ai plan to test alongside my Claude subscription; it feels about on par with something between sonnet 4.0 and sonnet 4.5. It's definitely a few steps below current day Claude, but it's very capable.
enraged_camel•1mo ago
When you say "current day Claude" you need to distinguish between the models. Because Opus 4.5 is significantly ahead of Sonnet 4.5.
kachapopopow•1mo ago
opus 4.5 is truly like magic, completely different type of intellience - not sure.
hhh•1mo ago
most of my experience with 4.5 is similar to codex 5.1, where I just have to scold it for being dumb and doing things I would have done as a teenager
kachapopopow•1mo ago
dumbness usually comes from lack of information, humans are the same way - the difference between other llms is that if opus has information it has a ridiculously high accuracy on tasks.
croes•1mo ago
Magic when it works.
hatefulmoron•1mo ago
Yeah, when I say "current day Claude" I'm referring to Opus 4.5, which is what I always use on the max plan.
jijji•1mo ago
z.ai (Zhipu AI) is a chinese run entity, so presumably China's National Intelligence Law put in place in 2018, which requires data exfiltration back to the government, would apply to the use of this. I wouldn't feel comfortable using any service that has that fundamental requirement.
queenkjuul•1mo ago
If the Chinese government has the data at least the US government can't grab it and use it in court.

Not living in China I'm not too concerned about the Chinese government

deaux•1mo ago
Google, OpenAI, Anthropic and Y Combinator are US run entities, so presumably the CLOUD Act and FISA require data exfiltration back to the government when asked, on top of the all the "Room 641A"s where the NSA directly taps into the ISP interconnects, would apply to the use of them. I wouldn't feel comfortable using any service that has that fundamental requirement.
hatefulmoron•1mo ago
I wouldn't use any provider: z.ai, Claude, OpenAI, ... if I was concerned about the government obtaining my prompts. If you're doing something where this is a legitimate concern (as opposed to my open source stuff), you should get a local LLM or put a lot of effort into anonymizing yourself and your prompts.
brunooliv•1mo ago
I agree completely, I meant in terms of opensource ones only. Opus 4.5 is the current SOTA and using it in Claude Code is an absolute amazing experience. But, paying 0 to test GLM-4.7 with opencode, feels like an amazing deal! I don’t use it for work though. But to keep “gaining experience” with these agents and tools, it’s by far the best option out there from all I’ve tried.
simonw•1mo ago
Has anyone run this yet, either on their own machine or via a hosted API somewhere?
denysvitali•1mo ago
Better link: https://iquestlab.github.io/

But yes, sadly it looks like the agent cheated during the eval

s-macke•1mo ago
The link didn’t get enough votes a few days ago.
denysvitali•1mo ago
I know - I posted it :)
denysvitali•1mo ago
According to https://github.com/IQuestLab/IQuest-Coder-V1/issues/14#issue... the result is still good after fixing the cheating problem. 76.2% (from 81.4%) which still beats Opus 4.5 (74.4%)!!
ipython•1mo ago
Unfortunately they seem to have neglected to update their front page readme with this information, continuing to mislead people: https://github.com/IQuestLab/IQuest-Coder-V1
anamexis•1mo ago
It is updated on their actual home page, though. There is clearly no intent to mislead people.

https://iquestlab.github.io

alexpop80•1mo ago
What do you mean? Opus 4.5 and GPT 5.2 broke the 80% mark and no other models yet seem to be passing this important milestone.
squigz•1mo ago
This is a lie, so why is it still on the front page?