GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [pdf]

https://www.arxiv.org/pdf/2508.06471
169•SerCe•4h ago

Comments

ttul•3h ago
This feels like the first open model that doesn’t require significant caveats when comparing to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results, and to see how it does on Aider’s LLM Leaderboard.
lumost•3h ago
Why was Qwen3 omitted from the coding benchmark but not from the other benchmarks?
coder543•2h ago
Section 4.3.2 includes Qwen3-Coder.
Reubend•3h ago
Fantastic release, and it's under the Apache license too. I'm so happy that we've got open source models pushing the envelope.
darknoon•3h ago
It's OK on visual reasoning, somewhere between Qwen 2.5 VL and the frontier models (o3 / Opus 4).
reissbaker•2h ago
I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it enabled and mistaken it for Claude 4. Subjectively, my experience has been:

1. Claude is somewhat better at whole-codebase tasks, where you need to reason over a bunch of context and consider system interactions.

2. GLM-4.5 is somewhat better at being "honest" — i.e. I rarely see it doing the things Claude does like making broken tests pass by changing the test instead of fixing the bug.

Both are quite good though, and GLM-4.5 has found bugs that both Claude 4 Sonnet and 4.1 Opus have failed to catch. In general I think Claude wins a little more frequently on debugging tasks than GLM-4.5, but it's close.

Compared to GPT-5, both Claude and GLM feel more consistent, although GPT-5 sometimes has long, brilliant runs where it nails everything with subjectively higher code quality than either of the other two. However, once GPT-5 goes off the rails, it's hard to get it back on track, so it can be a bit frustrating to work with in comparison.

1: https://github.com/synthetic-lab/octofriend

nico•2h ago
How are you using GLM-4.5? Are you consuming the API or running something like GLM-4.5 Air locally?
reissbaker•2h ago
I run a privacy-focused inference company, Synthetic [1], and I use our API of course :P I actually like GLM-4.5 enough that it's currently our default recommended model for new users. But yes, otherwise I'd most likely use the official Z.ai API, or Fireworks. GLM-4.5-Air is quite good for a local model, but GLM-4.5 is better; whether the tradeoff is worth it is up to you — there's definitely value in the data never leaving your machine, but it's not going to be as strong a model.

1: https://synthetic.new
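
For anyone who wants to try this route: most of the hosted options mentioned here (Synthetic, Z.ai, Fireworks) expose OpenAI-compatible chat-completions endpoints, so a minimal client looks something like the sketch below. The base URL and model identifier are illustrative placeholders, not any particular provider's real values.

    # Minimal sketch: calling GLM-4.5 through an OpenAI-compatible
    # chat-completions endpoint. base_url and model are placeholders;
    # check your provider's docs for the actual values.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
        api_key="YOUR_API_KEY",
    )

    response = client.chat.completions.create(
        model="glm-4.5",  # assumed model identifier; varies by provider
        messages=[
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": "Explain what this function does."},
        ],
    )
    print(response.choices[0].message.content)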

azinman2•9m ago
I’m curious about your service: if it’s centered on privacy, why is the data stored for 14 days at all? My understanding with Fireworks is that it’s zero logging — nothing to store. To me that’s private.
sagarpatil•1h ago
Not OP. Chutes.ai charges $0.20 per 1M tokens. I don’t think it uses caching, though, because I ended up burning $30 in an hour or two. I had to move back to Claude Code.
esafak•48m ago
Caching makes price comparisons hard. Does anyone have tips?
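
One rough approach: fold the cache discount into a blended per-token price. A minimal sketch with made-up numbers (the prices and hit rate below are illustrative, not any provider's actual pricing):

    # Blended input price per 1M tokens when cached prompt tokens are
    # discounted. All numbers are hypothetical.
    def effective_input_price(base_price, cached_price, cache_hit_rate):
        """Blended $/1M input tokens for a given cache hit rate."""
        return cache_hit_rate * cached_price + (1 - cache_hit_rate) * base_price

    # Example: $3.00/1M uncached, $0.30/1M cached, 80% of input tokens cached.
    print(effective_input_price(3.00, 0.30, 0.80))  # -> 0.84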
UncleOxidant•2h ago
I just read your comment and decided to give GLM-4.5 a try in Kilocode. I'd been using Gemini CLI all day to try to resolve a tricky bug in some compiler code (a compiler for a subset of C that generates microcode for... a weird architecture, I'll leave it at that). GLM-4.5 zeroed in on the problem right away, a problem that had eluded Gemini CLI all day. Gemini had been leading me on a wild goose chase, implicating a function that turned out not to be the problem and proposing all kinds of lame changes to it that it claimed would fix the bug. They never did, because the problem wasn't in that function.
starchild3001•2h ago
Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these models.

The post-training methodology (Sec 3) is what really stands out to me. The idea of creating specialized 'expert models' for reasoning, agents, and chat, and then distilling their capabilities into a final unified model is a fascinating approach. It feels like a more structured way to solve the "jack of all trades, master of none" problem that can plague generalist models. Instead of just mixing all the data, they're essentially having a generalist learn from a committee of specialists.
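
For readers unfamiliar with the mechanics: one common form of distillation trains the student to match a teacher's temperature-softened output distribution. The sketch below is a generic logit-distillation objective for flavor only; it is not claimed to be the paper's actual expert-to-generalist recipe.

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then penalize the KL divergence of
        # the student from the teacher. Generic recipe, not the paper's.
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        # The t^2 factor keeps gradients comparable across temperatures.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (t * t)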

A couple of the findings from their RL experiments are pure gold for anyone working in this space. The counter-intuitive result that a single-stage RL process at the full 64K context length outperforms a progressive, multi-stage approach (Fig 6) is a fantastic lesson; I've seen teams assume the opposite would be true. Also, the pragmatic choice to use an XML-like template for function calls to avoid JSON escaping hell (Fig 4) is a small but brilliant engineering decision that makes a huge difference in practice. Wrangling escaped code inside JSON turns out to be a mess.
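
To make the escaping point concrete, here's a toy comparison. The XML-ish tag names are illustrative, not the paper's exact template:

    import json

    code = 'print("hello")\nprint("world")'

    # JSON tool call: the code must be escaped into a string literal,
    # so quotes become \" and newlines become \n.
    json_call = json.dumps({
        "name": "write_file",
        "arguments": {"path": "hello.py", "content": code},
    })

    # XML-like tool call: the code appears verbatim between tags, so the
    # model never has to emit escape sequences around it.
    xml_call = (
        '<tool_call name="write_file">\n'
        '<arg name="path">hello.py</arg>\n'
        f'<arg name="content">{code}</arg>\n'
        '</tool_call>'
    )

    print(json_call)
    print(xml_call)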

The performance on SWE-bench is impressive, putting it in the same league as much larger or proprietary models. What I’d love to see, and maybe others here have thoughts, is whether this hybrid training recipe holds up outside ARC-style evals. For example, do the agentic improvements transfer to messier, real-world workflows where APIs are undocumented, partial failures are common, and user input is full of ambiguity?

sagarpatil•1h ago
I’ve been using it and I think it’s on par with Sonnet.
chvid•24m ago
This is a great model for software development - probably the best of the freely available ones.

StarDict sends X11 clipboard to remote servers

https://lwn.net/SubscriberLink/1032732/3334850da49689e1/
36•pabs3•1h ago•12 comments

Wikipedia loses challenge against Online Safety Act

https://www.bbc.com/news/articles/cjr11qqvvwlo
712•phlummox•13h ago•537 comments

All known 49-year-old Apple-1 computers

https://www.apple1registry.com/en/list.html
54•elvis70•3d ago•9 comments

I tried every todo app and ended up with a .txt file

https://www.al3rez.com/todo-txt-journey
933•al3rez•16h ago•558 comments

Weathering Software Winter

https://100r.co/site/weathering_software_winter.html
34•todsacerdoti•2h ago•12 comments

The Article in the Most Languages

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2025-08-09/Disinformation_report
26•vhcr•2d ago•3 comments

GitHub is no longer independent at Microsoft after CEO resignation

https://www.theverge.com/news/757461/microsoft-github-thomas-dohmke-resignation-coreai-team-transition
1087•Handy-Man•14h ago•793 comments

Show HN: I built an offline, open‑source desktop Pixel Art Editor in Python

https://github.com/danterolle/tilf
102•danterolle•7h ago•14 comments

Claude Code is all you need

https://dwyer.co.za/static/claude-code-is-all-you-need.html
576•sixhobbits•16h ago•311 comments

FreeBSD Scheduling on Hybrid CPUs

https://wiki.freebsd.org/Scheduler/Hybrid
44•fntlnz•4d ago•16 comments

OpenSSH Post-Quantum Cryptography

https://www.openssh.com/pq.html
384•throw0101d•18h ago•103 comments

Chris Simpkins, creator of Hack font, has died

https://typo.social/@Hilary/114845913381245488
63•laqq3•3h ago•5 comments

Show HN: Play Pokémon to unlock your Wayland session

https://github.com/AdoPi/wlgblock
88•anajimi•1d ago•36 comments

What does it mean to be thirsty?

https://www.quantamagazine.org/what-does-it-mean-to-be-thirsty-20250811/
46•pseudolus•7h ago•24 comments

Neki – sharded Postgres by the team behind Vitess

https://planetscale.com/blog/announcing-neki
181•thdxr•11h ago•26 comments

How to teach your kids to play poker: Start with one card

https://www.bloomberg.com/news/articles/2025-08-08/how-to-teach-your-kids-poker-with-one-card-at-age-four
67•ioblomov•3d ago•89 comments

Japan's largest paper, Yomiuri Shimbun, sues Perplexity for copyright violations

https://www.niemanlab.org/2025/08/japans-largest-newspaper-yomiuri-shimbun-sues-perplexity-for-copyright-violations/
110•aspenmayer•5h ago•40 comments

Launch HN: Halluminate (YC S25) – Simulating the internet to train computer use

54•wujerry2000•14h ago•39 comments

Ollama and gguf

https://github.com/ollama/ollama/issues/11714
124•indigodaddy•12h ago•50 comments

The value of institutional memory

https://timharford.com/2025/05/the-value-of-institutional-memory/
137•leoc•13h ago•74 comments

You're Wrong About Dates – and Your Code Is Lying to You

https://metaduck.com/youre-wrong-about-dates/
3•pgte•3d ago•1 comment

The History of Windows XP

https://www.abortretry.fail/p/the-history-of-windows-xp
50•achairapart•1d ago•28 comments

Why tail-recursive functions are loops

https://kmicinski.com/functional-programming/2025/08/01/loops/
88•speckx•3d ago•99 comments

Byte Buddy is a code generation and manipulation library for Java

https://bytebuddy.net/
78•mooreds•3d ago•27 comments

Starbucks in Korea asks customers to stop bringing in printers/desktop computers

https://fortune.com/2025/08/11/starbucks-south-korea-policy-desktop-computer-printer-ban-cagongjok/
22•zdw•7h ago•7 comments

How Boom uses software to accelerate hardware development

https://bscholl.substack.com/p/move-fast-and-dont-break-safety-critical
83•flabber•1d ago•66 comments

36B solar mass black hole at centre of the Cosmic Horseshoe gravitational lens

https://academic.oup.com/mnras/article/541/4/2853/8213862?login=false
131•bookofjoe•15h ago•92 comments

AOL to discontinue dial-up internet

https://www.nytimes.com/2025/08/11/business/aol-dial-up-internet.html
178•situationista•22h ago•182 comments

The Joy of Mixing Custom Elements, Web Components, and Markdown

https://deanebarker.net/tech/blog/custom-elements-markdown/
95•deanebarker•13h ago•34 comments