1. Claude is somewhat better at whole-codebase tasks, where you need to reason over a bunch of context and consider system interactions.
2. GLM-4.5 is somewhat better at being "honest" — i.e. I rarely see it doing the things Claude does like making broken tests pass by changing the test instead of fixing the bug.
Both are quite good though, and GLM-4.5 has found bugs that both Claude 4 Sonnet and 4.1 Opus have failed to catch. In general I think Claude wins a little more frequently on debugging tasks than GLM-4.5, but it's close.
Compared to GPT-5, both Claude and GLM feel like they're more consistent, although GPT-5 sometimes has long brilliant runs where it nails everything with subjectively higher code quality than either of the latter. However, once GPT-5 goes off the rails, it's hard to get it back on track, so it can be a bit frustrating to work with in comparison.
If you're striving for full privacy, I think you should implement the secure enclave idea presented by Ollama; it makes the entire pipeline fully encrypted. I'm waiting for an actual provider to finally implement this.
Compared to using — for example — DeepSeek from deepseek.com, I think we're much more private. Even compared to using OpenAI and opting-out of your data being used for training, we're still more private, since OpenAI makes no guarantees for individuals that they don't store the data — notably, any data ever sent to them is apparently now being shared with New York courts (and the New York Times!) due to their ongoing legal battle with the New York Times [1]. And compared to using OpenRouter with "data_collection: deny", we uh, actually work :P Surprisingly sad how many broken model implementations there are if you're just round-robin-ing between inference companies... Especially reasoning models, and especially with tool-calling.
(And if something's broken, you can email us and we'll generally fix it; OpenRouter doesn't actually host any models themselves, so there's not much they can do if one isn't working well other than just de-list.)
1: https://arstechnica.com/tech-policy/2025/07/nyt-to-start-sea...
Also, your Privacy Policy is not currently EU GDPR compliant. ;-)
We have servers in the EU and US, but right now there's no way to route to only EU (or US) DCs.
Couple of things stood out:
- No nominated representative to act as local contact for data subjects and supervisory authorities.
- A few standard clauses relating to what and where data is transferred / stored etc.
Nothing too strenuous, but since you've gone to the effort of having EU servers, I guess you're expecting EU customers, so it's probably worth tweaking your privacy policy.
I look forward to experimenting with GLM-4.5 through your service!
> we will not sell [personal information], except as follows:
> - We work with business partners who support us.
Uhhm, that doesn't inspire a lot of confidence TBH!
> These third-party service providers are prohibited from using personal information for any other purpose and are contractually required to comply with all applicable laws and requirements, which may include Payment Card Industry Data Security Standards if they are processing payments.
We use third parties like Stripe and Clerk, and by nature of using those services, your information is disclosed to them. This is an extremely common clause in privacy policies, and one we need to have unless we roll everything ourselves. We're much more private than using, say, OpenAI, DeepSeek, Anthropic, or most popular LLM services.
I'd recommend rephrasing your marketing, because not storing prompts at all is a much bigger selling point than deleting them after 14 days (a lot can happen in 14 days).
So far, which model can find or solve which problems is really scattered all over the place.
GPT-5: Exceptional at abstract reasoning, planning and following the intention behind instructions. Concise and intentional. Not great at manipulating text or generating Python code.
Gemini 2.5 Pro: Exceptional at manipulating text and Python, not great at abstract reasoning. Verbose. Doesn't follow instructions well.
Another thing I've learned is that models work better when they work on code that they themselves generated. It's "in distribution" and more comprehensible to them.
My very first use of the llm-consortium saw me feeding in its own source code to look for bugs. It surfaced a serious bug which only one out of the three models had spotted. Lots of problems are NP-ish, so parallel sampling works really well. Google's IMO gold and OpenAI's IOI gold both used parallel reasoning of some sort.
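A minimal sketch of the parallel-sampling idea (my own illustration, not the llm-consortium's actual implementation; query_model and the model names are placeholders):

    # Sketch only: query_model() and the model names are hypothetical
    # placeholders, not the llm-consortium's real API.
    from concurrent.futures import ThreadPoolExecutor

    MODELS = ["model-a", "model-b", "model-c"]

    def query_model(model: str, prompt: str) -> str:
        """Placeholder: call whatever inference API you actually use."""
        raise NotImplementedError

    def parallel_review(source_code: str) -> list[str]:
        prompt = f"List any bugs you find in this code:\n\n{source_code}"
        # Fan the same prompt out to all models concurrently and keep every
        # answer; a bug that only one of the three models spots still surfaces.
        with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
            return list(pool.map(lambda m: query_model(m, prompt), MODELS))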
DeepSeek R1 (does high-level planning) combined with Qwen3 480B (does low-level coding), or whatever is available from the Qwen Code APIs.
It's working great.
It solves 99.99% of problems on its own.
The separation isn't very good in Aider, so I plan to make my own tool later to get a better workflow.
DeepSeek R1 + Qwen3 is close enough, along with Gemini 2.5 Pro,
so I don't see any point in using Claude anymore.
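A rough sketch of that planner/coder split, assuming a generic chat() wrapper (the function and model names are placeholders, not Aider's internals; Aider's architect/editor mode is roughly this shape):

    # Sketch only: chat() and the model names are hypothetical placeholders.
    def chat(model: str, prompt: str) -> str:
        """Placeholder: send one prompt to the given model, return its reply."""
        raise NotImplementedError

    def solve(task: str, code_context: str) -> str:
        # High-level planning with the reasoning model.
        plan = chat("deepseek-r1",
                    f"Break this task into concrete code changes:\n{task}\n\nContext:\n{code_context}")
        # Low-level implementation with the coding model.
        return chat("qwen3-coder",
                    f"Implement exactly this plan and return a diff:\n{plan}\n\nContext:\n{code_context}")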
The post-training methodology (Sec 3) is what really stands out to me. The idea of creating specialized 'expert models' for reasoning, agents, and chat, and then distilling their capabilities into a final unified model is a fascinating approach. It feels like a more structured way to solve the "jack of all trades, master of none" problem that can plague generalist models. Instead of just mixing all the data, they're essentially having a generalist learn from a committee of specialists.
A couple of the findings from their RL experiments are pure gold for anyone working in this space. The counter-intuitive result that a single-stage RL process at the full 64K context length outperforms a progressive, multi-stage approach (Fig 6) is a fantastic lesson. I've seen teams assume the opposite would be true. Also, the pragmatic choice to use an XML-like template for function calls to avoid JSON escaping hell (Fig 4) may be a small but brilliant engineering decision that makes a huge difference in practice. Wrangling escaped code inside JSON turns out to be a mess.
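To make the JSON-escaping point concrete, here's a toy comparison (my own illustration; the tag names are made up and not necessarily the paper's exact template):

    import json

    code_arg = 'print("hello")\nif x == 1:\n    do_thing()'

    # JSON-style tool call: the code argument must be escaped, which is exactly
    # where models tend to mangle quotes, backslashes and newlines.
    json_call = json.dumps({"name": "write_file",
                            "arguments": {"path": "a.py", "content": code_arg}})

    # XML-like template: arguments are delimited by tags, so the code is
    # embedded verbatim with no escaping at all.
    xml_call = ("<tool_call>write_file"
                "<arg_key>path</arg_key><arg_value>a.py</arg_value>"
                f"<arg_key>content</arg_key><arg_value>{code_arg}</arg_value>"
                "</tool_call>")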
The performance on SWE-bench is impressive, putting it in the same league as much larger or proprietary models. What I’d love to see, and maybe others here have thoughts, is whether this hybrid training recipe holds up outside ARC-style evals. For example, do the agentic improvements transfer to messier, real-world workflows where APIs are undocumented, partial failures are common, and user input is full of ambiguity?
Can a small team working on ASI or domain-specific models stick to scaling a 2024-era best-practices training stack? Or will they miss massive improvements?
No. Many are aimed at cleaning/aligning noisy, mixed-domain data. With abundant, high-quality domain data, you can skip most of the complexity and focus on direct SFT/RL on your corpus.
> Can a small team stick to scaling 2024-era best practices?
2024 was the year of SFT. I believe fitting reasoning traces to your final responses via RL is the technique du jour of 2025. Jumping from SFT to RL training might be the biggest gain here, if RL can be applied to your problem (e.g. math, coding, etc.).
edit: looks like I'm not the first person to notice this about this poster, either. https://news.ycombinator.com/item?id=44279662
I think we have a duty to call this out, before the web becomes ridden with slop.
"...a fascinating approach..." (LLMs think everything is fascinating)
"...they're essentially having a generalist learn from a committee of specialists..." (analogies, analogies)
"...where APIs are undocumented, partial failures are common, and user input is full of ambiguity..." (typical AI rule of three template with semantically similar parameters that contribute nothing to the overall meaning)
Tinfoil hat time, but perhaps the bots don't like being called out? I don't actually take that statement seriously, but it seems like an eventual avenue. They've long been seeding threads on Reddit to shape the initial hive mind; I imagine that's going to get more advanced and widespread.
> ...are pure gold for anyone working in this space...
Specifically OpenAI
It felt interesting and informative to me, but I didn’t verify any of it.
Good eye btw.
(Re: Other post you linked to. it is entirely my own thoughts.)
OpenAI, Anthropic and others are not even publishing their weights.
Does anyone have any background information on the authors? Have they published similarly impressive works in the past?
-----
First, the question is: "How many bs in bluberry?" That seems like a typo. I think it's meant to be "blueberry," but it's spelled "bluberry" with one 'e' missing. Also, "bs" probably refers to the letter 'b'.
So, rephrasing the question: How many times does the letter 'b' appear in the word "blueberry"?
But the word given is "bluberry," which might be a misspelling of "blueberry." Let me confirm the correct spelling. The standard spelling is "blueberry," which has two 'e's: b-l-u-e-b-e-r-r-y.
The question says "bluberry," which is missing the second 'e', so it's spelled b-l-u-b-e-r-r-y. That might be intentional or a mistake. I should answer based on the word as given.
The question is: "How many bs in bluberry?" So, I need to count the 'b's in "bluberry."
Let me write it out: B-L-U-B-E-R-R-Y.
Now, I'll go through each letter:
Position 1: B (that's one b)
Position 2: L (not b)
Position 3: U (not b)
Position 4: B (that's the second b)
Position 5: E (not b)
Position 6: R (not b)
Position 7: R (not b)
Position 8: Y (not b)
So, there are two 'b's in "bluberry."
But is "bluberry" the correct word? It's probably a misspelling of "blueberry." In "blueberry," it's B-L-U-E-B-E-R-R-Y.
Position 1: B (one)
Position 2: L
Position 3: U
Position 4: E
Position 5: B (two)
Position 6: E
Position 7: R
Position 8: R
Position 9: Y
"Blueberry" has nine letters, and there are two 'b's: one at the start and one after the 'e'.
In "bluberry," as given, it's eight letters: B-L-U-B-E-R-R-Y, with two 'b's.
Perhaps the question is tricking me with the misspelling. But I should answer based on what's written.
Another thought: "bs" might be plural for 'b', meaning how many 'b's, or it could be something else. But in context, it's likely about the letter.
Maybe it's a riddle. "Bluberry" sounds like "blueberry," but spelled differently.
Perhaps "bs" refers to something else, like the fruit or something, but that seems unlikely. The question says "bs," which is probably "b's" meaning the letter b.
Let me read the question carefully: "How many bs in bluberry?"
It might be a play on words. "Bluberry" could be intended as "blueberry," and "bs" might be misinterpreted.
Another idea: in some contexts, "b" could stand for something, but here it's probably straightforward.
Perhaps it's about the sound. But the question is written, so it's about spelling.
I think I should just count the 'b's in the given word.
So, in "bluberry": letters are B, L, U, B, E, R, R, Y. So, B appears twice.
Therefore, there are two 'b's.
But to be thorough, let's consider if "bluberry" is a word. It's not standard; it's a misspelling. However, for the purpose of this question, I'll use it as is.
Perhaps the question has a typo, and it's meant to be "blueberry." But since it's written as "bluberry," I'll go with that.
Final answer should be 2.
The question says "bs," which might be lowercase, but in counting, case usually doesn't matter, and here it's all lowercase or mixed? The word is "bluberry," which is lowercase, and "bs" is lowercase, so it's fine.
In English, when counting letters, we don't distinguish case unless specified.
So, I think two is correct.
But let me double-check the spelling. The user wrote "bluberry," which has one 'u' and no 'e' after 'u', but in "blueberry," it's "blue" with 'e'.
In "bluberry," it's "blu" without 'e', then "berry."
But in terms of letter count, it's clear.
Perhaps it's "blueberry" misspelled, and I should correct it, but I think for accuracy, I'll use the given spelling.
Another thought: the question might be "how many b's in blueberry?" but written with a typo.
But since it's "bluberry," I'll answer based on that.
So, response should be: There are two 'b's in "bluberry."
To make it clear, I can say: The word "bluberry" contains the letter 'b' twice.
But the question is "how many bs," so I can just say 2.
Perhaps it's a trick with the word "berry" having one 'b', but "blu" has one, so total two.
"Blu" has one 'b', "berry" has one 'b', but in "bluberry," it's combined, so still two 'b's.
Yes.
I think I'm overthinking it.
Final answer: 2
-----
It’s like asking us what the average wavelength is when looking at a blueberry: the information is actually there somewhere in our processing stack, but inaccessible to reasoning. It can be worked out logically from general knowledge, but probably inaccurately, and the gotcha of “you’re looking right at it and the photons are hitting your vision” is not much of a gotcha when you understand how it works.
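A quick way to see the analogous gap for an LLM (this uses OpenAI's tiktoken as a stand-in tokenizer; GLM's own tokenizer will split the word differently, but the point is the same):

    # The model consumes token IDs, not letters; counting 'b's in the raw string
    # is trivial in code but indirect for a model reasoning over tokens.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ("bluberry", "blueberry"):
        pieces = [enc.decode([t]) for t in enc.encode(word)]
        print(word, "->", pieces, "| 'b' count:", word.count("b"))
    # Both spellings contain exactly two 'b's, but the model never sees them as
    # individual characters unless it spells the word out, as in the trace above.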
> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.
This matches with the calculation I did for GLM-4.5 (355B A32B):
In [14]: 356732107008 - (775946240 * 2) # token_embd / output are 775946240 each. assume omitted
Out[14]: 355180214528
In [15]: 356732107008 - 339738624000 - (775946240 * 2) # parameters that are always active
Out[15]: 15441590528
In [16]: 339738624000 * 8 / 160 # parameters from activated experts
Out[16]: 16986931200.0
Meanwhile, the GPT-OSS series includes both the embedding layer and the output layer when counting the total parameters, but only includes the output layer when counting the active parameters:

> We refer to the models as “120b” and “20b” for simplicity, though they technically have 116.8B and 20.9B parameters, respectively. Unembedding parameters are counted towards active, but not embeddings.
And Qwen3 series includes both the embedding layer and the output layer when counting both the total parameters and the active parameters.
Why is there no standard for counting? Which approach is more accurate?
For activated parameters: All unembedding parameters are used in every inference step during token generation, but only one column of the embeddings is used (if done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
correct activated params:
* DeepSeek V3/R1 series
* Kimi K2
* GPT-OSS series

undercount activated params:
* GLM-4.5 series

overcount activated params:
* DeepSeek V2 series
* Qwen3 series
* Ernie 4.5 series
* Hunyuan A13B
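To make that convention concrete, here's the arithmetic for GLM-4.5 using the figures from the IPython session above (I'm reusing those numbers as-is, not recomputing them from the checkpoint):

    # Figures taken from the IPython session above (GLM-4.5, 355B-A32B).
    total         = 356_732_107_008
    expert_params = 339_738_624_000      # all routed-expert weights
    embed_or_head = 775_946_240          # token embedding and output head, each

    always_active  = total - expert_params - 2 * embed_or_head   # ~15.4B dense params
    active_experts = expert_params * 8 // 160                    # 8/160 of expert params active, ~17.0B

    # GLM's own convention (exclude embedding and output head): ~32.4B
    print((always_active + active_experts) / 1e9)
    # Counting the full output head as argued above (one embedding row is negligible): ~33.2B
    print((always_active + active_experts + embed_or_head) / 1e9)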