I thought I was going crazy: I was using Gemini 3.5 Flash to rate some answers, and it kept giving correct answers a 7 instead of a 10.

Apparently, once the prompt includes a "Grading criteria" text, the model's scores get "compressed toward the center of the scale" (a hallucination, or maybe training set overfitting).
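For anyone who wants to check whether this is systematic rather than a one-off, here is a minimal sketch of how the compression could be quantified. It assumes you have already collected numeric grades for answers you know are correct, once with and once without the "Grading criteria" text. The function name `compression_report` and the 1 to 10 scale bounds are my assumptions (only the top score of 10 comes from the behavior described above), and the numbers in the usage lines are placeholders, not measurements.

```python
from statistics import mean


def compression_report(scores, scale_min=1, scale_max=10):
    """Summarize grades given to known-correct answers.

    scale_min / scale_max are assumed bounds of the grading scale;
    adjust them to whatever scale the prompt actually asks for.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    midpoint = (scale_min + scale_max) / 2
    return {
        # Average grade; should sit near scale_max for correct answers.
        "mean": mean(scores),
        # Fraction of correct answers that actually received the top grade.
        "share_at_max": sum(s == scale_max for s in scores) / len(scores),
        # Small values mean the grades cluster around the middle of the scale.
        "mean_distance_from_midpoint": mean(abs(s - midpoint) for s in scores),
    }


# Placeholder inputs for illustration only; replace with grades from real runs.
with_criteria = compression_report([7, 7, 7, 7])
without_criteria = compression_report([10, 10, 10, 10])
```

If the "with criteria" report shows a low `share_at_max` and a small `mean_distance_from_midpoint` across many runs while the "without criteria" report does not, that would support the compression explanation over plain randomness.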

Someone on X asked me to try to reproduce it, and I actually got it on the first try on their Gemini Chat:

https://x.com/XCSme/status/2057613611959279988

I am not sure what to make of this model (or of most SOTA models). They got a lot smarter at coding and tool usage, but a lot dumber in other ways...

Why does it look like LLMs consistently overestimate implementation time?

Tell HN: Gemini 3.5 Flash breaks in stupid ways

Ask HN: Failing interviews for mid-level SWE in UK, advice please

Debatable but likely not insane: there MAY be an issue with SpaceX' hiring

Tell HN: I'm tired of AI-generated answers

Ask HN: Shouldn't Google need to give a public statement about Railway incident?

Valgrind-3.27.1 Is Available

Ask HN: Is HN Blocking Mullvad VPN?

Ask HN: Anyone else struggling with AI and work?

Ask HN: Are there any serious efforts to organize tech labor now?

Alternatives to HN for "tech outside of AI" discussion?

Ask HN: Are there any social media sites that are AI positive?

Tell HN: Google banned Railway's account. Everything down

Can one run AI on source code with the prompt "Find below-avg swear rate files"?

Ask HN: How does everyone talk about their work when they've used AI?

Ask HN: How to manage AI APIs for SaaS application?

Ask HN: Suggest Google Antigravity Alternative

Ask HN: How to make a mono-repo AI-Ready?

Ask HN: Sorry, what Was FiveThirtyEight?

Ask HN: Does root have to be uid 0? Does uid 0 have to be root?

Did moving to new place have intended effect?

Ask HN: What are Stainless users doing now that Anthropic has killed it?

Do you enjoy reading any type of AI written text?

Anthropic is killing stainless, so we built our own SDK/MCP generator

Ask HN: Antigravity 2.0 installer breaks existing Antigravity IDEs

Ask HN: Is grpcurl home page compromised?

Tell HN: Gemini 3.5 Flash breaks in stupid ways
