Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

https://arxiv.org/abs/2510.13878
1•PaulHoule•2h ago

Comments

adamzwasserman•1h ago
I see problems:

The paper claims that Qwen3-4B achieved 89.2% best-arm selection, demonstrating superior "probabilistic reasoning". But this is a 2-armed bandit, where random guessing should converge to ~50% over 500 runs of 25 iterations each. An 89% rate is suspiciously high and suggests to me that something else is happening (like prompt bias, or the model pattern-matching rather than reasoning).

When they increase from 2 to 5 arms, Qwen3-4B drops from 89% to 6.5% accuracy. I assert that if it truly had probabilistic reasoning capability, performance would degrade more gracefully.
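For reference, here is a minimal sketch of the chance baseline these numbers are being compared against. It assumes "best-arm selection" means the fraction of runs in which the chosen arm is the true best arm, and that a random guesser picks uniformly among the arms; the paper's exact metric may differ, and `random_guess_rate` is a hypothetical helper, not something from the paper.

```python
import random


def random_guess_rate(n_arms, n_runs=500, seed=0):
    """Fraction of runs in which a uniform random pick hits the best arm.

    Assumption: arm 0 is the best arm, and each run ends in one final pick.
    """
    rng = random.Random(seed)
    best_arm = 0
    hits = sum(rng.randrange(n_arms) == best_arm for _ in range(n_runs))
    return hits / n_runs


if __name__ == "__main__":
    for k in (2, 5):
        # Expected value is 1/k: about 0.5 for 2 arms, about 0.2 for 5 arms.
        print(k, "arms:", random_guess_rate(k))
```

Under that assumption the 5-arm chance level is about 20%, so a 6.5% rate would be below uniform guessing, not just a degraded version of the 2-arm result.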

The "overthinking" explanation is hand-wavy. I don't see evidence or a chain of reasoning supporting it. It reads as a post-hoc story to explain unexpected results.

No discussion of variance, confidence intervals, or statistical significance. With 500 runs, these should be straightforward to calculate.
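To show how little work this would be, here is a sketch of a 95% Wilson score interval for a binomial proportion. It assumes the 89.2% figure is a per-run success rate over 500 independent runs (so 446 successes); if the paper counts per-pull choices instead, the sample size and the independence assumption both change. `wilson_interval` is a hypothetical helper written for this comment.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumption: 89.2% of 500 runs = 446 runs selecting the best arm.
    low, high = wilson_interval(446, 500)
    print(f"95% CI: {low:.3f} to {high:.3f}")
```

With those assumed counts the interval comes out at roughly 86% to 92%, which is the kind of figure I'd expect to see reported next to the headline number.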

Does the claimed 89% accuracy in a binary choice task strike anyone else as implausibly high for what they're claiming?

Datus, a data engineering agent that builds evolvable context for data system

https://github.com/Datus-ai/Datus-agent
1•jinqueeny•12s ago•0 comments

Why We Create Our Suffering

https://chrislakin.blog/p/locally-optimal
1•eatitraw•1m ago•0 comments

The Complexity Cliff: Why Reasoning Models Work Right Up Until They Don't

https://rewire.it/blog/the-complexity-cliff-why-reasoning-models-work-until-they-dont/
1•jonbaer•2m ago•0 comments

Ask HN: VC-funded startup lurking in my community Slack – how to respond?

1•nduncan_hmc•2m ago•0 comments

LazyLLM, Easiest and laziest way for building multi-agent LLMs applications

https://github.com/LazyAGI/LazyLLM
1•jinqueeny•2m ago•0 comments

Space junk may have struck a Chinese crew ship in low-Earth orbit

https://arstechnica.com/space/2025/11/landing-postponed-for-chinese-astronauts-after-suspected-sp...
1•gnabgib•3m ago•0 comments

Photos: New Phoenix Microcenter is a 'tech-heaven' for geeks

https://www.phoenixnewtimes.com/arts-culture/micro-center-in-phoenix-a-look-inside-the-new-tech-h...
1•tortilla•5m ago•0 comments

LLKV: SQL and Apache Arrow and KV Storage

https://github.com/jzombie/rust-llkv
1•zombiej5•8m ago•1 comments

Bubble needs to be destroyed before it destroys the country

1•moosedman•11m ago•2 comments

What happened when Trump met Xi?

https://www.brookings.edu/articles/what-happened-when-trump-met-xi/
1•ilamont•17m ago•0 comments

Ask HN: How different is compute orchestration for AI?

1•alpb•18m ago•0 comments

The Power Problem – A Silicon Valley Story

https://syntheticauth.ai/posts/synthetic-auth-the-power-problem
1•zerolayers•20m ago•0 comments

Myopia's Global Impact, by the Numbers

https://europe.ophthalmologytimes.com/view/myopia-s-global-impact-by-the-numbers-prevalence-proje...
1•plun9•23m ago•0 comments

Marines offer tech-savvy recruits $15,000 to enlist

https://taskandpurpose.com/news/marines-bonuses-recruits-tech/
1•ilamont•24m ago•0 comments

Windows Update triggers BitLocker recovery on business PCs

https://www.windowslatest.com/2025/11/05/microsoft-warns-windows-11-25h2-24h2-october-update-trig...
10•jinxmeta•25m ago•0 comments

FAA reducing air traffic by 10% across 40 'high-volume' markets

https://apnews.com/article/government-shutdown-airlines-faa-e39c423facec2b0dcc2544af48de0fa1
5•awnird•25m ago•3 comments

OpenAI Wants Federal Backstop

https://finance.yahoo.com/video/openai-wants-federal-backstop-investments-201700279.html
4•vinyl7•29m ago•2 comments

Democrats gird for longer shutdown fight after election sweep

https://www.politico.com/news/2025/11/05/democrats-shutdown-fight-elections-00638207
2•moosedman•30m ago•0 comments

Show HN: Reverse Engineer Web Apps

https://github.com/VectorlyApp/web-hacker
2•rayruizhiliao•32m ago•0 comments

Takeaways from Trump's rocky Supreme Court arguments over global tariffs

https://www.cnn.com/2025/11/05/politics/takeaways-supreme-court-tariffs-trump
1•rawgabbit•39m ago•0 comments

Universal music group and udio announce new licensed AI music creation platform

https://www.universalmusic.com/universal-music-group-and-udio-announce-udios-first-strategic-agre...
1•lastdong•41m ago•0 comments

A PoC to make a backdoored PyTorch neural network

https://hacktelligence.org/backdoor_pytorch/
1•Eclipser•42m ago•0 comments

CoreWeave CEO Plays Down Concerns About AI-Spending Bubble

https://www.wsj.com/tech/ai/coreweave-ceo-plays-down-concerns-about-ai-spending-bubble-5a21a6ee
1•moosedman•44m ago•2 comments

Grammarly Rebrands Company as Superhuman, Introduces Superhuman Suite

https://www.grammarly.com/blog/company/announcing-company-rebrand-to-superhuman/
2•healsdata•44m ago•0 comments

Microsoft apologises, offers refunds to 2.7M Australians

https://www.smh.com.au/business/consumer-affairs/microsoft-apologises-offers-refunds-to-2-7-milli...
2•femto•46m ago•2 comments

Show HN: Which technologies are trending in job posts

https://trends.sumble.com/?techs=cursor%2Cclaude-code%2Cgithub-copilot
4•antgoldbloom•47m ago•0 comments

LeetCode for DevOps

https://labs.iximiuz.com/challenges
1•valyala•48m ago•0 comments

Louis Rossman – Why I uninstalled uBlock origin and switched to AdNauseam [video]

https://www.youtube.com/watch?v=7GeCq1qwqjc
4•anonycat•49m ago•0 comments

The Basic Laws of Human Stupidity (1987) [pdf]

https://gandalf.fee.urv.cat/professors/AntonioQuesada/Curs1920/Cipolla_laws.pdf
3•bookofjoe•50m ago•1 comments

Peatlands' 'huge reservoir' of carbon at risk of release, researchers warn

https://phys.org/news/2025-10-peatlands-huge-reservoir-carbon.html
2•PaulHoule•51m ago•0 comments