Like any LLM benchmark, LMArena is highly flawed, but I do think it has a right to exist. Anecdotally, it has been indicative of which LLM's style I like best, not necessarily of factual accuracy. It hasn't, however, been a very useful tool for finding the best LLM for a given job.
To the article's point, though, it's treated as the gold standard, which it isn't. We should have learned that from sycophancy-gate.
I'm not sure the methodology here is really sound for the question at hand. It's a bit like saying prediction markets don't work because 40% of the people who bet were wrong.
You can't get around running your own benchmarks on the job at hand if you really want 95th-percentile performance on a task.
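A minimal sketch of what "run your own benchmark" could look like: score candidate models on prompt/expected pairs drawn from your actual workload. The `ask_model` function here is a stub with canned answers so the sketch is self-contained; in practice it would wrap each provider's real API.

```python
def ask_model(model: str, prompt: str) -> str:
    # Stub standing in for a real LLM API call (hypothetical models/answers).
    canned = {
        ("model-a", "What is 2+2?"): "4",
        ("model-b", "What is 2+2?"): "5",
        ("model-a", "Capital of France?"): "Paris",
        ("model-b", "Capital of France?"): "Paris",
    }
    return canned[(model, prompt)]

def benchmark(models, cases):
    """cases: (prompt, expected) pairs taken from the job at hand."""
    scores = {}
    for m in models:
        hits = sum(ask_model(m, p).strip() == want for p, want in cases)
        scores[m] = hits / len(cases)
    return scores

cases = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
print(benchmark(["model-a", "model-b"], cases))  # model-a: 1.0, model-b: 0.5
```

Exact-match scoring only works for tasks with a single right answer; for open-ended tasks you'd swap in a rubric or an LLM-as-judge comparison, but the shape of the harness stays the same.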