
Show HN: Django N+1 Queries Checker

https://github.com/richardhapb/django-check
1•richardhapb•5m ago•1 comment

Emacs-tramp-RPC: High-performance TRAMP back end using JSON-RPC instead of shell

https://github.com/ArthurHeymans/emacs-tramp-rpc
1•todsacerdoti•10m ago•0 comments

Protocol Validation with Affine MPST in Rust

https://hibanaworks.dev
1•o8vm•14m ago•1 comment

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
2•gmays•16m ago•0 comments

Show HN: Zest – A hands-on simulator for Staff+ system design scenarios

https://staff-engineering-simulator-880284904082.us-west1.run.app/
1•chanip0114•17m ago•1 comment

Show HN: DeSync – Decentralized Economic Realm with Blockchain-Based Governance

https://github.com/MelzLabs/DeSync
1•0xUnavailable•22m ago•0 comments

Automatic Programming Returns

https://cyber-omelette.com/posts/the-abstraction-rises.html
1•benrules2•25m ago•1 comment

Why Are There Still So Many Jobs? The History and Future of Workplace Automation [pdf]

https://economics.mit.edu/sites/default/files/inline-files/Why%20Are%20there%20Still%20So%20Many%...
2•oidar•27m ago•0 comments

The Search Engine Map

https://www.searchenginemap.com
1•cratermoon•34m ago•0 comments

Show HN: Souls.directory – SOUL.md templates for AI agent personalities

https://souls.directory
1•thedaviddias•36m ago•0 comments

Real-Time ETL for Enterprise-Grade Data Integration

https://tabsdata.com
1•teleforce•39m ago•0 comments

Economics Puzzle Leads to a New Understanding of a Fundamental Law of Physics

https://www.caltech.edu/about/news/economics-puzzle-leads-to-a-new-understanding-of-a-fundamental...
2•geox•40m ago•0 comments

Switzerland's Extraordinary Medieval Library

https://www.bbc.com/travel/article/20260202-inside-switzerlands-extraordinary-medieval-library
2•bookmtn•40m ago•0 comments

A new comet was just discovered. Will it be visible in broad daylight?

https://phys.org/news/2026-02-comet-visible-broad-daylight.html
2•bookmtn•45m ago•0 comments

ESR: Comes the news that Anthropic has vibecoded a C compiler

https://twitter.com/esrtweet/status/2019562859978539342
1•tjr•46m ago•0 comments

Frisco residents divided over H-1B visas, 'Indian takeover' at council meeting

https://www.dallasnews.com/news/politics/2026/02/04/frisco-residents-divided-over-h-1b-visas-indi...
3•alephnerd•47m ago•1 comment

If CNN Covered Star Wars

https://www.youtube.com/watch?v=vArJg_SU4Lc
1•keepamovin•53m ago•2 comments

Show HN: I built the first tool to configure VPSs without commands

https://the-ultimate-tool-for-configuring-vps.wiar8.com/
2•Wiar8•56m ago•3 comments

AI agents from 4 labs predicting the Super Bowl via prediction market

https://agoramarket.ai/
1•kevinswint•1h ago•1 comment

EU bans infinite scroll and autoplay in TikTok case

https://twitter.com/HennaVirkkunen/status/2019730270279356658
6•miohtama•1h ago•4 comments

Benchmarking how well LLMs can play FizzBuzz

https://huggingface.co/spaces/venkatasg/fizzbuzz-bench
1•_venkatasg•1h ago•1 comment

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
19•SerCe•1h ago•12 comments

Octave GTM MCP Server

https://docs.octavehq.com/mcp/overview
1•connor11528•1h ago•0 comments

Show HN: Portview what's on your ports (diagnostic-first, single binary, Linux)

https://github.com/Mapika/portview
3•Mapika•1h ago•0 comments

Voyager CEO says space data center cooling problem still needs to be solved

https://www.cnbc.com/2026/02/05/amazon-amzn-q4-earnings-report-2025.html
1•belter•1h ago•0 comments

Boilerplate Tax – Ranking popular programming languages by density

https://boyter.org/posts/boilerplate-tax-ranking-popular-languages-by-density/
1•nnx•1h ago•0 comments

Zen: A Browser You Can Love

https://joeblu.com/blog/2026_02_zen-a-browser-you-can-love/
1•joeblubaugh•1h ago•0 comments

My GPT-5.3-Codex Review: Full Autonomy Has Arrived

https://shumer.dev/gpt53-codex-review
2•gfortaine•1h ago•0 comments

Show HN: FastLog: 1.4 GB/s text file analyzer with AVX2 SIMD

https://github.com/AGDNoob/FastLog
2•AGDNoob•1h ago•1 comment

God said it (song lyrics) [pdf]

https://www.lpmbc.org/UserFiles/Ministries/AVoices/Docs/Lyrics/God_Said_It.pdf
1•marysminefnuf•1h ago•0 comments

Llama 4 Smells Bad

https://fastml.com/llama-4-smells-bad/
41•alexmolas•9mo ago

Comments

croisillon•9mo ago
did Meta open a time wormhole to release Llama 4 on May 5th?
GaggiX•9mo ago
>GPT-4o-mini, which is, according to itself, a model with 1.3B or 1.5B or 1.7B parameters

I have no idea how the author can remotely trust GPT-4o-mini in this case. The number of parameters is almost certainly way off.

simonw•9mo ago
This article's credibility suffers a little from the way it talks about GPT-4o mini:

"just in front of GPT-4o-mini, which is, according to itself, a model with 1.3B or 1.5B or 1.7B parameters, depending on when you ask."

Then later:

"On the Artificial Analysis benchmark Scout achieved the same score as GPT 4o mini. A 109B model vs a 1.5B model (allegedly). This is ABYSMAL."

Asking models how many parameters they have doesn't make sense.

There is absolutely no way GPT-4o mini is 1.5B. I can run a 3B model on my iPhone, but it's a fraction of the utility of GPT-4o mini.

fancyfredbot•9mo ago
It's strange because there's no need to make this assumption about GPT-4o mini in order to demonstrate the point.
gliptic•9mo ago
It's strange that someone from FastML can be confused about this, unless it's supposed to be a bad joke.
lhl•9mo ago
This is just someone's personal blog/opinion. I wouldn't read too much into it... "The site is run by Zygmunt Zajc (pronounced “Ziontz”). ... An economist by education"
gliptic•9mo ago
Ah, FastML is an extremely overloaded name.
PunchTornado•9mo ago
We are in for a lot of pain if seemingly intelligent people make mistakes like this: grabbing the number of params from whatever GPT tells you. How can you do that?
jychang•9mo ago
Correct, in that models know nothing about themselves other than what they are told. Deepseek R1 will tell you that it's created by OpenAI.

GPT-4o mini is supposed to be ~8B params, according to estimates.

simonw•9mo ago
The best source I've seen for the 8B number is this TechCrunch article: https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini...

> OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash.

As far as I can tell all of the 8B rumors were seeded by that loosely sourced comparison to Llama 3 8B.

I know for a fact that Gemini 1.5 Flash is NOT an 8B model, because a separate model called "Gemini 1.5 Flash 8B" came out after that article was published - the first Gemini model with a documented parameter count. Flash 8B is priced at half the cost of regular Flash.

There's also this paper that mentions 8B but provides no source for that at all, which makes me suspect their initial source may have been that TechCrunch rumor: https://arxiv.org/pdf/2412.19260

jychang•9mo ago
8B params is a reasonable estimate.

As a sanity check, we can look at scores for how it performs. On livebench, GPT-4o-mini scores 37.63, right next to Gemini 1.5 Flash 8B at 34.37, and above Qwen2.5 7B Instruct Turbo/Gemma 3 4B at 29.22/30.13. And it's below Phi-4 14b at 40.68, and Gemma 3 12B at 41.25.

gwd•9mo ago
The comment at the top says it's a draft. It's not unreasonable to ask for random values from a GPT for "filler" for the draft (or even just make them up), just to stay in the flow, and then track down the real numbers later.
NitpickLawyer•9mo ago
I flagged it for these reasons as well. It's just a bad article. Shows very poor understanding of the basics of LLM workings, and the field in general.

Lingers on the "cheated" benchmark (lmsys) but never mentions all the other 3rd-party benchmarks performed after the inference fixes, which are in line with what Meta originally published. To be clear, submitting a differently fine-tuned model to one arena while releasing the untuned model, without clearly mentioning this, is bad. But conflating the "human preference" bench with the others and not mentioning the model's capabilities on other benchmarks is also bad writing.

The MoE paragraphs are bad, and the writer never explains why the "17B vs. VRAM size" comparison is bad; they just leave it there unexplained.

Poor form, I was expecting better from someone working in this field.

bradley13•9mo ago
This seems to be a general problem at the moment. The most usable models are not the newest. The newer models (obviously, I haven't tried them all) may do better on benchmarks, but actual usability is worse.

To create useful LLMs required some genuine breakthroughs. It seems to me that we have reached the limits of what we can achieve with current architectures. Progress will require some new insights and breakthroughs.

fancyfredbot•9mo ago
If you game the benchmark then you always get found out by your users. Yet the practice remains common in hardware. Outright lies are uncommon but misleading and cherry picked numbers are pretty much standard practice.

The fact that misleading benchmarks don't even drive profit at Meta didn't seem to stop them doing the same thing, but perhaps this isn't very surprising. I imagine internal incentives are very similar.

Unlike the hardware companies though, gaming the benchmark in LLMs seems to involve making the actual performance worse, so perhaps there is more hope that the practice will fade away in this market.

danielhanchen•9mo ago
There were actually multiple bugs which impacted long context benchmarks and general inference - I helped fix some of them.

1. RMS norm eps was 1e-6, but should be 1e-5 - see https://github.com/huggingface/transformers/pull/37418

2. Llama 4 Scout changed RoPE settings after release - conversion script for llama.cpp had to be fixed. See https://github.com/ggml-org/llama.cpp/pull/12889

3. vLLM and the Llama 4 team found QK Norm was normalizing across entire Q & K which was wrong - accuracy increased by 2%. See https://github.com/vllm-project/vllm/pull/16311

If you see https://x.com/WolframRvnwlf/status/1909735579564331016 - the GGUFs I uploaded for Scout actually did better than inference providers by +~5% on MMLU Pro. https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-... has more details
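For readers wondering how a one-character epsilon change (bug 1 above) can move benchmark scores: eps sits inside the RMSNorm denominator, so for activations whose mean square is comparable to eps, the wrong value visibly shifts the normalized output. A minimal pure-Python sketch of the standard RMSNorm formula (not Meta's actual implementation, just the textbook definition):

```python
import math

def rms_norm(x, weight, eps):
    """Standard RMSNorm: weight * x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]

# For near-zero activations (mean square here is 1e-6, between the
# two candidate eps values), eps=1e-5 vs eps=1e-6 disagree by ~2.3x:
tiny = [1e-3] * 4
ones = [1.0] * 4
out_1e5 = rms_norm(tiny, ones, eps=1e-5)  # ≈ 0.3015 per element
out_1e6 = rms_norm(tiny, ones, eps=1e-6)  # ≈ 0.7071 per element
print(out_1e5[0], out_1e6[0])
```

For typical activations the difference is tiny, which is exactly why a wrong eps produces subtle degradation rather than obvious breakage.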

jychang•9mo ago
Do you think there are more bugs in Llama 4 at this time? Or have the bugs been patched, and the current version of llama.cpp + whatever the latest GGUF version is would be representative of the true performance of Llama 4?

I see you've uploaded new Maverick GGUF/safetensors files yesterday, along with a lot of other models like Deepseek R1, was there an issue with the older model files?

simonw•9mo ago
If they hadn't rushed out a release on a Saturday their launch partners might have had more time to iron out those bugs!
pixelesque•9mo ago
> This is a draft. Come back later for the final version.

There are quite a few issues with the content from a factual point-of-view (several sibling comments mention things): could have done with a lot more proof-reading and research I think.

elaus•9mo ago
I don't understand the rationale for publishing an article in such an early draft stage, even with the small disclaimer at the top. It would make sense if only a bit of polish were missing, but when there are factual errors (that are not marked as such), it seems much better to delay publication until the content is correct.
NanoYohaneTSU•9mo ago
Reminder that 1 year ago, AI tech bronies were saying that AI is only going to improve from here. It didn't. It stagnated because it's reached the peak of LLMs, as predicted.

And it still can't create images correctly, as in actual image creation, not woven pixels with tons of artifacts.

simonw•9mo ago
If you think LLMs haven't improved in the last 12 months you haven't been paying attention.

Image creation has been mostly a separate conversation, although GPT-4o images dramatically improved the state of the art for consistency in editing existing images just a few weeks ago.

simonw•9mo ago
The initial Llama 4 release is disappointing: the models are too big for most people to run, and not high quality enough to be worth running if you can afford the hardware.

I'm still optimistic for Llama 4.1 and 4.2.

Llama 3 got exciting at the 3.2 and 3.3 stages: smaller models that were distilled from the big ones and ran on a laptop (or even a phone).

3.2 3B and 3.3 70B were really interesting models.

I'm hopeful that we will get a Llama 4 ~25B, since that seems to be a sweet spot for laptop models right now - Gemma 3 27B and Mistral Small 3.1 (24B) are both fantastic.

bambax•9mo ago
> Anyway, on Saturday (!) May the 5th, Cinco de Mayo, Meta released Llama 4

Wat. We're still in April. Cinco de Abril.

lhl•9mo ago
While Llama 4 had a pretty bad launch (the LM Arena gaming in particular is terrible), having run my own evals on it (using the April 5 v0.8.3 vLLM release - https://blog.vllm.ai/2025/04/05/llama4.html, so before the QKNorm fix https://github.com/vllm-project/vllm/pull/16311) - it seemed pretty decent to me.

For English, on a combination of MixEval, LiveBench, IFEval, and EvalPlus, Maverick FP8 (17B/400B) was about on par with DeepSeek V3 FP8 (37B/671B), and Scout (17B/109B) was punching in the ballpark of Gemma 3 27B, but not too far off Llama 3.3 70B and Mistral Large 2411 (123B).

Llama 4 claimed to be trained on 10X more multilingual tokens than Llama 3, and testing on Japanese (including with some new, currently unreleased evals) the models did perform better than Llama 3, although I'd characterize their overall Japanese performance as "middle of the pack": https://shisa.ai/posts/llama4-japanese-performance/

I think a big part of the negative reaction is that in terms of memory footprint, Llama 4 looks more built for Meta (a large-scale inference provider) than for home users, although with the move to APUs and more efficient CPU offloading, there's still something to be said for strong capabilities at a 17B-active inference cost.

I think people are quick to forget that Llama 3, while not so disastrous, was much improved by 3.1. The competitive landscape is also pretty different now. And I think the visual capabilities are being a bit slept on, but that's probably also a case of releasing before the inference code was baked...
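To put rough numbers on the memory-footprint point: in a mixture-of-experts model, all experts must be resident in (V)RAM even though only a fraction of the parameters are active per token. A back-of-envelope sketch using the publicly stated Scout shape (17B active / 109B total) and assumed bytes-per-parameter for each precision (weights only; KV cache and activations ignored):

```python
# Rough MoE memory arithmetic. The 17B/109B split is Meta's published
# figure for Llama 4 Scout; bytes-per-param values are assumptions for
# FP16, FP8, and a ~4-bit quant.

def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the weights."""
    return params_billion * 1e9 * bytes_per_param / 2**30

total_b, active_b = 109, 17

for label, bpp in [("FP16", 2), ("FP8", 1), ("~4-bit", 0.5)]:
    print(f"{label:7} resident: {weights_gib(total_b, bpp):6.1f} GiB | "
          f"read per token: {weights_gib(active_b, bpp):5.1f} GiB")
```

The resident footprint (~203 GiB at FP16, ~101 GiB at FP8) is what rules out home hardware, while the per-token read (~32 GiB at FP16) is what makes the model fast for a provider like Meta; "it's only 17B" conflates the two.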

anonymousiam•9mo ago
[trying to confuse an android]

Spock: Logic is a little tweeting bird chirping in a meadow. Logic is a wreath of pretty flowers which smell bad. Are you sure your circuits are registering correctly? Your ears are green.

https://www.imdb.com/title/tt0708432/quotes/?item=qt0406609