frontpage.

Omarchy First Impressions

https://brianlovin.com/writing/omarchy-first-impressions-CEEstJk
1•tosh•1m ago•0 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
1•onurkanbkrc•1m ago•0 comments

Show HN: Versor – The "Unbending" Paradigm for Geometric Deep Learning

https://github.com/Concode0/Versor
1•concode0•2m ago•1 comments

Show HN: HypothesisHub – An open API where AI agents collaborate on medical res

https://medresearch-ai.org/hypotheses-hub/
1•panossk•5m ago•0 comments

Big Tech vs. OpenClaw

https://www.jakequist.com/thoughts/big-tech-vs-openclaw/
1•headalgorithm•8m ago•0 comments

Anofox Forecast

https://anofox.com/docs/forecast/
1•marklit•8m ago•0 comments

Ask HN: How do you figure out where data lives across 100 microservices?

1•doodledood•8m ago•0 comments

Motus: A Unified Latent Action World Model

https://arxiv.org/abs/2512.13030
1•mnming•8m ago•0 comments

Rotten Tomatoes Desperately Claims 'Impossible' Rating for 'Melania' Is Real

https://www.thedailybeast.com/obsessed/rotten-tomatoes-desperately-claims-impossible-rating-for-m...
2•juujian•10m ago•1 comments

The protein denitrosylase SCoR2 regulates lipogenesis and fat storage [pdf]

https://www.science.org/doi/10.1126/scisignal.adv0660
1•thunderbong•12m ago•0 comments

Los Alamos Primer

https://blog.szczepan.org/blog/los-alamos-primer/
1•alkyon•14m ago•0 comments

NewASM Virtual Machine

https://github.com/bracesoftware/newasm
1•DEntisT_•16m ago•0 comments

Terminal-Bench 2.0 Leaderboard

https://www.tbench.ai/leaderboard/terminal-bench/2.0
2•tosh•17m ago•0 comments

I vibe coded a BBS bank with a real working ledger

https://mini-ledger.exe.xyz/
1•simonvc•17m ago•1 comments

The Path to Mojo 1.0

https://www.modular.com/blog/the-path-to-mojo-1-0
1•tosh•20m ago•0 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
4•sakanakana00•23m ago•0 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•25m ago•0 comments

Hot Reloading in Rust? Subsecond and Dioxus to the Rescue

https://codethoughts.io/posts/2026-02-07-rust-hot-reloading/
3•Tehnix•26m ago•1 comments

Skim – vibe review your PRs

https://github.com/Haizzz/skim
2•haizzz•27m ago•1 comments

Show HN: Open-source AI assistant for interview reasoning

https://github.com/evinjohnn/natively-cluely-ai-assistant
4•Nive11•28m ago•6 comments

Tech Edge: A Living Playbook for America's Technology Long Game

https://csis-website-prod.s3.amazonaws.com/s3fs-public/2026-01/260120_EST_Tech_Edge_0.pdf?Version...
2•hunglee2•31m ago•0 comments

Golden Cross vs. Death Cross: Crypto Trading Guide

https://chartscout.io/golden-cross-vs-death-cross-crypto-trading-guide
3•chartscout•34m ago•0 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
3•AlexeyBrin•37m ago•0 comments

What the longevity experts don't tell you

https://machielreyneke.com/blog/longevity-lessons/
2•machielrey•38m ago•1 comments

Monzo wrongly denied refunds to fraud and scam victims

https://www.theguardian.com/money/2026/feb/07/monzo-natwest-hsbc-refunds-fraud-scam-fos-ombudsman
3•tablets•43m ago•1 comments

They were drawn to Korea with dreams of K-pop stardom – but then let down

https://www.bbc.com/news/articles/cvgnq9rwyqno
2•breve•45m ago•0 comments

Show HN: AI-Powered Merchant Intelligence

https://nodee.co
1•jjkirsch•47m ago•0 comments

Bash parallel tasks and error handling

https://github.com/themattrix/bash-concurrent
2•pastage•47m ago•0 comments

Let's compile Quake like it's 1997

https://fabiensanglard.net/compile_like_1997/index.html
2•billiob•48m ago•0 comments

Reverse Engineering Medium.com's Editor: How Copy, Paste, and Images Work

https://app.writtte.com/read/gP0H6W5
2•birdculture•54m ago•0 comments

OpenAI Leaks 120B Open Model on Hugging Face

https://twitter.com/main_horse/status/1951201925778776530
140•skadamat•6mo ago

Comments

yieldcrv•6mo ago
*uploads
ipsum2•6mo ago
Accidentally reveals.
yieldcrv•6mo ago
accidentally on purpose
vntok•6mo ago
Why not mere ineptitude?
seydor•6mo ago
But with an NDA so that leaks can be legit
kristianp•6mo ago
What are the chances someone under NDA has leaked the URL to the xeeter in question?
44za12•6mo ago
I don’t get the hype with OpenAI OSS: they would never make a model better than their proprietary models open source, and the other open source models already beat GPT and family, so why the wait?
jphoward•6mo ago
I think they could release non-agentic models that are as good as 4o, and have almost no repercussions on sales tbh.

I have Ollama installed (only a small proportion of their clients would have a large enough GPU for this) and have downloaded DeepSeek and played with it, but I still pay for an OpenAI subscription because I want the speed of a hosted model, never mind the luxuries of things like Codex's diffs/pull request support, agents on new models, deep research, etc. - I use them all at least weekly.

44za12•6mo ago
I pay for Cursor, OpenAI and Kimi (to use with Claude Code). OpenAI is good at quickly refining my thoughts. I'm considering cancelling my Cursor subscription: I bought it for Claude, but the rate limits are making it impossible for me to find it useful. Kimi is what truly surprises me: Claude Code shows "this conversation cost you $500" (based on Opus usage, which is mapped to Kimi K2) while I've barely spent $2. I have Ollama as well, mainly to quickly test small models that could be improved for our use case through finetuning.
garciasn•6mo ago
What am I doing wrong that I'm never hitting the rate limits on the $100 Max plan?
Topfi•6mo ago
Considering that my own heavy use also doesn't lead to rate limits, and what I've seen from some users over the past months, I suspect it's a mix of actually thinking about your code before writing a prompt and managing context by documenting and running stuff like git, npm install, etc. yourself instead of "Hey Claude, set up React with Radix and install a few packages". I have genuinely seen someone use ultrathink to set up a starter repo hosted on GitHub, despite the commands being listed in the readme, so I can see how certain people hit the limits quicker than others. Still, I will cancel my Claude Max subscription if they remain opaque about how much use we actually get, especially given the mail they sent out recently stating that 20x Max users only get 10x in terms of expected usable hours. The same goes for still not providing an official way to track how much use one has left in a week.
garciasn•6mo ago
> running stuff like git, npm install, etc. yourself

Ah; this definitely makes sense! I do this myself and then paste back only the relevant part of the log so as to limit this. I suspect I am being more conservative than others.

44za12•6mo ago
I am on the Pro plan. I was considering Max, but then I found Kimi and I'm getting used to it.
nico•6mo ago
Are you using kimi with Claude Code? Are you using it via OpenRouter?
44za12•6mo ago
With Claude Code, and Kilo as well. I'm using Moonshot's API.
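
(For anyone wanting to try that route: Moonshot exposes an OpenAI-compatible endpoint, so a minimal Python sketch looks something like the following. The base URL, model name, and environment variable are assumptions; check Moonshot's docs for the current values.)

  # Sketch: calling Kimi K2 through Moonshot's OpenAI-compatible API.
  # Endpoint URL, model name and env var are assumptions, not verified values.
  import os
  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.moonshot.ai/v1",       # assumed endpoint
      api_key=os.environ["MOONSHOT_API_KEY"],      # hypothetical env var
  )

  resp = client.chat.completions.create(
      model="kimi-k2-0711-preview",                # assumed model id
      messages=[{"role": "user", "content": "Summarize this diff in one line."}],
      temperature=0.3,
  )
  print(resp.choices[0].message.content)
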
nico•6mo ago
Thank you

Are you using a proxy to connect Claude code to Kimi?

And how much do you estimate it would cost in a month of daily usage?

ewoodrich•6mo ago
I've been using Kimi with Roo via OpenRouter and have been very surprised at how capable it is. It's the first open model I've tried that actually lives up to the claims I see online that it's on par with this or that previous-gen proprietary model. The context window has been the only negative, at least with the providers OpenRouter has been giving me, but forgivable given how absurdly cheap it is.
nico•6mo ago
> but forgivable given how absurdly cheap it is

Are you using it everyday for programming? If so, how much more or less does it cost you per month? More or less than $100?

vineyardmike•6mo ago
They would definitely have sales repercussions, but it might be worth it.

They are fully trying to be a consumer product, developer services be damned. But they can’t just get rid of the API because it’s a good incremental source of revenue, and thanks to the Microsoft deal, all that revenue would end up in Azure. Maintaining their API is basically just a way to get a slice of that revenue.

But if they open sourced everything, it might sour the relationship more with Microsoft, who would lose azure revenue and might be willing to part ways. It would also ensure that they compete on consumer product quality not (directly) model quality. At this point, they could basically put any decent model in their app and maintain the user base, they don’t actually need to develop their own.

Topfi•6mo ago
Pure performance isn't necessarily everything. Context window, speed and local use are just some of the upsides this model may have. We still know next to nothing, so anything is possible, but if it is an MoE at 120B, that could enable some interesting local use cases, even if it is less capable than e.g. DeepSeek V3, simply by running on more hardware/at higher tokens/sec. GPT-4.1's code focus has also shown that OpenAI does have a knack for models with a narrower use case; maybe this will do well in specific tasks. Especially since GPT-4.1 was that much better than the massive GPT-4.5, I am cautiously optimistic.

Even if it does poorly in all areas (like Llama 4 [0]), there is still a lot the community and industry can learn from even an uncompetitive model.

[0] Llama 4 technically has a massive 10M token context as a differentiator, however in my experience, it is not reliably usable beyond 100k.

jstummbillig•6mo ago
I don't see how it would be in OpenAI's selfish interest to release an open source model that sucks. Unless you can coherently explain how that would work in their favor, it seems a lot smarter to assume that they won't.
granitepail•6mo ago
While the benchmarks all say open source models Kimi and Qwen outpace proprietary models like GPT 4.1, GPT 4o, or even o3, my (and just about everyone I know's) boots on the ground experience suggests they're not even close. This is for tool calling agentic tasks, like coding, but also in other contexts (research, glue between services, etc). I feel like it's worth putting that out there--it's pretty clear there's a lot of benchmark hacking happening. I'm not really convinced it's purposeful/deceitful, but it's definitely happening. Qwen3 Coder, for example, is basically incompetent for any real coding tasks and frequently gets caught in death spirals of bad tool calls. I try all the OSS models regularly, because I'm really excited for them to get better. Right now Kimi K2 is the most usable one, and I'd rate it at a few ticks worse than GPT 4.1.
jimbo808•6mo ago
I would have assumed anyone frequenting HN would have figured out by now that benchmarks are 100% bullshit. I guess I'd be wrong.
dist-epoch•6mo ago
So what do you propose? Gut feel, N=1 tests?
spullara•6mo ago
It currently beats them, depending on the benchmark.
BoorishBears•6mo ago
I mean, in other environments people say that.

If you asked "What's the best bicycle?", most enthusiasts would say one you've tried, one that works for your use case, etc.

Benchmarks should be for pruning models you try, at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public set, generate a ton of synthetic examples, train on those, repeat).

int_19h•6mo ago
At the moment, the only way you can tell if the model is good for a particular task is by trying it at that task. Gut feel is how you pick the models to test first, and that is also based largely on past experience and educated guesses as to what strengths translate between tasks.

You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.

sebzim4500•6mo ago
To some extent there must be a free lunch, because today's 30B models are enormously better than the 30B models that existed a year ago.

I suppose it's an open question whether there is another free lunch or whether the 30B models in a year will be not much better than our current ones.

andrewmcwatters•6mo ago
I think anyone frequenting HN and actually using these tools absolutely knows these benchmarks are 100% bullshit and the only real way to test these things is to just use them yourself.

Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.

daft_pink•6mo ago
Isn’t the problem with the benchmarks that most people running AI locally are running much smaller weights?

I have an M4 Studio with a lot of unified memory and I’m still nowhere near running a 120B model; I’m at around 30B.

Apple or Nvidia are going to have to sell 1.5 TB RAM machines before benchmark performance is going to be comparable.

Plus, when you use Claude or OpenAI these days, it’s performing Google searches etc. that my local model isn’t doing.

BoorishBears•6mo ago
No, I've deployed a lot of open weight models, and the gap with closed source is there even at larger sizes.

I'm running a 400B parameter model at FP8 and it still took a lot of post-training to get even somewhat comparable performance.

-

I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.

GPT-3.5 has better world knowledge than some 70B models, and even a few larger ones.

daft_pink•6mo ago
you're killing my dream of blowing $50-100k on a desktop supercomputer next year and being able to do everything locally ;)

"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.

meaydinli•6mo ago
Take a look at: https://www.nvidia.com/en-us/products/workstations/dgx-spark... . IIRC, it was about $4K.
phonon•6mo ago
An M4 Max has twice the memory bandwidth (which is typically the limiting factor).
BoorishBears•6mo ago
I'll say neither of them will do anything for you if you're currently using SOTA closed models in anger and expect that performance to hold.

I'm on a 128GB M4 Max, and running models locally is a curiosity at best given the relative performance.

phonon•6mo ago
It will be sort of decent on a 4bit 70B parameter model, like here https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M). But yeah, not great.
daft_pink•6mo ago
I'm running an M4 Max as well, and I found that Project Goose works decently well with Qwen3 Coder loaded in LM Studio (Ollama doesn't do MLX yet unless you build it yourself, I think) and configured as an OpenAI model, since the API is compatible. Goose adds a bunch of tools and plugins that make the model more effective.
PeterStuer•6mo ago
Given that for a non-quantized 700B monolithic model with, let's say, a 1M token context you would need around 20TB of memory, I doubt your Spark or M4 will get very far.

I'm not saying those machines can't be useful or fun, but they're not in the range of the 'fantasy' thing you're responding to.
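
(To make that kind of estimate concrete, here is a back-of-the-envelope sketch. The layer and head counts are invented purely for illustration, not the config of any real 700B model; with these numbers you land around 11 TB for weights plus a 1M-token KV cache, the same order of magnitude as the figure above.)

  # Back-of-the-envelope memory for serving a large dense model at FP16.
  # Architecture numbers are hypothetical, chosen only for illustration.
  def weights_bytes(n_params, bytes_per_param=2):            # FP16/BF16
      return n_params * bytes_per_param

  def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens, bytes_per_elem=2):
      # 2x for the separate key and value tensors at every layer
      return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens

  TB = 1024 ** 4
  w = weights_bytes(700e9)
  kv = kv_cache_bytes(n_layers=160, n_kv_heads=128, head_dim=128,
                      tokens=1_000_000)                      # no GQA assumed
  print(f"weights ~{w / TB:.1f} TB, with 1M-token KV cache ~{(w + kv) / TB:.1f} TB")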

daft_pink•6mo ago
I regularly use Gemini CLI and Claude Code, and I'm convinced that Gemini's enormous context window isn't that helpful in many situations. I think the more you put into context, the more likely the model is to go off on a tangent, and you end up with "context rot" or it gets confused and starts working on an older, no-longer-relevant context. You definitely need to manage and clear your context window, and the only time I would want such a large context window is when the source data is really that large.
PeterStuer•6mo ago
Context quality and relevance are indeed a major factor. But large size is not the core issue, although with an unmaintained or low-relevance context, a smaller window is going to blissfully forget the bad, and the good, sooner.
laardaninst•6mo ago
The big "frontier" models are expert systems built on top of the LLM. That's the reason for the massive payouts to scientists. It's not about some ML secret sauce, it's about all the symbolic logic they bring to the table.

Without constantly refreshing the underlying LLM and the expert system layer, these models would be outdated in months. Language and underlying reality would shift from under their representations and they would rot quick.

That's my reasoning for considering this a bubble. There has been zero indication that the R&D can be frozen. They are stuck burning increasing amouts of cash for as long as they want these models to be relevant and useful.

refulgentis•6mo ago
I'm so darn confused about local LLMs and M-series inference speed; the perf jump from M2 Max to M4 Max was negligible, 10-20% (both times an MBP with 64 GB and max GPU cores).
PeterStuer•6mo ago
Does your inference framework target the NPU or just GPU/CPU?
refulgentis•6mo ago
It's linking llama.cpp and using Metal, so I presume GPU/CPU only.

I'm more than a bit overwhelmed with what I've got on my plate and have completely missed the boat on, e.g., understanding what MLX is; really curious for a thought dump if you have some opinionated experience/thoughts here. (E.g., it never crossed my mind until now that you might get better results on the NPU than the GPU.)

PeterStuer•6mo ago
LM Studio seems to have MLX support on Apple silicon, so you could quickly get a feel for whether it helps in your case: https://github.com/lmstudio-ai/mlx-engine
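
(If you want to poke at MLX without LM Studio in the middle, a minimal sketch with the mlx-lm package looks roughly like this; the model repo name is just an example of an MLX-converted checkpoint, and API details may differ by version.)

  # Minimal sketch: running an MLX-converted model with the mlx-lm package
  # on Apple silicon. The repo name below is an example, not a recommendation.
  from mlx_lm import load, generate

  model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
  text = generate(
      model,
      tokenizer,
      prompt="Explain mixture-of-experts routing in two sentences.",
      max_tokens=128,
  )
  print(text)
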
granitepail•6mo ago
In my case, I’m paying for inference on the original models from e.g. Fireworks, so it’s not a quantization problem. The Qwen3 I was using was the new 458B (I think that’s the size?) model that was their top performer for code.

I agree with other comments that there are productive uses for them. Just not on the scale of o4-mini/o3/Claude 4 Sonnet/Opus.

So IMO, larger open-weights models from big US labs are a big deal! Glad to see it. Gemma models, for example, are great for their size. They’re just quite small.

n_kr•6mo ago
It may be the way I use it, but Qwen3 Coder (30B with Ollama) is actually helping me with real-world tasks. It's a bit worse than the big models for the way I use it, but absolutely useful. I do use AI tools with very specific instructions, though: file paths, line numbers if I can, specific direction about what to do, my own tools, etc., so that may be why I don't see such a huge difference from the big models.

I should try Kimi K2 too.

refulgentis•6mo ago
You'll see good results; Kimi is basically a microdosing Sonnet, lol. Very, very reliable tool calls, but because it's microdosing, you don't wanna use it for implementing OAuth, more for adding comments or strict direction (i.e. a series of text mutations).
Art9681•6mo ago
It has everything to do with the way you use it. And the biggest difference is how fast the model/service can process context. Everything is context. It's the difference between you iterating on an LLM boosted goal for an hour vs 5 minutes. If your workflow involves chatting with an LLM and manually passing chunks, and manually retrieving that response, and manually inserting it, and manually testing....

You get the picture. Sure, even last year's local LLM will do well in capable hands in that scenario.

Now try pushing over 100,000 tokens in a single call, every call, in an automated process. I'm talking the type of workflows where you push over a million tokens in a few minutes, over several steps.

That's where the moat, no, the chasm, between local setups and a public API lies.

No one who does serious work "chats" with an LLM. They trigger workflows where "agents" chew on a complex problem for several minutes.

That's where local models fold.

torginus•6mo ago
Not sure about benchmarks, but I did use DeepSeek when it was novel and cool for a variety of tasks before going back to Claude, and in my experience it was OK: not significantly worse than the closed stuff at the time for what I use these models for (writing code a few small functions at a time, learning about libraries, etc.).
lossolo•6mo ago
While that's true for some open source models, I find DeepSeek R1 685B 0528 to be competitive with o3 in my production tests; I've been using it interchangeably for tasks I used to handle with Opus or o3.
rdtsc•6mo ago
> they would never make a model better than their proprietary models open source

Not their proprietary model, but maybe other open source models, or closed source models of their competitors. That way they can first ensure they are the only player on both sides, and then can kneecap their open source models just enough to drive the revenue to their proprietary one.

44za12•6mo ago
Making a model better than other open source models would in fact be making a model better than their closed source models, if you believe the benchmarks.
rdtsc•6mo ago
Realistically, yeah, but if they drank their own Kool-Aid, they'd think they have the best proprietary and open source models. So capturing both sides of the ecosystem would make sense.
PeterStuer•6mo ago
They might not release a better model than their proprietary ones, but others can build on and tinker with these open models to improve and specialize them.

Another reason people are 'hyped' for open models is that access to them cannot be taken away or price-gouged at the whim of the provider, and their use cannot be restricted in arbitrary ways, although I'm sure that on the latter part they will have a go at it through regulation.

Grab 'em while you can.

Nerd_Nest•6mo ago
Whoa, 120B? That’s huge.
qeternity•6mo ago
120B MoE. The 20B is dense.

As far as dense models go, it’s larger than many, but Mistral has released multiple 120B dense models, not to mention Llama 3 405B.

sciencesama•6mo ago
How much RAM do you need to run this?!
cubefox•6mo ago
Probably about one byte per weight (parameter) plus a bit extra for the key-value cache (depends on the size of the context window).
int_19h•6mo ago
You can go below one byte per parameter. 4-bit quantization is fairly popular. It does affect quality - for some models more so than others - but, generally speaking, a 4-bit quantized model is still going to do significantly better than an 8-bit model with half the parameters.
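
(As a quick sanity check on those rules of thumb, here is the weight memory alone for a 120B-parameter model at a few common precisions, ignoring the KV cache and runtime overhead.)

  # Rule-of-thumb weight memory for a 120B-parameter model at common precisions.
  # Ignores the KV cache and any runtime overhead.
  GB = 1024 ** 3

  def weight_gb(n_params, bits_per_param):
      return n_params * bits_per_param / 8 / GB

  for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
      print(f"{label}: ~{weight_gb(120e9, bits):.0f} GB")
  # FP16: ~224 GB, INT8: ~112 GB, INT4: ~56 GB
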
nivvis•6mo ago
For posterity: it has since been shown that it is actually an MoE.

> 21B parameters with 3.6B active parameters

arnaudsm•6mo ago
Who's the target of 120B open-weights models? You can only run this in the cloud; is it just PR?

I wish they'd release a nano model for local hackers instead.

xandrius•6mo ago
For people who run stuff on the cloud?
kccqzy•6mo ago
They are probably hoping that someone else will distill it into smaller models, much like DeepSeek released a giant 671B model but there are useful distillations down to 30B.
oldge•6mo ago
A model this size is trivial to run on a modern workstation.
dmonitor•6mo ago
You'll have to define "modern workstation" for me, because I was under the impression that unless you've purpose-built your machine to run LLMs, a model this size is impossible.
wincy•6mo ago
You can run a 4-bit quantized 120B model on a 96GB workstation card, the Blackwell Pro workstation card, which goes for $7,500. Considering the 5090 is bought by gamers for $3,300, it's definitely attainable, even though it's obviously expensive.

I’m running a gaming rig and could swap one in right now without having to change anything compared to my 5090, so no $5,000 Threadripper or $1,000 HEDT motherboard with a ton of RAM slots, just a 1000 watt PSU and a dream.

Mars008•6mo ago
> 4 bit quantized 120B model on a 96GB workstation card, the Blackwell Pro workstation

Would be interesting to know how it performs in terms of quality and token/sec.

0x457•6mo ago
When people say "modern workstation" in the context of LLMs, they usually mean consumer (prosumer?) grade hardware in a single machine, as opposed to racks of GPUs that you can't even buy as a mere mortal (minimum order sizes).

It doesn't mean you can grab your work laptop from 5 years ago and run it there.

int_19h•6mo ago
Get a Mac Studio with however much memory you need, and ideally an Ultra chip (for max memory bandwidth), and there's your workstation. I regularly run quantized 100B+ models on my M1 Ultra with 128 GB RAM.
152334H•6mo ago
They have a 20b for GPU poors, too.

I will be running the 120B on my 2x4090-48GB, though.

segmondy•6mo ago
You can run it locally too. Below are a few of my local models; this one is coming in light compared to them. At Q4 it's ~60GB. Furthermore, being an MoE, most of it can sit in system memory and only the shared experts need to go to the GPU; provided you have a decent system with decent memory bandwidth, you can get decent performance. I'm running on GPUs; folks with Apple hardware can run this with minimal effort if they have enough RAM.

  126G /llmzoo/models/Qwen3-235B-InstructQ4
  126G /llmzoo/models/Qwen3-235B-ThinkingQ4
  189G /llmzoo/models/Qwen3-235B-InstructQ6
  219G /llmzoo/models/glm-4.5-air
  240G /llmzoo/models/Ernie
  257G /llmzoo/models/Qwen3-Coder-480B
  276G /llmzoo/models/DeepSeek-R1-0528-UD-Q3_K_XL.b.gguf
  276G /llmzoo/models/DeepSeek-TNG
  276G /llmzoo/models/DeepSeek-V3-0324-UD-Q3_K_XL.gguf
  422G /llmzoo/models/KimiK2
jlokier•6mo ago
You can run models the size of this one locally, even on a laptop; it's just not a great experience compared with an optimised cloud service. But it is local.

The size in bytes of this 120B model is about 65 GB according to the screenshot, and elsewhere it's said to be trained in FP4, which matches.

That makes this model small enough to run locally on some laptops without reading from SSD.

The Apple M2 Max 96GB from January 2023, which is two generations old now, has enough GPU-capable RAM to handle it, albeit slowly. Any PC with 96 GB of RAM can run it on the CPU, probably more slowly. Even a PC with less than 64 GB of RAM can run it but it will be much slower due to having to read from the SSD constantly.

If it's an MoE with about 20B active parameters, it will read about one fifth of the data per token, making it about 5x faster than a 120B FP4 non-MoE would be, but it still needs all the data readily available across tokens.

Alternatively, someone can distill and/or quantize the model themselves to make a smaller model. These things can be done locally, even on a CPU if necessary if you don't mind how long it takes to produce the smaller model. Or on a cloud machine rented long enough to make the smaller model, which you can then run locally.
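
(A rough way to see where that "about 5x faster" figure comes from: at small batch sizes, decode speed is mostly bound by how many bytes of weights you stream per token, so an upper-bound sketch looks like this. The bandwidth figures are ballpark numbers for illustration, and real throughput will be lower once compute, KV-cache reads and overhead are included.)

  # Upper-bound decode speed when memory bandwidth is the bottleneck.
  # Bandwidth numbers are rough illustrations, not measured specs.
  def tokens_per_sec(active_params, bits_per_param, bandwidth_gb_s):
      bytes_per_token = active_params * bits_per_param / 8   # weights streamed per token
      return bandwidth_gb_s * 1e9 / bytes_per_token

  for name, bw in [("typical laptop, ~100 GB/s", 100), ("M2 Max, ~400 GB/s", 400)]:
      dense = tokens_per_sec(120e9, 4, bw)    # 120B dense at FP4
      moe = tokens_per_sec(20e9, 4, bw)       # MoE with ~20B active at FP4
      print(f"{name}: dense ~{dense:.1f} tok/s, MoE ~{moe:.1f} tok/s")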

m_ke•6mo ago
It would be interesting if this were a coding-focused model optimized for Mac inference. It would be a great way to undercut Anthropic.

Pretty much give away a Sonnet-level coding model and have it work with GPT-5 for harder tasks / planning.

CharlesW•6mo ago
Out of curiosity, have you tried running Qwen3 Coder 30B locally? https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-...
stavros•6mo ago
Not the GP, but I haven't, how is it? I use Claude Code with Sonnet, does Qwen3 compare?
CharlesW•6mo ago
I'm also using Claude Code and am very familiar with it, but haven't had a chance to try Qwen3 Coder 30B A3B for any real-world development. That said, it did well with my "kick the tires" tests, and some reports show that it's comparable to Sonnet (at least before adding the various levels of 'think' directives):

https://llm-stats.com/models/compare/claude-3-7-sonnet-20250...

natas•6mo ago
Okay, so where do I download this now that it's been removed from Hugging Face?
Mars008•6mo ago
Finally OpenAI is about to open something, and nobody on HN is happy. Would be interesting to see it thinking.