We can say that 64GB addressable by a GPU is not exceptional when compared to 128GB, and that it still costs less than a month's pay for a FAANG engineer, but the fact that they aren't actually purchasable right now shows that it's not as easy as driving to Best Buy and grabbing one off the shelf.
If you're looking for the cheapest way into 64GB of unified memory, the Mac mini is available with an M4 Pro and 64GB at $1999.
So, truly, not "exceptional" unless you consider the price to be exorbitant (it's not, as evidenced by the long useful life of an M-series Mac).
The models have got so much better without me needing to upgrade my hardware.
No need to overly quantize our headlines.
"64GB M2 makes Space Invaders-- can be bought for under $xxxx"
I have this feeling with LLM-generated React frontends: they all look the same.
Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...
Trivially, humans don't emit something they don't know either. You don't spontaneously figure out Javascript from first principles, you put together your existing knowledge into new shapes.
Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times. Will it be put together from smaller fragments? Yes, this is called "experience" or if the fragments are small enough, "understanding".
Humans can explore what they don’t know. AIs can’t.
Based on my experience with present-day AIs, I personally wouldn't be surprised at all if you showed Gemini 2.5 Pro a video of an insect colony, asked it "Take a look at the way they organize and see if that gives you inspiration for an optimization algorithm", and it spat something interesting out.
I couldn't do that with an ant colony. I would have to train on ant research first.
(Oh, and AIs can absolutely explore what they don't know. Watch a Claude Code instance look at a new repository. Exploration is a convergent skill in long-horizon RL.)
Nothing ultimately matters in this business except the first couple of time derivatives.
Surely this is exactly what current AI do? Observe stuff and apply that observation? Isn't this the exact criticism, that they aren't inventing ant colonies from first principles without ever seeing one?
> Humans can explore what they don’t know. AIs can’t.
We only learned to decode Egyptian hieroglyphs because of the Rosetta Stone. There's no translation for North Sentinelese, the Voynich manuscript, or Linear A.
We're not magic.
I think most people writing software today are reinventing a wheel, even in corporate environments for internal tools. Everyone wants their own tweak or thinks their idea is unique and nobody wants to share code publicly, so everyone pays programmers to develop buggy bespoke custom versions of the same stuff that's been done 100 times before.
I guess what I'm saying is that your requirements are probably not new, and to the extent they are yes an LLM can fill in the blanks due to its fluency in languages.
There is no understanding, regardless of the wants of all the capital investors in this domain.
(I have much respect for what you have done and are currently doing, but you did walk right into that one)
That is a much, much bigger deal than you make it sound like.
Compression may, in fact, be all we need. For that matter, it may be all there is.
People really need to stop saying this. I get that it was the Smart Guy Thing To Say in 2023, but by this point it's pretty clear that it's not true in any way that matters for most practical purposes.
Coding LLMs have clearly been trained on conversations where a piece of code is shown, a transformation is requested (rewrite this from Python to Go), and then the transformed code is shown. It’s not that they’re just learning codebases, they’re learning what working with code looks like.
Thus you can ask an LLM to refactor a program in a language it has never seen, and it will “know” what refactoring means, because it has seen it done many times, and it will stand a good chance of doing the right thing.
That’s why they’re useful. They’re doing something way more sophisticated than just “recombining codebases from their training data”, and anyone chirping 2023 sound bites is going to miss that.
If they only recalled they wouldn’t “hallucinate”. What’s a lie if not an invention? So clearly they can come up with data that they weren’t trained on, for better or worse.
We went from chatgpt's "oh, look, it looks like python code but everything is wrong" to "here's a full stack boilerplate app that does what you asked and works in 0-shot" inside 2 years. That's the kicker. And the sauce isn't just in the training set, models now do post-training and RL and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (first models were 2/4k max), agentic stuff, and so on.
These kinds of comments are really missing the point.
Even then, when you start to build up complexity within a codebase, the results have often been worse than "I'll start generating it all from scratch again, and include this as an addition to the initial longtail specification prompt as well", and even then... it's been a crapshoot.
I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.
I am definitely at a point where I am more productive with it, but it took a bunch of effort.
If I didn't have an LLM to figure that out for me I wouldn't have done it at all.
That's what I've done with my ffmpeg LLM queries, anyway - can't speak for simonw!
• https://stackoverflow.com/questions/10957412/fastest-way-to-...
• https://superuser.com/questions/984850/linux-how-to-extract-...
• https://www.aleksandrhovhannisyan.com/notes/video-cli-cheat-...
• https://www.baeldung.com/linux/ffmpeg-extract-video-frames
• https://ottverse.com/extract-frames-using-ffmpeg-a-comprehen...
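For reference, the answers in those links (and the ones an LLM gives) all converge on roughly the same commands; the filenames and fps value below are just placeholders:

    # one frame per second from input.mp4, saved as numbered PNGs
    ffmpeg -i input.mp4 -vf fps=1 frame_%04d.png

    # a single frame grabbed at the 30-second mark
    ffmpeg -ss 00:00:30 -i input.mp4 -frames:v 1 thumbnail.png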
Search engines have been able to translate "vague natural language queries" into search results for a decade, now. This pre-existing infrastructure accounts for the vast majority of ChatGPT's apparent ability to find answers.
Not comparable and I fail to see why going through Google's ads/results would be better?
Meanwhile I've spent the past two years constantly building and implementing things I never would have done because of the reduction in friction LLM assistance gives me.
I wrote about this first two years ago - AI-enhanced development makes me more ambitious with my projects - https://simonwillison.net/2023/Mar/27/ai-enhanced-developmen... - when I realized I was hacking on things with tech like AppleScript and jq that I'd previously avoided.
It's hard to measure the productivity boost you get from "wouldn't have built that thing" to "actually built that thing".
It explains a lot about Django that the author is allergic to man pages lol
... on line 3,218: https://gist.github.com/simonw/6fc05ea7392c5fb8a5621d65e0ed0...
(I am very confident I am not the only person who has been deterred by ffmpeg's legendarily complex command-line interface. I feel no shame about this at all.)
But if you approach ffmpeg from the perspective of "I know this is possible", you are always correct, and can almost always reach the "how" in a handful of minutes.
Whether that's worth it or not, will vary. :)
*nix man pages are the same: if you already know which tool can solve your problem, they're easy to use. But you have to already have a shortlist of tools that can solve your problem, before you even know which man pages to read.
Sure, use the LLM to get over the initial hump. But ffmpeg's no exception to the rule that LLM's produce subpar code. It's worth spending a couple minutes reading the docs to understand what it did so you can do it better, and unassisted, next time.
If you're happy with results like that, sure, LLMs miss "a few tricks"...
But this does remind me of a previous co-worker. Wrote something to convert from a custom data store to a database, his version took 20 minutes on some inputs. Swore it couldn't possibly be improved. Obviously ridiculous because it didn't take 20 minutes to load from the old data store, nor to load from the new database. Over the next few hours of looking at very mediocre code, I realised it was doing an unnecessary O(n^2) check, confirmed with the CTO it wasn't business-critical, got rid of it, and the same conversion on the same data ran in something like 200ms.
Over a decade before LLMs.
But I keep being told “AI” is the second coming of Ahura Mazda so it shouldn’t do stuff like that right?
Niche reference, I like it.
But… I only hear of scammers who say, and psychosis sufferers who think, LLMs are *already* that competent.
Future AI? Sure, lots of sane-seeming people also think it could go far beyond us. Special purpose ones have in very narrow domains. But current LLMs are only good enough to be useful and potentially economically disruptive, they're not even close to wildly superhuman like Stockfish is.
ChatGPT will get better at chess over time. Stockfish will not get better at anything except chess. That's kind of a big difference.
Oddly, LLMs got worse at specifically chess: https://dynomight.net/chess/
But even to the general point, there's absolutely no agreement how much better the current architectures can ultimately get, nor how quickly they can get there.
Do they have potential for unbounded improvements, albeit at exponential cost for each linear incremental improvement? Or will they asymptotically approach someone with 5 years experience, 10 years experience, a lifetime of experience, or a higher level than any human?
If I had to bet, I'd say current models have asymptotic growth converging to a merely "ok" performance; and separately claim that even if they're actually unbounded with exponential cost for linear returns, we can't afford the training cost needed to make them act like someone with even just 6 years professional experience in any given subject.
Which is still a lot. Especially as it would be acting like it had about as much experience in every other subject at the same time. Just… not a literal Ahura Mazda.
(Shrug) People with actual money to spend are betting twelve figures that you're wrong.
Should be fun to watch it shake out from up here in the cheap seats.
For "pretty good", it would be worth 14 figures, over two years. The global GDP is 14 figures. Even if this only automated 10% of the economy, it pays for itself after a decade.
For "Ahura Mazda", it would easily be worth 16 figures, what with that being the principal God and god of the sky in Zoroastrianism, and the only reason it stops at 16 is the implausibility of people staying organised for longer to get it done.
The more I've used it, the more I've disliked how poor the results it's produced have been, and the more I've realised I would have been better served by doing it myself and following a methodical path for things that I didn't have experience with.
It's easier to step through a problem as I'm learning and making small changes than an LLM going "It's done, and production ready!" where it just straight up doesn't work for 101 different tiny reasons.
It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.
Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.
For a heavily explored space, it's like being impressed that your 2.5-year-old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.
I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with gpt3. The models have advanced from that, and can now "generalise" over the context in ways they could not do ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though it never saw that particular scripting language anywhere in its training set. Try it. You'll be surprised.
Showing off moderately complicated results that are actually not indicative of performance because they are sniped by the training data turns this from a cool demo to a parlor trick.
Stating that, aha, jokes on you, that's the status quo, is an even bigger indictment.
To show that LLMs actually can provide value for one-shot programming, you need to find a problem for which there's no fully working sample code available online. I'm not trying to say that LLMs couldn't do that. But just because an LLM can come up with a perfectly-working Space Invaders doesn't mean that it could do that.
That's the goal for these projects anyways. I don't know that its true or feasible. I find the RAG models much more interesting myself, I see the technology as having far more value in search than generation.
Rather than write some markov-chain reminiscent frankenstein function when I ask it how to solve a problem, I would like to see it direct me to the original sources it would use to build those tokens, so that I can see their implementations in context and use my judgement.
Sadly that's not feasible with transformer-based LLMs: those original sources are long gone by the time you actually get to use the model, scrambled a billion times into a trained set of weights.
One thing that helped me understand this is understanding that every single token output by an LLM is the result of a calculation that considers all X billion parameters that are baked into that model (or a subset of that in the case of MoE models, but it's still billions of floating point calculations for every token.)
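As rough back-of-the-envelope arithmetic (using the common ~2 FLOPs per active parameter per token approximation, so treat the numbers as order-of-magnitude only):

    2 FLOPs/param * 12e9 active params ≈ 24 billion floating point operations per token

So a MoE model with ~12B active parameters (roughly GLM-4.5 Air's active count) is still doing tens of billions of operations for every token it prints, even though its total parameter count is far larger.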
You can get an imitation of that if you tell the model "use your search tool and find example code for this problem and build new code based on that", but that's a pretty unconventional way to use a model. A key component of the value of these things is that they can spit out completely new code based on the statistical patterns they learned through training.
I tried to push for this type of model when an org I worked with over a decade ago was first exploring using the first generation of Tensorflow to drive customer service chatbots and was sadly ignored.
I totally get the value of RAG style patterns for information retrieval against factual information - for those I don't want the LLM to answer my question directly, I want it to run a search and show me a citation and directly quote a credible source as part of answering.
For code I just want code that works - I can test it myself to make sure it does what it's supposed to.
That is what you're doing already. You're just relying on a vector compression and search engine to hide it from you and hoping the output is what you expect, instead of having it direct you to where it remixed those snippets from so you can see how they work to start with and make sure it's properly implemented from the get-go.
We all want code that works, but understanding that code is a critical part of that for anything but a throw-away one time use script.
I don't really get this desire to replace critical thought with hoping and testing. It sounds like the pipe dream of a middle manager, not a tool for a programmer.
I'm going to review the code anyway, why would I not want to save myself some of the work? I can "see how they work" after the LLM gives them to me just fine.
If you instead have a set of sources related to your problem, they immediately come with context, usage and in many cases, developer notes and even change history to show you mistakes and adaptations.
You're ultimately creating more work for yourself* by trying to avoid work, and possibly ending up with an inferior solution in the process. Where is your sense of efficiency? Where is your pride as an intellectual?
* Yes, you are most likely creating more work for yourself even if you think you are capable of telling otherwise. [1]
1. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
I would encourage you to consider that even LLM-generated code can teach you a ton of useful new things.
Go read the source code for my dumb, zero-effort space invaders clone: https://github.com/simonw/tools/blob/main/space-invaders-GLM...
There's a bunch of useful lessons to be picked up even from that!
- Examples of CSS gradients, box shadows and flexbox layout
- CSS keyframe animation
- How to implement keyboard events in JavaScript
- A simple but effective pattern for game loops against a Canvas element, using requestAnimationFrame
- How to implement basic collision detection
If you've written games like this before these may not be new to you, but I found them pretty interesting.
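If the requestAnimationFrame game loop pattern is new to you, here's a minimal sketch of the general shape (placeholder update/draw functions and a generic canvas selector, not the actual generated code):

    // minimal sketch of a requestAnimationFrame game loop against a canvas
    const canvas = document.querySelector('canvas');
    const ctx = canvas.getContext('2d');

    function update(dt) { /* advance player, bullets, invaders by dt seconds */ }
    function draw(ctx) { /* render the current game state */ }

    let lastTime = 0;
    function loop(timestamp) {
      if (!lastTime) lastTime = timestamp;        // avoid a huge dt on the first frame
      const dt = (timestamp - lastTime) / 1000;   // seconds since the previous frame
      lastTime = timestamp;

      update(dt);
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      draw(ctx);
      requestAnimationFrame(loop);                // schedule the next frame
    }
    requestAnimationFrame(loop);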
He's using AI with note-taking apps for meetings to enhance notes and flesh out technology ideas at a higher level, then refining those ideas into working experiments.
It's actually impressive to see. My personal experience has been far more disappointing to say the least. I can't speak to the code quality, consistency or even structure in terms of most people being able to maintain such applications though. I've asked to shadow him through a few of his vibe coding sessions to see his workflow. It feels rather alien to me, again my experience is much more disappointing in having to correct AI errors.
I've worked with him off and on for years from simulating aircraft diagnostics hardware to incident command simulation and setting up core infrastructure for F100 learning management backends.
I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.
Asking them to produce novel output that doesn’t match the training set produces very different results.
When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.
It shows the reality of working with LLMs and it’s an important consideration.
https://www.web-leb.com/en/code/2108
Your "AI tools" are just "copyright whitewashing machines."
These kinds of comments are really ignoring reality.
I was not able to just download an 8-16GB file that would then be able to generate A LOT of different tools, games, etc. for me in multiple programming languages, while in parallel ELI5-ing research papers for me, generating SVGs, and a lot, lot, lot more.
But hey.
Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched in Nov 2022, the "best" open models were GPT-J (~6-7B) and GPT-NeoX (20B). I actually had an app running live, with users, using GPT-J for ~1 month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along) and so on.
And then something happened: LLaMA models got "leaked" (I still think it was an on-purpose leak - don't sue us, we never meant to release, etc.), and the rest is history. With L1 we got lots of optimisations like quantised models, fine-tuning and so on, L2 really saw fine-tuning go off (most of the fine-tunes were better than what Meta released), we got Alpaca showing off LoRA, and then a bunch of really strong models came out (Mistrals, Mixtrals, L3, Gemmas, Qwens, DeepSeeks, GLMs, Granites, etc.)
By some estimations the open models are ~6mo behind what SotA labs have released. (Note that doesn't mean the labs are releasing their best models; it's likely they keep those in house to use for the next runs' data curation, synthetic datasets, distilling, etc.) Being 6mo behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2 years to reach GPT-3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.
My own benchmark has a bunch of different tasks I use various local models for, and I run it when I wanna see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for what task.
They're being sold as general purpose things that are better/worse at everything but reality doesn't reflect this, they all have very specific tasks they're worse/better at, and the only way to find that out is by having a private benchmark you run yourself.
> what specific tasks is one performing better than the other?
That's exactly why you create your own benchmark, so you can figure that out by just having a list of models, instead of testing each individually and basing it on "feels better".
> Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.
From my notes here: https://simonwillison.net/2025/Jul/28/glm-45/
Am I missing something?
I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?
There are Colab Notebook tutorials around training models with it as well.
I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful.
I imagine with the finetunes you have to worry about self-hosting, model utilization, and then also retraining the model as new base models come out. I'm curious under what circumstances you've found that the benefits outweigh the downsides.
- (shameless plug) My company, Synthetic, supports LoRAs for Llama 3.1 8b and 70b: https://synthetic.new All you need to do is give us the Hugging Face repo and we take care of the rest. If you want other people to try your model, we charge usage to them rather than to you. (We can also host full finetunes of anything vLLM supports, although we charge by GPU-minute for full finetunes rather than the cheaper per-token pricing for supported base model LoRAs.)
- Together.ai supports a slightly wider number of base models than we do, with a bit more config required, and any usage is charged to you.
- Fireworks does the same as Together, although they quantize the models more heavily (FP4 for the higher-end models). However, they support Llama 4, which is pretty nice although fairly resource-intensive to train.
If you have reasonably good data for your task, and your task is relatively "narrow" (i.e. find a specific kind of bug, rather than general-purpose coding; extract a specific kind of data from legal documents rather than general-purpose reasoning about social and legal matters; etc), finetunes of even a very small model like an 8b will typically outperform — by a pretty wide margin — even very large SOTA models while being a lot cheaper to run. For example, if you find yourself hand-coding heuristics to fix some problem you're seeing with an LLM's responses, it's probably more robust to just train a small model finetune on the data and have the finetuned model fix the issues rather than writing hardcoded heuristics. On the other hand, no amount of finetuning will make an 8b model a better general-purpose coding agent than Claude 4 Sonnet.
So here's the original
https://web.archive.org/web/20231127123701/https://brev.dev/...
This one is very good in my opinion.
LM Studio (not exclusively, I'm sure) makes it a no-brainer to pick models that'll work on your hardware.
(*best performance/size ratio, generally if the model easily fits at q4 you're better off going to a higher parameter count than going for a larger quant, and vice versa)
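As a rough sizing rule (ignoring KV cache and runtime overhead, and using approximate bits-per-weight for the common GGUF quants): weight memory ≈ parameter count * bits per weight / 8. For example:

    30B params at q4 (~4.5 bits/weight): 30e9 * 4.5 / 8 ≈ 17 GB
    30B params at q8 (~8.5 bits/weight): 30e9 * 8.5 / 8 ≈ 32 GB
    13B params at q8: ≈ 14 GB

which is why, within the same memory budget, a bigger model at q4 usually beats a smaller one at q8.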
... or if you have Apple hardware with their unified memory, whatever the assholes soldered in is your limit.
Also, IANAL but Space Invaders is owned IP. I have no idea the legality of a blog post describing steps to create and release an existing game, but I've seen headlines on HN of engineers in trouble for things I would not expect to be problematic. Maybe Space Invaders is in Q-tip/Band-Aid territory at this point, but if this was Zelda instead of Space Invaders, I could see things being more dicey.
So is Tetris. And I believe that Snake is also an owned IP although I could be wrong on this one.
This isn't copyright infringement; it isn't based on the original assembly code or artwork. A game concept can't be copyrighted. Even if one of SI's game mechanics were patented, it would have long expired. Trade secret doesn't apply in this situation.
That leaves trademark. No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.
There may be no reasonable confusion, but trademark holders also have to protect against dilution of their brand, if they want to retain their trademark. With use like this, people might come to think of Space Invaders as a generic term for all games of this type, not the brand of a specific game.
(there is a strong case to be made that they already do, granted)
My only memory issue that I can remember is an OBS memory leak; otherwise these MBPs are incredible hardware. I wish any other company could actually deliver a comparable machine.
So a home workstation with 64GB+ of RAM could get similar results?
The neat thing about Apple Silicon is the system RAM is available to the GPU. On most other systems you would need ~48GB of VRAM.
https://www.reddit.com/r/GamingLaptops/comments/1akj5aw/what...
I personally want to run Linux and feel like I'll get a better price/GB offering that way. But it is confusing to know how local models will actually work on those machines and what the drawbacks of an iGPU are.
If you want things to run quickly, then aside from Macs, there's the 2025 ASUS Flow Z13 which (afaik) is the only laptop with AMD's new Ryzen AI Max+ 395 processor. This is powerful and has up to 128GB of RAM that can be shared with the GPU, but they're very rare (and Mac-expensive) at the moment.
The other variable for running LLMs quickly is memory bandwidth; the Max+ 395 has 256GB/s, which is similar to the M4 Pro; the M4 Max chips are considerably higher. Apple fell on their feet on this one.
Your 64GB workstation doesn't share the RAM with your GPU.
Similar in quality, but CPU generation will be slower than what Macs can do.
What you can do with MoEs (GLMs and Qwens) is to run some experts (the shared ones usually) on a GPU (even a 12GB/16GB will do) and the rest from RAM on CPU. That will speed things up considerably (especially prompt processing). If you're interested in this, look up llama.cpp and especially ik_llama, which is a fork dedicated to this kind of selective offloading of experts.
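As a rough idea of what that looks like in practice (flag names and the tensor-override pattern vary between llama.cpp and ik_llama versions, so treat this as a sketch and check --help for your build; the model filename is a placeholder):

    # offload everything to the GPU, then override the MoE expert tensors back to CPU/RAM
    ./llama-server -m GLM-4.5-Air-Q4_K_M.gguf \
        -ngl 99 \
        --override-tensor "exps=CPU" \
        -c 32768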
If you haven't seen that yourself yet I suggest firing up the free, no registration required GLM-4.5 Air on https://chat.z.ai/ and seeing if you can prove yourself wrong.
The impressive thing about these models is their ability to write working code, not their ability to come up with unique ideas. These LLMs actually can come up with unique ideas as well, though I think it’s more exciting that they can help people execute human ideas instead.
Either with SDL2+C, or even Tcl/Tk, or Python with Tkinter.
I'd like to see someone try to prove this. How many Space Invaders projects exist on the internet? It'd be hard to compare model "generated" code to everything out there looking for plagiarism, but I bet there are lots of snippets pulled in. These things are NOT smart, they are huge and articulate information repositories.
Based on my mental model of how these things work I'll be genuinely surprised if you can find even a few lines of code duplicated from one of those projects into the code that GLM-4.5 wrote for me.
animation: glow 2s ease-in-out infinite;
stuffed it verbatim into google and found a stack overflow discussion that contained this: animation: glow .5s infinite alternate;
in under one minute. Then I found this page of CSS effects: https://alvarotrigo.com/blog/animated-backgrounds-css/
Another page has examples and contains:
animation: float 15s infinite ease-in-out;
There is just too much internet to scan for an exact match or a match of larger size.

That's what I expect these things to do: they break down Space Invaders into the components they need to build, then mix and match thousands of different coding patterns (like "animation: glow 2s ease-in-out infinite;") to implement different aspects of that game.
You can see that in the "reasoning" trace here: https://gist.github.com/simonw/9f515c8e32fb791549aeb88304550... - "I'll use a modern design with smooth animations, particle effects, and a retro-futuristic aesthetic."
That code certainly looks similar, but I have trouble imagining how else you would implement very basic collision detection between a projectile and a player object in a game of this nature.
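(For anyone unfamiliar: axis-aligned bounding-box overlap between two rectangles pretty much has to look like the sketch below, whoever writes it. Names here are generic, not quoted from either codebase.)

    // generic AABB overlap test between two rectangles {x, y, width, height}
    function collides(a, b) {
      return a.x < b.x + b.width &&
             a.x + a.width > b.x &&
             a.y < b.y + b.height &&
             a.y + a.height > b.y;
    }
    // typical use in a game loop:
    // if (collides(enemyBullet, player)) { lives--; /* update UI, check game over */ }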
More importantly, it is not just the collision check that is similar. Almost the entire sequence of operations is identical on a higher level:
1. enemyBullet/player collision check
2. same comment "// Player hit!" (this is how I found the code)
3. remove enemy bullet from array
4. decrement lives
5. update lives UI
6. (createParticle only exists in JS code)
7. if lives are <= 0, gameOver
> find even a few lines of code duplicated from one of those projects
I'm pretty sure they meant multiple lines copied verbatim from a single project implementing space invaders, rather than individual lines copied (or likely just accidentally identical) across different unrelated projects.
That's how you write CSS. The examples aren't the same at all, they just use the same CSS feature.
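To spell it out: the animation shorthand only works together with a named @keyframes rule, so any page with a glow effect contains a line shaped exactly like that. A generic example (the class and rule names here are made up, not taken from either source):

    .glow-text {
      animation: glow 2s ease-in-out infinite;
    }

    @keyframes glow {
      0%, 100% { text-shadow: 0 0 5px #0f0; }
      50%      { text-shadow: 0 0 20px #0f0; }
    }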
It feels like you aren't a coder--you've sabotaged your own point.
Compressing a few petabytes into a few gigabytes requires that they can't be like this about all of the things they're accused of simply copy-pasting, from code to newspaper articles to novels. There's not enough space.
> Write an HTML and JavaScript page implementing space invaders
It may not be "copy pasting", but it's generating output that recreates, as best it can, what it learned from looking at Space Invaders source code during training.
The engineers at Taito that originally developed Space Invaders were not told "make Space Invaders" and then did their best to recall all the source code they've looked at in their life to re-type the source code to an existing game. From a logistics standpoint, where the source code already exists and is accessible, you may as well have copy-pasted it and fudged a few things around.
I used that prompt because it's the shortest possible prompt that tells the model to build a game with a specific set of features. If I wanted to build a custom game I would have had to write a prompt that was many paragraphs longer than that.
The aim of this piece isn't "OMG looks LLMs can build space invaders" - at this point that shouldn't be a surprise to anyone. What's interesting is that my laptop can run a model that is capable of that now.
Sure, but that doesn't impact the OP's point at all, because there are numerous copies of reverse engineered source code available.
There are numerous copies of the reverse engineered source code already translated to JavaScript in your models training set.
I'm afraid no one cared much about your point :)
You'll only get "OMG look how good LLMs are they'll get us all fired!" comments and "LLMs suck" comments.
This is how it goes with religion...
It doesn't really matter whether or not the original code was published. In fact that original source code on its own probably wouldn't be that useful, since I imagine it wouldn't have tipped the weights enough to be "recallable" from the model, not to mention it was tasked with implementing it in web technologies.
It's like using an LLM to implement a red black tree. Red black trees are in the training data, so you don't need to explain or describe what you mean beyond naming it.
"Real engineering" with LLMs usually requires a bunch of up front work creating specs and outlines and unit tests. "Context engineering"
Most people won't bother with buying powerful hardware for this, they will keep using SAAS solutions, so Anthropic can be in trouble if cheaper SAAS solutions come out.
Most DBs in the early days you had to pay for. There are still paid DBs that are just better than the ones you don't pay for. Some teams think that the cost is worth the improvements and there is a (tough) business there. Fortunes were made in the early days.
But eventually open source models became good enough for many use cases and they have their own advantages. So lots of teams use them.
I think coding models might have a similar trajectory.
My only feedback is: are these the same animal? Can we compare an open-source DB vs. a paid/closed DB to me running an LLM locally? The biggest issue right now with LLMs is simply the cost of the hardware to run one locally, not the quality of the actual software (the model).
[1] e.g. SQL Server Express is good enough for a lot of tasks, and I guess would be roughly equivalent to the upcoming open versions of GPT vs. the frontier version.
Not that many projects are doing fully self-hosted RDBMS at this point. So ultimately proprietary databases still win out, they just (ab)use the Postgresql trademark to make people think they're using open source.
LLMs might go the same way. The big clouds offering proprietary fine tunes of models given away by AI labs using investor money?
I dislike running local LLMs right now because I find the software kinda janky still, you often have to tweak settings, find the right model files. Basically have a bunch of domain knowledge I don't have space for in my head. On top of maintaining a high-spec piece of hardware and paying for the power costs.
All it takes is some large companies commoditizing their complements. For Linux it was Google, etc. For AI it's Meta and China.
The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.
That said, for something like this, I’d probably get more out of simply finding an existing implementation on github or the like and downloading that.
When it comes to specialized and narrow domains like Space Invaders, the training set is likely to be extremely small and the model's vector space will have limited room to generalize. You'll get code that is more or less identical to the original source and you also have to wait for it to 'type' the code and the value add seems very low. I would rather ask it to point me to known Space Invaders implementations in language X on github (or search there).
Note that ChatGPT gets very nervous if I put this into GPT to clean up the grammar. It wants very badly for me to stress that LLMs don't memorize and overfitting is very unlikely (I believe neither).
Where a Mac may beat the above is on the memory side, if a model requires more than 24/32 GB of GPU memory you are usually better off with a Mac with 64/128 GB of RAM. On a Mac the memory is shared between CPU and GPU, so the GPU can load larger models.
1. 2-4x 3090+ NVIDIA cards. Some are getting Chinese 48GB cards. There is a ceiling to VRAM that prevents the biggest models from being able to load, but most can run most quants at great speeds
2. Epyc servers running CPU inference with lots of RAM at as high of memory bandwidth as is available. With these setups people are getting like 5-10 t/s but are able to run 450B parameter models.
3. High RAM Macs with as much memory bandwidth as possible. They are the best balanced approach and surprisingly reasonable relative to other options.
It can really help you figure a ton of things out before you blow the cash on your own hardware.
0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662
I bought my RTX 4090 for about $2200. I also had the pleasure of being able to use it for gaming when I wasn't working. To be fair, the VRAM requirements for local models keeps climbing and my 4090 isn't able to run many of the latest LLMs. Also, I omitted cost of electricity for my local LLM server cost. I have not been measuring total watts consumed by just that machine.
One nice thing about renting is that it give you flexibility in terms of what you want to try.
If you're really looking for the best deals look at 3rd party hosts serving open models for the API-based pricing, or honestly a Claude subscription can easily be worth it if you use LLMs a fair bit.
2. I basically agree with your caveats - excluding electricity is a pretty big exclusion and I don't think that you've had 3 years of really high-value self-hostable models, I would really only say the last year and I'm somewhat skeptical of how good for ones that can be hosted in 24gb vram. 4x4090 is a different story.
FWIW GPU aside, my PC isn't particularly new - it is a 5-6 year old PC that was the cheapest money could buy originally and became "decent" at the time I upgraded it ~5 years ago, and I only added the GPU around Christmas as prices were dropping since AMD was about to release the new GPUs.
[0] https://i.imgur.com/FevOm0o.png
[1] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...
Looking forward to trying this with Aider.
Surely this must exist, no? I want to generate a local leaderboard and perhaps write new test cases.
Definitely a ton of things I learned about how to "develop" "with" AI along the way.
My MacBook has 16GB of RAM and it is from a period when everyone was fiercely insisting that the 8GB base model was all I'd ever need.
I would hope an LLM could spit out a cobbled form of answer to a common interview question.
Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why did they not just pipe the JSON into our already working app that displays this data?
People around me for the most part are using LLMs to enhance their presentations, not to actually implement anything useful. I have been watching my coworkers use it that way for months.
Another example? A different coworker wanted to build a document macro to perform bulk updates on courseware content. Swapping old words for new words. To build the macro they first wrote a rubric to prompt an LLM correctly inside of a Word doc.
That filled rubric is then used to generate a program template for the macro. To define the requirements for the macro the coworker then used a slideshow slide to list bullet points of functionality, in this case to Find+Replace words in courseware slides/documents using a list of words from another text document. Due to the complexity of the system, I can’t believe my colleague saved any time. The presentation was interesting though and that is what they got compliments on.
However the solutions are absolutely useless for anyone else but the implementer.
If I'm writing code for production systems using LLMs I still review every single line - my personal rule is I need to be able to explain how it works to someone else before I'm willing to commit it.
I wrote a whole lot more about my approach to using LLMs to help write "real" code here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
Also the 'if' doesn't negate anything as they say "I still", meaning the behavior is actively happening or ongoing; they don't use a hypothetical or conditional after "still", as in "I still would".
https://news.ycombinator.com/newsguidelines.html
Edit: twice is already a pattern - https://news.ycombinator.com/item?id=44110785. No more of this, please.
Edit 2: I only just realized that you've been frequently posting abusive replies in a way that crosses into harangue if not harassment:
https://news.ycombinator.com/item?id=44725284 (July 2025)
https://news.ycombinator.com/item?id=44725227 (July 2025)
https://news.ycombinator.com/item?id=44725190 (July 2025)
https://news.ycombinator.com/item?id=44525830 (July 2025)
https://news.ycombinator.com/item?id=44441154 (July 2025)
https://news.ycombinator.com/item?id=44110817 (May 2025)
https://news.ycombinator.com/item?id=44110785 (May 2025)
https://news.ycombinator.com/item?id=44018000 (May 2025)
https://news.ycombinator.com/item?id=44008533 (May 2025)
https://news.ycombinator.com/item?id=43779758 (April 2025)
https://news.ycombinator.com/item?id=43474204 (March 2025)
https://news.ycombinator.com/item?id=43465383 (March 2025)
https://news.ycombinator.com/item?id=42960299 (Feb 2025)
https://news.ycombinator.com/item?id=42942818 (Feb 2025)
https://news.ycombinator.com/item?id=42706415 (Jan 2025)
https://news.ycombinator.com/item?id=42562036 (Dec 2024)
https://news.ycombinator.com/item?id=42483664 (Dec 2024)
https://news.ycombinator.com/item?id=42021665 (Nov 2024)
https://news.ycombinator.com/item?id=41992383 (Oct 2024)
That's abusive, unacceptable, and not even a complete list!
You can't go after another user like this on HN, regardless of how right you are or feel you are or who you have a problem with. If you keep doing this, we're going to end up banning you, so please stop now.
You can validate this pretty easily by asking some logic or coding questions: you will likely note that the final output is not necessarily the logical continuation of the end of the thinking; sometimes it's significantly orthogonal to it, or the model returns to reasoning in the middle.
All that to say - good idea to read it, but stay vigilant on outputs.
(I'm experienced at reading and reviewing code.)
My ability to write with a pen has suffered enormously now that I do most of my writing on a phone or laptop - but I'm writing way more.
I expect I'll become slower at writing code without an LLM, but the volume of (useful) code I produce will be worth the trade off.
Disposable code is where AI shines.
AI generating the boilerplate code for an obtuse build system? Yes, please. AI generating an animation? Ganbatte. (Look at how much work 3Blue1Brown had to put into that--if AI can help that kind of thing, it has my blessings). AI enabling someone who doesn't program to generate some prototype that they can then point at an actual programmer? Excellent.
This is fine because you don't need to understand the result. You have a concrete pass/fail gate and don't care about underneath. This is real value. The problem is that it isn't gigabuck value.
The stuff that would be gigabuck value is unfortunately where AI falls down. Fix this bug in a product. Add this feature to an existing codebase. etc.
AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.
Is this kind of thing also possible with one of these self-hosted models in a comparable way, or are they mostly good for coding?
This is another example of LLMs being dumb copiers that do understand human prompts.
But there is one positive side to this: if this photocopying business can be run locally, the stocks of OpenAI etc. should go to zero.
Also, you know, if they fail they could say so instead of giving a hallucinated answer. First the models lie and say that factoring a 20-digit number takes vast amounts of computing. Then, if pointed to a factorization program, they pretend to execute it and lie about the output.
There is no intelligence or flexibility apart from stealing other people's open source code.
I don't think it's ever been accurate.
People are going to explore and get comfortable with alternatives.
There may have been other ways to deal with the cases they were worried about.
> My 2.5 year old with their laptop can write Space Invaders
For a few hundred milliseconds there I was thinking "these damn kids are getting good with tablets"
But I am without my glasses, and still I have Hacker News at 250%; I think I am a little cooked lol.
I suppose that it could be intended to be read as "my laptop is only 2.5 years old, and therefore fairly modern/powerful" but I doubt that was the intention.
This makes it a great way to illustrate how much better the models have got without requiring new hardware to unlock those improved abilities.
It speaks to the advancements in models that aren't just throwing more compute/ram at it.
Also, his laptop isn't that fancy.
> It claims to be small enough to run on consumer hardware. I just ran the 7B and 13B models on my 64GB M2 MacBook Pro!
I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.
It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.
Could today’s consumer hardware run a future superintelligence (or, as a weaker hypothesis, at least contain some lower-level agent that can bootstrap something on other hardware via networking or hyperpersuasion) if the binary dropped out of a wormhole?
I think much of our progress is limited by the capacity of the human brain, and we mostly proceed via abstraction which allows people to focus on narrow slices. That abstraction has a cost, sometimes a high one, and it’s interesting to think about what the full potential could be without those limitations.
A concise description of the right abstractions for our universe is probably not too far removed from the weights of a superintelligence, modulo a few transformations :)
The biggest question I keep asking myself - What is the Kolmogorov complexity of a binary image that provides the exact same capabilities as the current generation LLMs? What are the chances this could run on the machine under my desk right now?
I know how many AAA frames per second my machine is capable of rendering. I refuse to believe the gap between running CS2 at 400fps and getting ~100b/s of UTF8 text out of a NLP black box is this big.
That's not a good measure. NP problem solutions are only a single bit, but they are much harder to solve than CS2 frames for large N. If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.
Exactly. This is what compels me to try.
This new one from Qwen should fit though - it looks like that only needs ~30GB of RAM: https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-Inst...
~30GB in Q8 sure, but it's a minimal gain for double the VRAM usage.
https://jsbin.com/lejunenezu/edit?html,output
Its pelican was a total fail though.
This surprises me, I thought it would be simpler than Space Invaders.
I tried the "Write an HTML and JavaScript page implementing space invaders" prompt against it and didn't quite get a working game with a single shot, but it was still an interesting result: https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct...
Though I suppose, given a few years, that may also be true!
Chat is here: https://claude.ai/share/dc9eccbf-b34a-4e2b-af86-ec2dd83687ea
Claude Opus 4 does work but is far behind of Simon's GLM-4.5: https://claude.ai/share/5ddc0e94-3429-4c35-ad3f-2c9a2499fb5d
64GB is pure RAM? I thought Apple Silicon was efficient at paging SSD as memory storage - how important is RAM if you've got a fast SSD?
This is why I keep all my benchmarks private and don't share anything about them publicly: as soon as you write about them anywhere, they'll stop being useful within a few months.
This is also why, if I were an artist or anyone commercially relying on creative output of any kind, I wouldn't be posting anything on the internet anymore, ever. The minute you make anything public, the engines will clone it to death and turn it into a commodity.
AI is definitely breaking the whole "labor for money" architecture of our world.
Maybe the thing to do is provide public, physical exhibits of your art in search of patronage.
I considered experimenting with web DRM for art sites/portfolios, on the assumption that scrapers won't bother with the analog loophole (and dedicated art-style cloners would hopefully be disappointed by the quality), but gave up because of limited compatible devices for the strongest DRM levels, and HDCP being broken on those levels anyway. If the DRM technique caught on it would take attackers, at most, a few bucks and hours once to bypass it, and I don't think users would truly understand that upfront.
Simon probably wouldn't be happy about killing his multi-year evaluation metric though...
My pelican on a bicycle benchmark is a long con. The goal is to finally get a good SVG of a pelican riding a bicycle, and if I can trick AI labs into investing significant effort in cheating on my benchmark then fine, that gets me my pelican!
I just think we should stop to appreciate exactly how awesome language models are. It's compressing and correctly reproducing a lot of data with meaningful context between each token and the rest of the context window. It's still amazing, especially with smaller models like this, because even if it's reproducing a clone, you can still ask questions about it and it should perform reasonably well explaining to you what it does and how you can take it over to further develop that clone.
Like all these vibe-coded to-do apps, one of the most common starter problems in programming courses.
It’s great that an AI can do that but it could stall progress if we get limited to existing tools and programs.