Show HN: Z80-μLM, a 'Conversational AI' That Fits in 40KB

514•quesomaster9000•1mo ago

How small can a language model be while still doing something useful? I wanted to find out, and had some spare time over the holidays.

Z80-μLM is a character-level language model with 2-bit quantized weights ({-2,-1,0,+1}) that runs on a Z80 with 64KB RAM. The entire thing (inference, weights, chat UI) fits in a 40KB .COM file that you can run in a CP/M emulator and hopefully even on real hardware!
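
As a rough illustration of why 2-bit weights are so compact, here is a minimal Python sketch that packs four weights into each byte. It assumes a two's-complement encoding (00 = 0, 01 = +1, 10 = -2, 11 = -1) with the first weight in the lowest bits; the function names and the byte layout are assumptions for illustration and may not match what the Z80 code actually does.

```python
def pack_weights(weights):
    """Pack weights from {-2, -1, 0, +1} four to a byte, lowest bits first."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for slot, w in enumerate(weights[i:i + 4]):
            if w not in (-2, -1, 0, 1):
                raise ValueError(f"weight {w} is outside the 2-bit range")
            # Masking with 0b11 gives the 2-bit two's-complement code.
            byte |= (w & 0b11) << (2 * slot)
        packed.append(byte)
    return bytes(packed)


def unpack_weights(packed):
    """Inverse of pack_weights: recover the signed weights from each byte."""
    weights = []
    for byte in packed:
        for slot in range(4):
            code = (byte >> (2 * slot)) & 0b11
            # Codes 2 and 3 represent -2 and -1 respectively.
            weights.append(code - 4 if code >= 2 else code)
    return weights
```

With this layout, N weights occupy N/4 bytes, which is what makes a model of this size plausible inside a .COM file.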

It won't write your emails, but it can be trained to play a stripped down version of 20 Questions, and is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.

The extreme constraints nerd-sniped me and forced interesting trade-offs: trigram hashing (typo-tolerant, loses word order), 16-bit integer math, and some careful massaging of the training data meant I could keep the examples 'interesting'.
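
To make the trigram trade-off concrete, here is a minimal sketch of the general technique, not the project's actual encoder. The bucket count of 128, the multiply-by-31 hash, and the names `trigram_buckets` and `overlap` are all assumptions chosen for illustration.

```python
NUM_BUCKETS = 128  # hypothetical input size, not taken from the project


def trigram_buckets(text, num_buckets=NUM_BUCKETS):
    """Count character trigrams of `text` in a fixed number of hash buckets."""
    text = text.lower()
    counts = [0] * num_buckets
    for i in range(len(text) - 2):
        h = 0
        for ch in text[i:i + 3]:
            h = (h * 31 + ord(ch)) & 0xFFFF  # keep the hash within 16 bits
        counts[h % num_buckets] += 1
    # The vector records which trigrams occurred, not where they occurred,
    # which is why position (and so most word order) is lost.
    return counts


def overlap(a, b):
    """Number of bucket hits two inputs have in common."""
    return sum(min(x, y) for x, y in zip(a, b))


clean = trigram_buckets("hello there")
typo = trigram_buckets("helo there")
print(overlap(clean, typo), "of", sum(clean), "trigram hits survive the typo")
```

A single dropped letter only disturbs the trigrams that touch it, so most of the input vector stays the same, which is where the typo tolerance comes from.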

The key was quantization-aware training that accurately models the inference code limitations. The training loop runs both float and integer-quantized forward passes in parallel, scoring the model on how well its knowledge survives quantization. The weights are progressively pushed toward the 2-bit grid using straight-through estimators, with overflow penalties matching the Z80's 16-bit accumulator limits. By the end of training, the model has already adapted to its constraints, so no post-hoc quantization collapse.
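
The pieces of that loop can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the real training code: it handles a single dot product, uses squared error, and the names `quantize`, `int_dot` and `ste_step` are hypothetical. It also leaves out the progressive push toward the grid.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def quantize(w):
    """Round to an integer (Python rounds ties to even) and clamp to {-2..+1}."""
    return max(-2, min(1, round(w)))


def int_dot(weights_q, inputs):
    """Integer dot product that also reports the worst excursion past int16.

    The second return value is 0 when the running sum always fits in a signed
    16-bit accumulator; a training loop could add it to the loss as a penalty.
    """
    acc = 0
    worst_excess = 0
    for w, x in zip(weights_q, inputs):
        acc += w * x
        excess = max(acc - INT16_MAX, INT16_MIN - acc, 0)
        worst_excess = max(worst_excess, excess)
    return acc, worst_excess


def ste_step(latent, inputs, target, lr=0.01):
    """One gradient step on squared error with a straight-through estimator.

    The forward pass uses quantized weights. The backward pass treats
    quantize() as the identity, so the gradient with respect to each
    quantized weight is applied directly to its float latent weight.
    """
    weights_q = [quantize(w) for w in latent]
    pred, _ = int_dot(weights_q, inputs)
    err = pred - target
    return [w - lr * 2 * err * x for w, x in zip(latent, inputs)]
```

The float latent weights keep accumulating small updates even while their quantized values stay put, which is what lets a weight eventually cross over to a neighbouring grid point.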

Eventually I ended up spending a few dollars on the Claude API to generate 20 Questions data (see examples/guess/GUESS.COM). I hope Anthropic won't send me a C&D for distilling their model against the ToS ;P

But anyway, happy code-golf season everybody :)

Comments

Zee2•1mo ago

This is super cool. Would love to see a Z80 simulator set up with these examples to play with!

Imustaskforhelp•1mo ago

100% Please do this! I wish the same

dmd•1mo ago

https://3e.org/private/z80ulmweb/

It's just one-shot AI slop - literally, the prompt was 'make a web based version of [github url of this project]' and it spat this out. It appears to work fine.

I'll keep it up for a couple of months and then it'll be auto-deleted, no sense in keeping it around longer than that.

jasonjmcghee•1mo ago

For future projects and/or for this project, there are many LLMs available more than good enough to generate that kind of synthetic data (20 Qs) with permissive terms of use. (So you don’t need to stress about breaking TOS / C&D etc)

codetiger•1mo ago

Imagine this working on a Game Boy back in those days. It would have seemed like magic.

alfiedotwtf•1mo ago

And would have lasted 3 minutes.

Speaking of - I remember my first digital camera (Fujitsu 1Mb resolution using SmartMedia)… it used so much power that you could take 20-30 photos and then needed to replace all 4 batteries lol

Sharlin•1mo ago

I don’t think this could beat an ELIZA-style bot in how magical it feels, given the extreme terseness of its replies.

lodovic•1mo ago

I love these thought experiments. Looking at the code size, it would have been possible for someone to come up with this back in the day, similar to the idea of a million monkeys on a typewriter eventually producing Shakespeare.

numpad0•1mo ago

Flip phones had predictive texts since forever. LLMs are just* supercharged predi[ctive text algorithms are computer algorithms that are]

qingcharles•1mo ago

"Look, my Game Boy passes the Turing Test!"

*burns you at the stake*

alfiedotwtf•1mo ago

An LLM in a .com file? Haha made my day

teaearlgraycold•1mo ago

SLM

quesomaster9000•1mo ago

All the 'Small' language models and the 'TinyML' scene in general tend to bottom out at a million parameters, hence I thought 'micro' is more apt at ~150k params.

roygbiv2•1mo ago

Awesome. I've just designed and built my own z80 computer, though right now it has 32kb ROM and 32kb RAM. This will definitely change on the next revision so I'll be sure to try it out.

wewewedxfgdf•1mo ago

RAM is very expensive right now.

tgv•1mo ago

We're talking kilobytes, not gigabytes. And it isn't DDR5 either.

boomlinde•1mo ago

Yeah, even an average household can afford 40k of slow DRAM if they cut down on luxuries like food and housing.

wewewedxfgdf•1mo ago

Maybe the rich can but not all retro computer enthusiasts are rich.

charcircuit•1mo ago

If you can afford to spend a few dollars without sacrificing housing or food, you are being financially irresponsible.

ant6n•1mo ago

Just cut down on the avocado toast!

nrhrjrjrjtntbt•1mo ago

Then I can afford eggs, RAM and a studio apartment!

lacoolj•1mo ago

Maybe in Ohio

fuzzfactor•1mo ago

No apartment then, maybe just green, eggs, and RAM.

StilesCrisis•1mo ago

thats-the-joke.gif

wickedsight•1mo ago

I just removed 128 megs of RAM from an old computer and am considering listing it on eBay to pay off my mortgage.

nrhrjrjrjtntbt•1mo ago

I wonder in what past year 128M of RAM would have paid off a mortgage. Maybe 1985.

vedmakk•1mo ago

Suppose one trained an actual secret (e.g. a passphrase) into such a model, which a user would need to guess by asking the right questions. Could this secret be easily reverse engineered / inferred by having access to the model's weights, or would it be safe to assume that one could only get to the secret by asking the right questions?

Kiboneu•1mo ago

I don’t know, but your question reminds me of this paper which seems to address it on a lower level: https://arxiv.org/abs/2204.06974

“Planting Undetectable Backdoors in Machine Learning Models”

“ … On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. …”

ronsor•1mo ago

> this secret be easily reverse engineered / inferred by having access to models weights

It could with a network this small. More generally this falls under "interpretability."

nineteen999•1mo ago

This couldn't be more perfectly timed .. I have an Unreal Engine game with both VT100 terminals (for running coding agents) and Z80 emulators, and a serial bridge that allows coding agents to program the CP/M machines:

https://i.imgur.com/6TRe1NE.png

Thank you for posting! It's unbelievable how someone sometimes just drops something that fits right into what you're doing. However bizarre it seems.

quesomaster9000•1mo ago

Oh dear, it seems we've... somehow been psychically linked...

I developed a browser-based CP/M emulator & IDE: https://lockboot.github.io/desktop/

I was going to post that, but wanted a 'cool demo' instead, and fell down the rabbit hole.

jaak•1mo ago

I've been playing the Z80-μLM demos in your CP/M emulator. Works great! However, I have yet to guess a correct answer in GUESS.COM! I'm not sure if I'm just not asking the right questions or I'm just really bad at it!

quesomaster9000•1mo ago

Don't tell anybody, but you sit on it

sailfast•1mo ago

Boris!!!

stevekemp•1mo ago

That is beautiful.

I wrote a console-based emulator, and a simple CP/M text-adventure game somewhat recently

https://github.com/skx/cpmulator/

At some point I should rework my examples/samples to become a decent test-suite for CP/M emulators. There are so many subtle differences out there.

It seems I could even upload a zipfile of my game, but the escape-codes for clearing the screen don't work, sadly:

https://github.com/skx/lighthouse-of-doom

nineteen999•1mo ago

Haha I love it. Just imagine if instead of DOS-based Windows, a CP/M based alternative evolved and took over the PC industry. Nice one!

sixtyj•1mo ago

Connections: Alternative History of Technology by James Burke documents these "coincidences".

TeMPOraL•1mo ago

Those "coincidences" in Connections are really no coincidence at all, but path dependence. Breakthrough advance A is impossible or useless without prerequisites B and C and economic conditions D, but once B and C and D are in place, A becomes the obvious next step.

embedding-shape•1mo ago

Some of those really are coincidences, like "Person A couldn't find their left shoe and ended up in London at a coffee house, where Person B accidentally ended up when their carriage hit a wall, which led to them eventually coming up with Invention C" for example.

Although from what I remember of the TV show, most of what he investigates/talks about is indeed path dependence in one way or another, even if not everything was like that.

sixtyj•1mo ago

That’s why I’ve put the word in quotes :)

simonjgreen•1mo ago

Super intrigued but annoyingly I can’t view imgur here

abanana•1mo ago

Indeed, part of me wants to not use imgur because we can't access it, but a bigger part of me fully supports imgur's decision to give the middle finger to the UK after our government's censorship overreach.

wizzwizz4•1mo ago

It was a really clever move on Imgur's part. Their blocking the UK has nothing to do with the Online Safety Act: it's a response to potential prosecution under the Data Protection Act, for Imgur's (alleged) unlawful use of children's personal data. By blocking the UK and not clearly stating why, people assume they're taking a principled stand about a different issue entirely, so what should be a scandal is transmuted into positive press.

homebrewer•1mo ago

It blocks many more countries than just the UK because it's the lowest effort way of fighting "AI" scrapers.

imgur was created as a sort of protest against how terrible most image hosting platforms were back then, went down the drain several years later, and it's now just like they were.

supern0va•1mo ago

It turns out that running free common internet infrastructure at scale is both hard and expensive, unfortunately. What we really need is a non-profit to run something like imgur.

Dwedit•1mo ago

In before AI companies buy up all the Z80s and raise the prices to new heights.

nubinetwork•1mo ago

Too late, they stopped being available last year.

whobre•1mo ago

Kind of. There’s still eZ80

pdyc•1mo ago

interesting, i am wondering how far it can go if we remove some of these limitations but try to solve some extremely specific problem, like generating regex based on user input? i know small models (270M range) can do that, but can it be done in, say, the < 10MB range?

Waterluvian•1mo ago

Generate an LLM that is designed to solve one extremely specific problem: answering the ultimate question of life, the universe, and everything.

Even with modern supercomputing the computation would be outpaced by the heat death of the universe, so token output must be limited to a single integer.

nrhrjrjrjtntbt•1mo ago

00101010

dirkt•1mo ago

Eliza's granddaughter.

a_t48•1mo ago

Nice - that will fit on a Gameboy cartridge, though bank switching might make it super terrible to run. Each bank is only 16k. You can have a bunch of them, but you can only access one bank at a time (well, technically two - bank 0 is IIRC always accessible).

ant6n•1mo ago

You have 32KB of ROM, plus 8 KB of RAM on the original Game Boy. The Game Boy Color has more. Bank switching is super fast as well. Given that models are likely streamed, I doubt the bank switching is a problem.

Biggest pain point is likely the text input.

ColonelPhantom•1mo ago

Each layer of the LM is also at most 16 KiB, so if you want to minimize bank switching, I think making sure each layer is in one bank would be enough? Bank switching shouldn't give much overhead anyway unless it complicates an inner loop, which would be avoided if no layers are split across banks.
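
A quick way to check that idea is a greedy placement that never splits a layer across banks. This is a minimal sketch: the layer shapes below are made up purely to exercise the logic, and starting at bank 1 assumes bank 0 is the fixed bank holding the code.

```python
BANK_BYTES = 16 * 1024  # size of one switchable ROM bank


def layer_bytes(n_in, n_out, bits_per_weight=2):
    """Bytes needed for a dense n_in x n_out weight matrix, rounded up."""
    return (n_in * n_out * bits_per_weight + 7) // 8


def assign_banks(sizes, bank_bytes=BANK_BYTES, first_bank=1):
    """Place layers in order, starting a new bank rather than splitting one.

    Returns a list of (bank_number, offset_in_bank), one entry per layer.
    """
    placements = []
    bank, used = first_bank, 0
    for size in sizes:
        if size > bank_bytes:
            raise ValueError(f"layer of {size} bytes cannot fit in one bank")
        if used + size > bank_bytes:
            bank, used = bank + 1, 0  # move on instead of straddling banks
        placements.append((bank, used))
        used += size
    return placements


# Hypothetical layer shapes, not the real model's dimensions.
shapes = [(256, 192), (192, 128), (128, 64)]
sizes = [layer_bytes(n_in, n_out) for n_in, n_out in shapes]
placements = assign_banks(sizes)
print(sizes, placements)
```

With a layout like this, the inner loop of a layer only ever reads from one bank, so the switch happens once per layer rather than inside the multiply-accumulate loop.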

magicalhippo•1mo ago

As far as I know, the last layer is very quantization-sensitive, and is typically not quantized, or quantized lightly.

Have you experimented with having it less quantized, and evaluated the quality drop?

Regardless, very cool project.

kouteiheika•1mo ago

(Not OP)

It depends on the model, but from my experiments (quantizing one layer of a model to 2-bit and then training the model with that layer in 2-bit to fix the damage) the first layer is the most sensitive, and yes, the last layer is sensitive too. The middle layers take quantization best.

Different components of a layer also have a different sensitivity; e.g. the MLP downscale block damages the model the most when quantized, while quantizing the Q projection in self attention damages the model the least.

Zardoz84•1mo ago

Meanwhile, Eliza was ported to BASIC and was run on many home computers in the 80s.

anonzzzies•1mo ago

Luckily I have a very large amount of MSX computers, zx, amstrad cpc etc and even one multiprocessor z80 cp/m machine for the real power. Wonder how gnarly this is going to perform with bankswitching though. Probably not good.

vatary•1mo ago

It's pretty obvious this is just a stress test for compressing and running LLMs. It doesn't have much practical use right now, but it shows us that IoT devices are gonna have built-in LLMs really soon. It's a huge leap in intelligence—kind of like the jump from apes to humans. That is seriously cool.

acosmism•1mo ago

i'll echo that practicality only surfaces once it is apparent what can be done. yea, this feels like a "DOOM running on pregnancy test devices" type of moment

orbital-decay•1mo ago

Pretty cool! I wish free-input RPGs of old had fuzzy matchers. They worked by exact keyword matching and it was awkward. I think the last game of that kind (where you could input arbitrary text when talking to NPCs) was probably Wizardry 8 (2001).

NooneAtAll3•1mo ago

did you measure token/s?

Peteragain•1mo ago

There are two things happening here. A really small LLM mechanism which is useful for thinking about how the big ones work, and a reference to the well known phenomenon, commonly dismissively referred to as a "trick", in which humans want to believe. We work hard to account for what our conversational partner says. Language in use is a collective cultural construct. By this view the real question is how and why we humans understand an utterance in a particular way. Eliza, Parry, and the Chomsky bot at http://chomskybot.com work on this principle. Just sayin'.

nrhrjrjrjtntbt•1mo ago

MAYBE

cwmoore•1mo ago

Universally correct reply, although honestly a bit vague.

Peteragain•1mo ago

Fair. The background reading is the EMCA stuff: conversation analysis (cf. Sacks et al.) and ethnomethodology (Garfinkel). And Vygotsky (cf. Kozulin). People such as Robert Moore at IBM and Lemon at Heriot-Watt work in this space, but there is no critical mass in the face of LLM mania.

Peteragain•1mo ago

And the Chomskybot analysis is quite enlightening..

rahen•1mo ago

I love it, instant Github star. I wrote an MLP in Fortran IV for a punched card machine from the sixties (https://github.com/dbrll/Xortran), so this really speaks to me.

The interaction is surprisingly good despite the lack of attention mechanism and the limitation of the "context" to trigrams from the last sentence.

This could have worked on 60s-era hardware and would have completely changed the world (and science fiction) back then. Great job.

noosphr•1mo ago

Stuff like this is fascinating. Truly the road not taken.

Tin foil hat on: I think that a huge part of the major buyout of RAM by AI companies is to keep people from realising that we are essentially at the home computer revolution stage of LLMs. I have a 1 TB RAM machine which with custom agents outperforms all the proprietary models. It's private, secure and won't let me be monetized.

Zacharias030•1mo ago

How so? Sounds like you are running Kimi K2 / GLM? What agents do you give it, and how do you handle web search and computer use well?

jacquesm•1mo ago

Between this and RAM prices Zilog stock must be up! Awesome hack. Now apply the same principles to a laptop and take a megabyte or so, see what that does.

andrepd•1mo ago

We should show this every time a Slack/Teams/Jira engineer tries to explain to us why a text chat needs 1.5GB of ram to start up.

dangus•1mo ago

> It won't write your emails, but it can be trained to play a stripped down version of 20 Questions, and is sometimes able to maintain the illusion of having simple but terse conversations with a distinct personality.

You can buy a kid’s tiger electronics style toy that plays 20 questions.

It’s not like this LLM is a bastion of glorious efficiency, it’s just stripped down to fit on the hardware.

Slack/Teams handles company-wide video calls and can render anything a web browser can, and they run an entire App Store of apps, all from a cross-platform application.

Including Jira in the conversation doesn’t even make logical sense. It’s not a desktop application that consumes memory. Jira has such a wide scope that the word “Jira” doesn’t even describe a single product.

messe•1mo ago

> can render anything a web browser can

That's a bug not a feature, and strongly coupled to the root cause for slack's bloat.

dangus•1mo ago

One person’s “bloat” is another person’s “critical business feature.”

The app ecosystem of Slack is largely responsible for its success. You can extend it to do almost anything you want.

spopejoy•1mo ago

> app ecosystem of Slack is largely responsible for its success.

Is that true? Slack was one of the first private chats that was not painful to use, circa 2015. I personally hate the integrations and wish they'd just fix the bugs in their core product.

andrepd•1mo ago

My Pentium 3 in 2005 could do chat and video calls and play chess and send silly emotes. There is no conceivable user-facing reason why in 20 years the same functionality takes 30× as many resources, only developer-facing reasons. But those are not valid reasons for a professional. If a bridge engineer claims he now needs 30× as much concrete to build the same bridge as he did 20 years ago, and the reason is his/her own convenience, that would not fly.

ben_w•1mo ago

> If a bridge engineer claims he now needs 30× as much concrete to build the same bridge as he did 20 years ago, and the reason is his/her own convenience, that would not fly.

By itself, I would agree.

However, in this metaphor, concrete got 15x cheaper in the same timeframe. Not enough to fully compensate for the difference, but enough that a whole generation are now used to much larger edifices.

andrepd•1mo ago

So it means you could save your client 93% of their money in concrete, but you choose to make it 2× more expensive! That only makes my metaphor stronger ahaha.

beagle3•1mo ago

But also, there is more traffic on the bridge.

The word processors of 30 years ago often had limits like “50k chapters” and required “master documents” for anything larger. Lotus 123 had far fewer columns and rows than modern Excel.

Not an excuse, of course, but the older tools are not usable anymore if you have modern expectations.

ben_w•1mo ago

You could save 93% of the money in concrete, at the cost of ???* in the more-expensive-than-ever time of the engineer themselves who now dominates the sticker price.

(At this point the analogy breaks down because who pays for the software being slower is the users' time, not the taxes paid by a government buying a bridge from a civil engineer…)

* I don't actually buy the argument that the last decade or so of layers of "abstraction" save us developers any time at all, rather I think they're now several layers deep of nested inner platforms that each make things more complicated, but that's a separate entire thread, and blog post: https://benwheatley.github.io/blog/2024/04/07-21.31.19.html

kiicia•1mo ago

But it only shows how wasteful your new bridge is. Concrete being cheaper does not mean you somehow need to use more of it.

dangus•1mo ago

I have great doubts that you were doing simultaneous screen sharing from multiple participants with group annotation plus HD video in your group calls, all while supporting chatting that allowed you to upload and view multiple animated gifs, videos, rich formatted text, reactions, slash command and application automation integrations, all simultaneously on your Pentium 3.

I would be interested to know the name of the program that did all that within the same app during that time period.

For some reason Slack gets criticism for being “bloated” when it basically does anything you could possibly imagine and is essentially a business communication application platform. Nobody can actually name a specific application that does everything Slack does with better efficiency.

andrepd•1mo ago

You're grasping at anything to justify the unjustifiable. Not only did I do most (not all, obviously) of those things in my Pentium 3, including video and voice chat, screenshare, and silly animated gifs and rich text formatting, but also: that's beside the point. Let's compare like with like then; how much memory does it take to have a group chat with a few people and do a voice/video in MSN messenger or the original Skype, and how much does Slack or Teams take? What about UI stutter? Load time? There's absolutely no justification for a worse user experience in a 2025 computer that would be a borderline supercomputer in 2005.

dangus•1mo ago

You bring up apps like Skype doing similar work in 2005, but Skype was barely out of its 2003 public alpha by then. Version 2.0 beta came out in 2005 and was the first version to support video, and only supported video calling between two people.

And you bring up things that are supposedly bad about Slack that are basically non-existent boogeymen. UI stutter, load time, and excessive memory use, I can’t think of any time any of these things have existed at all or noticeably impacted my experience on Slack on a basic low end laptop.

Those older apps like MSN Messenger and the original Skype didn’t actually do the things that Slack does now. I mean specifically multiple simultaneous screen shares plus annotations plus HD video feeds (with important features like blurred and replaced backgrounds, added by Skype in 2019) for all participants plus running an entire productivity app in the background at the same time.

Skype didn’t have screen sharing, at all, until 2009.

https://content.dsp.co.uk/history-of-skype

You call this situation “unjustifiable” but we would struggle to find any personal computing device sold at any price point that can’t handle the application smoothly. If I go back five years and buy a $200 mini PC or a $300 iPad or $500 laptop it’s going to run Slack just fine.

Specs are just arbitrary numbers on a box. It doesn’t matter that we got to the moon using a turd and a ham sandwich for a computer.

You can’t accept that the layperson doesn’t care that back in my day we walked uphill both ways for 15 miles on our dial-up connection. If it works, it works.

ben_w•1mo ago

> Slack/Teams handles company-wide video calls and can render anything a web browser can, and they run an entire App Store of apps, all from a cross-platform application.

The 4th Gen iPod touch had 256 meg of RAM and also did those things, with video calling via FaceTime (and probably others, but I don't care). Well, except "cross platform", what with it being the platform.

dangus•1mo ago

Group FaceTime calls didn’t exist at the time. That wasn’t added until 2018 and required iOS 12.

Remember that Slack does simultaneous multiple participants screen sharing plus annotations plus HD video feeds from all participants plus the entirety of the rest of the app continues to function as if you weren’t on a call at all simultaneously.

It’s an extremely powerful application when you really step back and think about it. It just looks like “text” and boring business software.

ben_w•1mo ago

> Group FaceTime calls didn’t exist at the time. That wasn’t added until 2018 and required iOS 12.

And CU-SeeMe did that in the early 90s with even worse hardware: https://en.wikipedia.org/wiki/File:CU-Schools.GIF

Even more broadly, group calls were sufficiently widely implemented to get themselves standardised 29 years ago: https://en.wikipedia.org/wiki/H.323

> It’s an extremely powerful application when you really step back and think about it. It just looks like “text” and boring business software.

The *entire operating system of the phone* is more powerful, and ran on less.

dangus•1mo ago

Why don’t you just go ahead and tell me what specs you think Slack should run on and link me to an example program that has 100% feature parity that stays within those specs?

Showing me a black and white <10FPS group video call with no other accompanying software running simultaneously in the 90s is pointless.

Showing me that someone thought of a protocol is pointless. Just look at the history of HDTV. We wouldn’t really describe HDTV as being available to consumers despite it existing in the early 1990s.

I’d also like you to show me a laptop SKU sold in the last 10 years that is incapable of running Slack. If Slack is so inefficient you should be able to find me a computer that struggles with it.

Finally, I’ll remind you that Slack for mobile is a different application that isn’t running in the same way as the desktop app and uses fewer resources. The latest version of it will run on very old phone hardware, going all the way back to the iPhone 8 (2GB RAM), and that’s assuming you even need the latest version for it to function.

ben_w•1mo ago

> Why don’t you just go ahead and tell me what specs you think Slack should run on

1 GHz processor, 512 MB RAM (might even manage 256 MB), 1080p monitor. And "a graphics accelerator", "a sound card", and "a webcam and microphone".

Probably even less on the RAM and CPU.

> and link me to an example program that has 100% feature parity that stays within those specs?

Windows 2000. Or XP.

That's the point. The OS supports all the apps needed to do whatever.

Making Slack into a monolithic blob to do all is just an example of the inner platform effect.

But if you insist: IE 7 would have been able to do all this. It's an app. It's also an example of the inner platform effect.

> Showing me a black and white <10FPS group video call with no other accompanying software running simultaneously in the 90s is pointless.

You should've thought of that before trying to "well akshually" me about which versions of FaceTime support multi-user video calling.

You want video calling? We had that 30 years ago on systems with total RAM smaller than current CPU cache, with internal busses whose bandwidth was less than your mobile's 5G signal, on screens smaller than the icon that has to be submitted to the App Store, with cameras roughly comparable to what we now use for optical mice, running over networks that were MacGyvered onto physical circuits intended for a single analogue voice signal.

Out of everything you list that Slack can do, the only thing that should even be remotely taxing is the HD video calling. Nothing else, at all. And the only reasons for even that to be taxing is correctly offloading work to the GPU and that you want HD. The GPU should handle this kind of thing trivially so long as you know how to use it.

All the "business logic" you mention in the other thread… if you can't handle the non-video business logic needed to be a server hosting 2000 simultaneous users on something with specs similar to a Raspberry Pi, you're not trying hard enough. I've done that. Business logic is the easy part for anything you can describe as "chat". Even if you add some minigames in there and the server is keeping track of the games, it should be a rounding error on a modern system.

fc417fc802•1mo ago

If these applications only hogged memory when under stress (outgoing screencap plus video, multiple streams incoming, display to 3+ monitors) you might have a point. But that's not the case so you don't.

Meanwhile I can play back multiple 1080p videos on different monitors, run a high speed curl download, saturate my gigabit LAN with a bulk transfer, and run a btrfs scrub in the background, all most likely without breaking 2 GB of RAM usage. MPV, VLC, and ffmpeg are all remarkably lightweight.

The only daily application I run that consumes a noticeable quantity of resources is my web browser.

dangus•1mo ago

If you didn’t babysit your task manager would you know which program used more RAM or not?

This argument is just so endless and tiring.

Saturating my bandwidth or running a btrfs scrub isn’t accomplishing the business logic I need to do my job, that’s what my web browser is doing.

fc417fc802•1mo ago

So is it the "business logic" or is it the multiple HD streams that are supposed to account for the resource consumption? You've changed your story. But do please explain how the "business logic" to handle the chat box, UI, and whatever else is supposed to justify the status quo.

People making excuses for poorly designed software is what's tiring.

numpad0•1mo ago

The problem with that kind of feature/benefit based thinking is that it won't correlate with code or computational footprints well. That's like justifying price of cars with seatback materials. That's not where the costs are.

Modern chat apps like Slack, Discord, Teams, etc. are extremely resource intensive solely by being skinned Chrome showing bloated HTML. That's it. Most of the "actual" engineering of it is outsourced and externalized to Google, NVIDIA/Intel/AMD, Microsoft/Apple, etc.

Y_Y•1mo ago

Very cool. Did you consider using sparse weights?

bytesandbits•1mo ago

it's giving Eliza! Ha, fun

bartread•1mo ago

This is excellent. Thing I’d like to do if I had time: get it running on a 48K Spectrum. 10 year old me would have found that absolutely magical back in the 1980s.

tomduncalf•1mo ago

This was my first thought too haha. That would be mind blowing

bartread•1mo ago

Yeah, very WarGames.

EDIT: Actually thinking about it some more…

- Imagine what you could do with 16-bit games of the era with one or more of these models embedded. Swap the model depending on the use case within the game. Great for adventures, RPGs, strategy, puzzle, and trading games (think Elite). With 512K or 1MB of RAM, plus 2 - 4 floppies (which became increasingly common as the era wore on), you could probably do a lot, especially if the outcomes of conversations can result in different game outcomes

- Back in the day nobody was really trying to do anything serious with AI on 8 or even most 16-bit machines, because nobody thought they were powerful enough to do anything useful with. Now the thinking has changed to how much somewhat useful intelligence can I cram into the least powerful device, even if that’s only for fun?

- Imagine showing this running on a CP/M machine, like the C128, to a serious AI researcher working back in the 1980s. Minds blown, right?

- Now spool forward 10 years into the 1990s and think what PC hardware of that era would have been capable of with these limited language models. I wonder what that era might have looked like with something that seems like somewhat useful conversational AI? A sort of electro-steampunk-ish vibe maybe? People having really odd conversations with semi-capable home automation running via their PCs.

gcanyon•1mo ago

So it seems like with the right code (and maybe a ton of future infrastructure for training?) Eliza could have been much more capable back in the day.

antonvs•1mo ago

The original ELIZA ran on an IBM 7094 mainframe, in the 1960s. That machine had 32K x 36-bit words, and no support for byte operations. It did support 6-bit BCD characters, packed 6 per word, but those were for string operations, and didn't support arithmetic or logical operations.

This means that a directly translated 40 KB Z80 executable might be a tight squeeze on that mainframe, because 40K > 32K, counting words, not bytes. Of course if most of that size is just 2-bit weight data then it might not be so bad.
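
The word-versus-byte arithmetic can be checked with a short sketch. It assumes "40 KB" means 40 × 1024 eight-bit bytes and only counts storage; it says nothing about how large translated 7094 code would actually be.

```python
WORD_BITS = 36
MEMORY_WORDS = 32 * 1024
PROGRAM_BYTES = 40 * 1024  # assumes "40 KB" means 40 * 1024 eight-bit bytes


def words_needed(n_bytes, bytes_per_word):
    """Words required when each word holds `bytes_per_word` whole bytes."""
    return -(-n_bytes // bytes_per_word)  # ceiling division


# Naive translation: one Z80 byte stored in each 36-bit word.
one_per_word = words_needed(PROGRAM_BYTES, 1)

# Packed storage: four whole 8-bit bytes fit in a 36-bit word.
four_per_word = words_needed(PROGRAM_BYTES, WORD_BITS // 8)

# 2-bit weights pack even more densely, with no wasted bits.
weights_per_word = WORD_BITS // 2

print(one_per_word, one_per_word <= MEMORY_WORDS)
print(four_per_word, four_per_word <= MEMORY_WORDS)
print(weights_per_word)
```

So the squeeze only applies to the one-byte-per-word case; data that is packed, and 2-bit weight data in particular, would take well under a third of the machine's memory.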

ELIZA running on later hardware would have been a different story, with the Z80 - released in 1976 - being an example.

bitwize•1mo ago

Don't be surprised if you're paid a visit by the SCP Foundation: https://scp-wiki.wikidot.com/scp-079

(edit: change url)

coolius•1mo ago

This is impressive; those are some very restrictive requirements. I wonder what we could run on more powerful hardware such as an ESP32 or RP2040. Has anyone tried this?

gwern•1mo ago

So if it's not using attention and it processes the entire input into an embedding to process in one go, I guess this is neither a Transformer nor an RNN but just an MLP?
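If that reading is right, the forward pass might look roughly like the Python sketch below. It combines the pieces named in the post (hashed character trigrams, weights in {-2,-1,0,+1}, 16-bit integer accumulation), but the hash function, bucket count, layer sizes, ReLU activation, and random weights are all assumptions made up for illustration, not the project's actual code.

```python
import random

# Hypothetical sizes; the real model's dimensions are not given in this thread.
N_BUCKETS, HIDDEN, N_OUT = 128, 16, 4
GRID = (-2, -1, 0, 1)  # the 2-bit weight values named in the post

def trigram_buckets(text, n_buckets=N_BUCKETS):
    """Count hashed character trigrams; position in the input is discarded."""
    counts = [0] * n_buckets
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        h = 0
        for ch in padded[i:i + 3]:          # assumed toy hash, not the author's
            h = (h * 31 + ord(ch)) & 0xFFFF
        counts[h % n_buckets] += 1
    return counts

def to_int16(x):
    """Wrap to signed 16 bits, the way a 16-bit accumulator would overflow."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def dense(inputs, weights, relu):
    """Integer matrix-vector product with a wrapping 16-bit accumulator."""
    out = []
    for row in weights:
        acc = 0
        for x, w in zip(inputs, row):
            acc = to_int16(acc + x * w)
        out.append(max(acc, 0) if relu else acc)
    return out

rng = random.Random(0)  # random stand-in weights, not trained ones
w1 = [[rng.choice(GRID) for _ in range(N_BUCKETS)] for _ in range(HIDDEN)]
w2 = [[rng.choice(GRID) for _ in range(HIDDEN)] for _ in range(N_OUT)]

def forward(text):
    """Whole input -> one fixed-size vector -> two dense layers -> scores."""
    hidden = dense(trigram_buckets(text), w1, relu=True)
    return dense(hidden, w2, relu=False)

scores = forward("are you an animal?")
print(scores)
```

There is no attention and no recurrent state in this sketch: the whole prompt is collapsed into one fixed-size count vector before any weights are applied, which is what makes it an MLP in shape.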

giancarlostoro•1mo ago

This is something I've been wondering about myself. What's the "Minimally Viable LLM" that can have simple conversations? Then my next question is, how much can we push it so it can learn from looking up data externally? Can we build a tiny model with an insanely large context window? I have to assume I'm not the only one who has asked or thought of these things.

Ultimately, if you can build an ultra tiny model that can talk and learn on the fly, you've just fully localized a personal assistant like Siri.

Dylan16807•1mo ago

For your first question, the LLM someone built in Minecraft can handle simple conversations with 5 million weights, mostly 8 bits.

I doubt it would be able to make good use of a large context window, though.

fho•1mo ago

You might be interested in RWKV: https://www.rwkv.com/

Not exactly "minimal viable", but a "what if RNNs were good for LLMs" case study.

-> insanely fast on CPUs

giancarlostoro•1mo ago

My personal idea revolves around "can I run it on a basic smartphone, with whatever the 'floor' for basic smartphones under let's say $300 is for memory?" (let's pretend RAM prices are normal).

Edit: The fact this runs on a smartphone means it is highly relevant. My only thing is, how do we give such a model an "unlimited" context window, so it can digest as much as it needs? I know some models know multiple languages; I wouldn't be surprised if sticking to only English would reduce the model size / need for more hardware and make it even smaller / tighter.

andy12_•1mo ago

This is extremely similar to Karpathy's idea of a "cognitive core" [1]; an extremely small model with near-0 encyclopedic knowledge and basic reasoning and tool-use capabilities.

[1] https://x.com/karpathy/status/1938626382248149433

qingcharles•1mo ago

I think what's amazing to speculate is how we could have had some very basic LLMs in at least the 90s if we'd invented the tech previously. I wonder what the world would be like now if we had?

DrNosferatu•1mo ago

Awesome! Anyone for a port to the MSX?

A web version would also be cool.

integricho•1mo ago

Someone add it to collapseos please :)

MagicMoonlight•1mo ago

What I really want is a game where each of the NPCs has a tiny model like this, so you can actually talk to them.

GuB-42•1mo ago

I thought about this, chatbots existed well before LLMs (Eliza: 1966!) and the only time I have seen a commercially successful game with a (very simple) chatbot was Quake III Arena!

Quake 3 is probably the last game where you would expect a chatbot, as there are few games where storytelling matters less and it is a very little known feature, but Quake 3 bots can react to what you say in the chat, in addition to the usual taunts.

But that's the thing: Quake 3 can do it because it is inconsequential. In a story-driven game like an RPG, NPCs have a well-defined spot in the story and gameplay; they tell you exactly what you need to know, so as not to disrupt the flow of the story. Tell you too much, and they will spoil the big reveal; tell you too little, and you don't know what to do; tell you irrelevant details, and you get lost chasing them. It has to be concise and to the point, so that those who don't really care know what to do to advance the story, but with enough flavor to make the world alive. It is really hard to find the right balance, and if, in addition, you have to incorporate a chatbot, it borders on impossible.

It looks like a good idea on the surface, but it most likely isn't, unless it is clearly not part of the main gameplay loop, as in Quake 3.

Some people had some success using a (big) LLM as a DM in D&D, which I think is easier since it can make up the story as it advances, it is much harder to make up game elements in a computer RPG that are not programmed in.

boznz•1mo ago

Great work. What is your timeline to AGI?

fuzzfactor•1mo ago

Can't possibly be further than just around the corner.

RustyRussell•1mo ago

I'm thinking early April?

lostmsu•1mo ago

Did you train the model with quantization awareness? How?

jrdres•1mo ago

It runs, but it would be very slow on actual hardware.

I tried on a cycle-accurate emulator of a TRS-80 Model I with Omikron CP/M mapper. Most Z-80 machines of the time were 4MHz, but the TRS-80 was only 1.77 MHz.

1. Type "GUESS", get question prompt.

2. User types: "Are you an animal?", ENTER key

3. Wait 25 seconds

4. Program prints "N"

5. Wait 20 seconds

6. Program prints "O"

7. Wait 23 seconds

8. Program prints linefeed, returns to question prompt

Total time to return 2-char answer to user's question: 1 min 9 sec or so. I bet a longer answer would take proportionally longer.

"The wonder isn't that it does it well, it's a wonder it does it at all."

gp2000•1mo ago

Though it'll still be kinda slow on a Model I, I've written Z-80 code for the network evaluation that is about 9 times faster. I imagine the pull request will end up in the main depot, but for now you can find it at https://github.com/gp48k/z80ai

I think I can do a little bit better; maybe 10% faster.

gp2000•1mo ago

Well, I was pessimistic. Just pushed an update that slightly more than doubles the execution speed with a PR to the main depot pending. It is very close to 20 times faster than the original.
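To put the numbers in this subthread side by side, here is a rough Python estimate. It takes the three per-character waits reported above on the 1.77 MHz Model I and scales them by clock speed and by the claimed speedups. It assumes run time scales inversely with clock frequency and that the speedup applies uniformly, which ignores wait states, I/O, and anything outside the network evaluation, so treat the results as ballpark figures only.

```python
# Waits reported upthread for the "NO" answer, in seconds.
WAITS_S = [25, 20, 23]
MEASURED_MHZ = 1.77   # TRS-80 Model I clock, as reported

def estimate(total_s, clock_mhz=MEASURED_MHZ, speedup=1.0):
    """Scale a measured time to another clock and code speedup (idealized)."""
    return total_s * (MEASURED_MHZ / clock_mhz) / speedup

total = sum(WAITS_S)

for clock in (1.77, 4.0):
    for speedup in (1, 9, 20):
        secs = estimate(total, clock, speedup)
        print(f"{clock:>4} MHz, {speedup:>2}x code: ~{secs:.1f} s")
```

Under those assumptions the same two-character answer drops from about 68 seconds to a few seconds with the 20x code, even on the slower machine.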
