I'll take one with a frontier model please, for my local coding and home AI needs.
Smaller models, not so much.
Show me something at a model size of 80GB+, or this feels like "positive results in mice."
This is great even if it can't ever run Opus. Plenty of people will be extremely happy with something like Phi at lightning speed.
10 reticle-sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.
Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, requires only small-model intelligence, requires tremendous speed, is OK to run in the cloud due to the power requirements, AND will be used for years without change since the model is etched into silicon?
> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.
Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.
> The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
Video game NPCs?
"447 / 6144 tokens"
"Generated in 0.026s • 15,718 tok/s"
This is crazy fast. I always figured this kind of speed was ~2 years away, but it's here, now.
Each chip is the size of an H100.
So 80 H100-sized chips to run at this speed. Can't change the model after you manufacture the chips, since it's etched into silicon.
10 H100-sized chips for a 3GB model.
I think it’s a niche of a niche at this point.
I’m not sure what optimization they can do since a transistor is a transistor.
I know it's not a reasoning model, but I kept pushing it and eventually it gave me this as part of its output:
888 + 88 + 88 + 8 + 8 = 1060, too high... 8888 + 8 = 10000, too high... 888 + 8 + 8 +ประก 8 = 1000,ประก
I googled the strange symbol; it seems to mean "set" in Thai?
Tech summary:
- 15k tok/s on an 8B dense 3-bit quant (Llama 3.1)
- probably no KV cache
- 880 mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of experience across AMD, Nvidia and others, with $200M in VC funding so far.
Certainly interesting for very low-latency applications that need < 10k tokens of context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
With a bit of googling and asking various AIs, the cost of 1 mm^2 of 6nm wafer is ~$0.20, so 1B parameters need about $20 of die. The larger the die, the lower the yield. Also, there's no info on how well the speed scales with model size.
And it's a 3-bit quant, so a 3GB RAM requirement.
If they ran the 8B at native 16-bit precision, it would use ~60 H100-sized chips.
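For what it's worth, here's a minimal sketch of that napkin math, using only figures floating around this thread (the ~$0.20/mm^2 estimate, the 880 mm^2 die, and the assumption that all ten chips serve the 8B demo); none of these are vendor numbers.

```python
# Napkin math using the figures quoted in this thread; the $/mm^2 rate and the
# "10 chips for the demo" assumption are commenter estimates, not vendor data.
COST_PER_MM2_N6 = 0.20      # ~$ per mm^2 of TSMC 6nm wafer (estimate above)
DIE_AREA_MM2 = 880          # per HC1 chip (reticle-sized)
CHIPS_FOR_DEMO = 10         # assumed: all ten HC1 cards serve the 8B q3 model
PARAMS_B = 8                # Llama 3.1 8B

def weight_gb(params_b: float, bits_per_param: int) -> float:
    """Raw weight storage in GB at a given precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

print(f"3-bit weights:  {weight_gb(PARAMS_B, 3):.1f} GB")    # ~3 GB
print(f"16-bit weights: {weight_gb(PARAMS_B, 16):.1f} GB")   # ~16 GB

# Pricing only the raw silicon area of the demo setup at the estimated wafer cost
total_area = DIE_AREA_MM2 * CHIPS_FOR_DEMO
print(f"total die area: {total_area} mm^2 "
      f"(~${total_area * COST_PER_MM2_N6:,.0f} of wafer, ignoring yield)")

# If area scales linearly with weight bits, 16-bit needs ~16/3 the chips,
# which lands in the same ballpark as the ~60-chip estimate above.
print(f"16-bit chip count: ~{CHIPS_FOR_DEMO * 16 / 3:.0f}")
```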
Are you sure about that? If true it would definitely make it look a lot less interesting.
I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
1. It doesn't make sense in terms of architecture. It's one chip. You can't split one model over 10 identical hardwired chips.
2. It doesn't add up with their claims of better power efficiency. 2.4 kW for one model would be really bad.
Not sure who started that "split into 10 chips" claim; it's just dumb.
This is Llama 3.1 8B hardcoded (literally) on one chip. That's what the startup is about; they emphasize this multiple times.
1) 16k tokens/second is really stunningly fast. There's an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, and so on.
2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing 12k tokens/second on Llama 2 13B FP8. Knowing these architectures, that's likely a batch of 100+, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is down at milliseconds.
3) Jensen has these pareto curve graphs: for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process is likely 30-40% bigger than 4nm and draws that much more power; if we take the numbers they give, extrapolate to an fp8 model (slower) and a smaller geometry (30% faster and lower power), and compare 16k tokens/second for Taalas to 12k tokens/s for an H200, these chips land on the same ballpark curve.
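Roughly, that quick math looks like the sketch below; the node and precision factors are guesses, not measurements, and the H200 figure is the batched TensorRT-LLM number linked above.

```python
# Crude pareto-curve comparison. NODE_FACTOR and PRECISION_FACTOR are guesses
# from the comment above, not measured numbers.
TAALAS_TOKS = 15_718        # demo throughput, Llama 3.1 8B at 3-bit, single stream
H200_TOKS = 12_000          # TensorRT-LLM H200 figure, heavily batched, FP8

NODE_FACTOR = 1.35          # assume 4nm buys ~30-40% speed/power over 6nm
PRECISION_FACTOR = 8 / 3    # assume fp8 weights cost ~8/3 the bits of a 3-bit quant

# Normalize the Taalas number to a hypothetical "4nm, fp8" point
taalas_normalized = TAALAS_TOKS * NODE_FACTOR / PRECISION_FACTOR
print(f"Taalas, crudely normalized to 4nm/fp8: ~{taalas_normalized:,.0f} tok/s")
print(f"H200, batched FP8:                     ~{H200_TOKS:,.0f} tok/s")
# Similar ballpark on throughput; the difference is that the H200 number needs
# a large batch, while the hardwired chip hits it on a single stream, which is
# why time-to-first-token is where it pulls far ahead.
```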
However, I don't think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact, even if you had a full datacenter of H200s already running your model, you'd probably buy a bunch of these to do speculative decoding; it's an amazing use case for them. Speculative decoding relies on smaller distillations or quants to get the first N tokens sorted; only when the big model and small model diverge do you infer on the big model.
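For anyone who hasn't seen it, speculative decoding is basically a draft-and-verify loop; a minimal sketch below, with `draft_model` and `target_model` as hypothetical interfaces rather than any particular library's API.

```python
# Minimal speculative-decoding loop. draft_model / target_model are hypothetical
# interfaces used for illustration, not a real library API.
def speculative_decode(target_model, draft_model, prompt_tokens, k=8, max_tokens=256):
    tokens = list(prompt_tokens)
    while len(tokens) < max_tokens:
        # 1. The cheap model drafts k tokens autoregressively; this is the step
        #    a millisecond-latency hardwired chip would accelerate.
        draft = draft_model.generate(tokens, num_tokens=k)

        # 2. The big model checks the whole draft in one parallel forward pass
        #    and returns the prefix it agrees with.
        accepted = target_model.verify(tokens, draft)
        tokens.extend(accepted)

        # 3. Only on divergence does the big model produce a token itself.
        if len(accepted) < len(draft):
            tokens.append(target_model.next_token(tokens))
    return tokens
```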
Upshot: I think these will sell, even on a 6nm process, and the first thing I'd sell them for is speculative decoding for bread-and-butter frontier models. The thing I'm really very skeptical of is the 2-month turnaround. Getting leading-edge geometry turned around on arbitrary 2-month schedules is... ambitious. Hopeful. We could use other words as well.
I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.
To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!
Also, for many tasks I can think of, you don't really need the best of the best of the best; cheap and instant inference is a major selling point in itself.
Or is that the catch? Either way I am sure there will be some niche uses for it.
Anyway VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.
Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little high; someone should try the napkin math on throughput per watt against the H200 and other chips.
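A first pass at that napkin math, using only numbers quoted in this thread plus an assumed ~700 W board power for the H200; note the thread disagrees about whether the demo uses one card or all ten.

```python
# Rough joules-per-token comparison. The H200 board power is an assumption
# (~700 W SXM), and the thread disputes whether the demo uses 1 or 10 cards.
TAALAS_TOKS = 15_718          # demo throughput, single stream

print(f"Taalas, full 10-card server (2,500 W): ~{2500 / TAALAS_TOKS:.3f} J/token")
print(f"Taalas, single 200 W card:             ~{200 / TAALAS_TOKS:.3f} J/token")
print(f"H200, batched FP8 (~700 W, 12k tok/s): ~{700 / 12_000:.3f} J/token")
# If the demo really needs the whole 2.5 kW box, the GPU still wins on energy
# per token at high batch; if it's one 200 W card, the hardwired chip wins,
# and either way the GPU can't reach this single-stream latency.
```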
Model intelligence is, in many ways, a function of model size. A small model well fit for a given domain is still crippled by being small.
Some things don't benefit from general intelligence much. Sometimes a dumb narrow specialist really is all you need for your tasks. But building that small specialized model isn't easy or cheap.
Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on principle.
The idea is good though and could work.
It's a bad idea that can't work well. Not while the field is advancing the way it is.
Manufacturing silicon is a long pipeline, and in the world of AI, a one-year capability gap isn't something you can afford. You build a SOTA model into your chips, and by the time you get those chips, it's outperformed at its tasks by open-weights models half its size.
Now, if AI advances somehow ground to a screeching halt, with model upgrades coming out every 4 years, not every 4 months? Maybe it'll be viable. As is, it's a waste of silicon.
An LLM's effective lifespan is a few months (i.e., the amount of time it is considered top-tier), so it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.
An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.
[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...
…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.
It could give a boost to the electron-microscopy analysis industry, as frontier-model creators could be interested in extracting their competitors' weights.
The high speed of model evolution has interesting consequences for how often batches and masks are cycled. We'll probably see some pressure on chip manufacturers to create masks more quickly, which could lead to faster hardware cycles. Probably with some compromises, i.e. all of the utility logic around the weights would be static and only the weights portion would change. They might in fact pre-make masks that are complete except for the weights, for even faster iteration.
If you etch the bits into silicon, you have to account for the physical area those bits occupy, which is set by the transistor density of whatever modern process they use. That gives you a lower bound on the amount of silicon.
This can mean huge amounts of silicon for a fixed model that is already old by the time it is finalized.
Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.
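As a rough illustration of that lower bound, using the 880 mm^2 / 53B-transistor figure from the article and an assumed (order-of-magnitude) one transistor per stored weight bit, for a couple of illustrative model sizes:

```python
# Lower bound on die area just to store hardwired weights. The one-transistor-
# per-bit figure is an order-of-magnitude assumption; compute and routing
# only add to this.
DIE_AREA_MM2 = 880
TRANSISTORS_PER_DIE = 53e9
TRANSISTORS_PER_BIT = 1.0            # assumed

bits_per_mm2 = TRANSISTORS_PER_DIE / TRANSISTORS_PER_BIT / DIE_AREA_MM2

def min_weight_area_mm2(params_b: float, bits_per_param: int) -> float:
    """Silicon area needed just for the weights, ignoring everything else."""
    return params_b * 1e9 * bits_per_param / bits_per_mm2

for params_b, bits in [(8, 3), (70, 4), (70, 8)]:   # illustrative model sizes
    area = min_weight_area_mm2(params_b, bits)
    print(f"{params_b}B @ {bits}-bit: >= {area:,.0f} mm^2 "
          f"(~{area / DIE_AREA_MM2:.1f} reticle-sized dies)")
```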
So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?
Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.
The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.
The sheer speed of how fast this thing can “think” is insanity.
What type of latency-sensitive applications are appropriate for a solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
Sounds like people drinking the Kool-Aid now.
I don't reject that AI has use cases. But I do reject the way it's promoted as an "unprecedented amplifier" of human everything. These folks even claim that AI improves human creativity. Well, has that been the case?
I'm progressing with my side projects like I've never before.