frontpage.

NASA recalls dysfunction, emotions during Boeing's botched Starliner flight

https://www.reuters.com/business/aerospace-defense/nasa-chief-slams-boeing-agency-failures-botche...
1•JumpCrisscross•1m ago•0 comments

IEEE 802.19.3: Coexistence Recommendations for sub-1-GHz IEEE 802.11 and 802.15.4

https://ieeexplore.ieee.org/document/10148905
1•teleforce•3m ago•0 comments

Fault tolerant message passing C# with NATS.io: Use Distributed Object Store

https://nats-io.github.io/nats.net/documentation/object-store/intro.html
1•northlondoner•4m ago•1 comments

Show HN: Google Drive CLI for LLMs / Coding Agents

https://github.com/NmadeleiDev/google-drive-cli
1•Gregoryy•6m ago•0 comments

Ente Locker

https://ente.io/blog/locker/
1•sylens•6m ago•0 comments

Show HN: LinkedRecords – A Server-Sovereign Alternative to Firebase

https://linkedrecords.com/
1•WolfOliver•11m ago•0 comments

Django ORM Standalone: Querying an existing database

https://www.paulox.net/2026/02/20/django-orm-standalone-database-inspectdb-query/
1•pauloxnet•13m ago•0 comments

Gentoo Linux moves away from GitHub due to AI

https://www.pcgamer.com/software/linux/after-microsoft-couldnt-keep-its-ai-hands-to-itself-a-noto...
2•majkinetor•14m ago•0 comments

Is there a way to sort/filter by score on Hacker News?

1•7gorillaz•14m ago•0 comments

Nvidia and OpenAI abandon unfinished $100B deal in favour of $30B investment

https://www.ft.com/content/dea24046-0a73-40b2-8246-5ac7b7a54323
4•zerosizedweasle•17m ago•0 comments

Show HN: Geo-lint – open-source linter for GEO (AI search visibility)

https://github.com/IJONIS/geo-lint
1•ijonis•18m ago•1 comments

The battle over Scott Adams' AI afterlife

https://www.businessinsider.com/scott-adams-death-ai-avatar-resurrection-ethics-debate-family-bac...
2•pseudolus•18m ago•1 comments

Sounds in the city: on the utility of acoustic landmarks

https://www.tandfonline.com/doi/full/10.1080/13658816.2026.2617940
1•thinkingemote•21m ago•0 comments

Show HN: E-Rechnung Push – E-Invoicing for German SMBs (ZUGFeRD/XRechnung)

https://e-rechnung-push.de
1•Contenagent•21m ago•0 comments

IR³Pidgins – Universal Language via Pure Flux

https://bitcoin-zero-down-2ea152.gitlab.io/gallery/gallery-item-neg-895/
1•machardmachard•21m ago•2 comments

The Existence and Behavior of Secondary Attention Sinks

https://arxiv.org/abs/2512.22213
1•thw20•22m ago•0 comments

Show HN: OkaiDokai, tool-level firewall for OpenClaw, Claude Code and Codex

https://okaidokai.com
1•cedel2k1•25m ago•0 comments

A Bug Is a Bug, but a Patch Is a Policy: The Case for Bootable Containers

https://tuananh.net/2026/02/20/patch-is-policy/
1•tuananh•26m ago•0 comments

Postgres for analytics: these are the ways

https://www.justpostgres.tech/blog/postgres-for-analytics
2•magden•26m ago•0 comments

Investigating Climate Change Adaptation

https://kit.exposingtheinvisible.org/en/climate-change-adaptation.html
2•Anon84•27m ago•0 comments

Show HN: From Clawdbot to OpenAI: Dissecting the supply chain that sold out

https://the-mind-of-ai.com/posts/openclaw-necropsy/
1•agentic-wiki•29m ago•0 comments

Show HN: Tamper-proof work verification with Ed25519 and RFC 3161 TSA

https://yourbeforeafterwork.netlify.app/
4•sudeshss•29m ago•1 comments

I used Claude Code and GSD to build the accessibility tool I've always wanted

https://blakewatson.com/journal/i-used-claude-code-and-gsd-to-build-the-accessibility-tool-ive-al...
3•todsacerdoti•31m ago•0 comments

Sam Altman says companies are 'AI washing' by blaming layoffs on the technology

https://fortune.com/2026/02/19/sam-altman-confirms-ai-washing-job-displacement-layoffs/
2•Betelbuddy•32m ago•0 comments

AWS Well-Architected Framework

https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html
1•Kinrany•34m ago•0 comments

Irish man detained by ICE Update – It's not as it seems

https://www.independent.ie/irish-news/its-a-mess-of-his-own-making-says-daughter-of-seamus-cullet...
1•cauliflower99•37m ago•1 comments

Airbus CEO Criticizes Pratt and Whitney over Engine Supply Issues

https://simpleflying.com/airbus-says-engine-delays-are-stalling-a320-deliveries/
1•Betelbuddy•38m ago•0 comments

Outreach domain for personalized cold emails? LLMs don't agree

2•BroTechLead•39m ago•0 comments

Show HN: Realtime Python Apps Without WebSockets and React Bloat (Stario)

1•bobowski•40m ago•0 comments

Using classic dev books to guide AI agents?

1•ZLStas•41m ago•2 comments

The path to ubiquitous AI (17k tokens/sec)

https://taalas.com/the-path-to-ubiquitous-ai/
134•sidnarsipur•1h ago

Comments

notenlish•1h ago
Impressive stuff.
baq•1h ago
one step closer to being able to purchase a box of llms on aliexpress, though 1.7ktok/s would be quite enough
Havoc•1h ago
That seems promising for applications that require raw speed. Wonder how much they can scale it up - 8B model quantized is very usable but still quite small compared to even bottom end cloud models.
metabrew•1h ago
I tried the chatbot. jarring to see a large response come back instantly at over 15k tok/sec

I'll take one with a frontier model please, for my local coding and home ai needs..

grzracz•54m ago
Absolute insanity to see a coherent text block that takes at least 2 minutes to read generated in a fraction of a second. Crazy stuff...
pjc50•44m ago
Accelerating the end of the usable text-based internet one chip at a time.
kleiba•40m ago
Yes, but the quality of the output leaves something to be desired. I just asked about some sports history and got a mix of correct information and totally made-up nonsense. Not unexpected for an 8B model, but it raises the question of what the use case is for such small models.
djb_hackernews•28m ago
You have a misunderstanding of what LLMs are good at.
cap11235•24m ago
Poster wants it to play Jeopardy, not process text.
paganel•15m ago
Not sure if you're correct, as the market is betting trillions of dollars on these LLMs, hoping that they'll come close to doing what the OP expected of them in this case.
IshKebab•7m ago
I don't think he does. Larger models are definitely better at not hallucinating. Enough that they are good at answering questions on popular topics.

Smaller models, not so much.

stabbles•40m ago
Reminds me of that solution to Fermi's paradox, that we don't detect signals from extraterrestrial civilizations because they run on a different clock speed.
dintech•32m ago
Iain M Banks’ The Algebraist does a great job of covering that territory. If an organism had a lifespan of millions of years, it might perceive time and communication very differently from, say, a house fly or us.
xyzsparetimexyz•26m ago
:eyeroll:
impossiblefork•1h ago
So I'm guessing this is some kind of weights as ROM type of thing? At least that's how I interpret the product page, or maybe even a sort of ROM type thing that you can only access by doing matrix multiplies.
readitalready•1h ago
You shouldn't need any ROM. It's likely the architecture is just fixed hardware with weights loaded in via scan flip-flops. If it was me making it, I'd just design a systolic array. Just multipliers feeding into multipliers, without even going through RAM.
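For anyone who hasn't run into the idea: a toy weight-stationary MAC chain in a few lines of Python, just to illustrate "multipliers feeding into multipliers" with the weights fixed in place. The PE class and wiring here are made up for illustration, not how Taalas actually builds it.

  class PE:
      def __init__(self, weight):
          self.weight = weight              # fixed at "manufacture" time

      def step(self, activation, partial_sum_in):
          # multiply-accumulate, then hand the running sum to the next PE
          return partial_sum_in + self.weight * activation

  def dot_product(weights, activations):
      chain = [PE(w) for w in weights]      # one PE per weight, wired in series
      acc = 0.0
      for pe, a in zip(chain, activations):
          acc = pe.step(a, acc)
      return acc

  print(dot_product([0.5, -1.0, 2.0], [4.0, 3.0, 1.0]))   # 0.5*4 - 1*3 + 2*1 = 1.0
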
loufe•1h ago
Jarring to see these other comments so blindly positive.

Show me something at a model size 80GB+ or this feels like "positive results in mice"

hkt•58m ago
Positive results in mice, also known as a promising proof of concept. At this point, anything which deflates the enormous bubble around GPUs, memory, etc. is a welcome remedy. A decent amount of efficient, "good enough" AI will change the market very considerably, adding a segment for people who don't need frontier models. I'd be surprised if they didn't end up releasing something a lot bigger than they have.
viraptor•57m ago
There are a lot of problems solved by tiny models. The huge ones are fun for large programming tasks, exploration, analysis, etc. but there's a massive amount of processing <10GB happening every day. Including on portable devices.

This is great even if it can't ever run Opus. Many of those little use cases will be extremely happy with something like Phi at lightning speed.

aurareturn•1h ago
This requires 10 chips for an 8 billion q3 param model. 2.4kW.

10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.

Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Interesting design for niche applications.

What is a task that is extremely high value, requires only small-model intelligence, requires tremendous speed, is OK to run in the cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

teaearlgraycold•53m ago
I'm thinking the best end result would come from custom-built models. An 8 billion parameter generalized model will run really quickly while not being particularly good at anything. But the same parameter count dedicated to parsing emails, RAG summarization, or some other specialized task could be more than good enough while also running at crazy speeds.
danpalmer•53m ago
Alternatively, you could run far more RAG and thinking to integrate recent knowledge; I would imagine models designed for this would put less emphasis on world knowledge and more on agentic search.
freeone3000•8m ago
Maybe; models with more embedded associations are also better at search. (Intuitively, this tracks; a model with no world knowledge has no awareness of synonyms or relations (a pure markov model), so the more knowledge a model has, the better it can search.) It’s not clear if it’s possible to build such a model, since there doesn’t seem to be a scaling cliff.
pjc50•51m ago
Where are those numbers from? It's not immediately clear to me that you can distribute one model across chips with this design.

> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.

aurareturn•45m ago

  > The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
dakolli•36m ago
So it lights money on fire extra fast, AI focused VCs are going to really love it then!!
adityashankar•42m ago
This depends on how much better the models get from now on. If Claude Opus 4.6 were transformed into one of these chips and ran at a hypothetical 17k tokens/second, I'm sure that would be astounding; it also depends on how much better Claude Opus 5 turns out to be compared to the current generation.
aurareturn•28m ago
I’m pretty sure they’d need a small data center to run a model the size of Opus.
Shaanveer•51m ago
ceo
charcircuit•37m ago
No one would ever give such a weak model that much power over a company.
thrance•46m ago
> What is a task that is extremely high value, requires only small-model intelligence, requires tremendous speed, is OK to run in the cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

Video game NPCs?

aurareturn•38m ago
Doesn’t pass the high-value and tremendous-speed tests.
hkt•1h ago
Reminds me of when bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
hxugufjfjf•1h ago
It was so fast that I didn't realise it had sent its response. Damn.
rotbart•58m ago
Hurrah, its dumb answer to the now classic "the car wash is 100m away, should I drive or walk?" appeared very quickly.
Lalabadie•49m ago
It's an 8B parameter model from a good while ago, what were your expectations?
dakolli•1h ago
try here, I hate llms but this is crazy fast. https://chatjimmy.ai/
bmacho•52m ago

  "447 / 6144 tokens"
  "Generated in 0.026s • 15,718 tok/s"
This is crazy fast. I had always predicted this kind of speed was ~2 years in the future, but it's here now.
Lalabadie•50m ago
The full answer pops in milliseconds, it's impressive and feels like a completely different technology just by foregoing the need to stream the output.
FergusArgyll•32m ago
Because most models today generate slowish, they give the impression of someone typing on the other end. This is just <enter> -> wall of text. Wild
grzracz•59m ago
This would be killer for exploring simultaneous thinking paths and council-style decision taking. Even with Qwen3-Coder-Next 80B if you could achieve a 10x speed, I'd buy one of those today. Can't wait to see if this is still possible with larger models than 8B.
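A rough sketch of what that council pattern could look like: fan the same question out to N parallel samples of a very fast small model and majority-vote the answers. query_model() and its random stand-in reply are invented here just to keep the sketch runnable; swap in whatever client the hardware actually exposes.

  from collections import Counter
  from concurrent.futures import ThreadPoolExecutor
  import random

  def query_model(prompt, temperature=0.8):
      # Placeholder for the chip's inference API (hypothetical).
      return random.choice(["walk", "drive"])

  def council_answer(prompt, n_members=9):
      with ThreadPoolExecutor(max_workers=n_members) as pool:
          answers = list(pool.map(lambda _: query_model(prompt), range(n_members)))
      # At 15k+ tok/s per member, even a big council comes back in well under a second.
      winner, votes = Counter(a.strip().lower() for a in answers).most_common(1)[0]
      return winner

  print(council_answer("The car wash is 100m away, should I drive or walk?"))
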
aurareturn•56m ago
It uses 10 chips for 8B model. It’d need 80 chips for an 80b model.

Each chip is the size of an H100.

So 80 H100 to run at this speed. Can’t change the model after you manufacture the chips since it’s etched into silicon.

grzracz•50m ago
I'm sure there are plenty of optimization paths left for them if they're a startup. And IMHO smaller models will keep getting better. And it's a great business model, people having to buy your chips for each new LLM release :)
aurareturn•48m ago
One more thing. It seems like this is a Q3 quant. So only 3GB RAM requirement.

10 H100-sized chips for a 3GB model.

I think it’s a niche of a niche at this point.

I’m not sure what optimization they can do since a transistor is a transistor.
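For what it's worth, the 3GB figure is just the raw weight storage, assuming a plain 3-bit pack with none of the per-block scale overhead that real Q3 formats add:

  params = 8e9                                # Llama 3.1 8B
  bits_per_weight = 3                         # the Q3 quant assumed above
  print(params * bits_per_weight / 8 / 1e9)   # = 3.0 GB of weight storage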

ubercore•37m ago
Do we know that it needs 10 chips to run the model? Or are the servers for the API and chatbot just specced with 10 boards to distribute user load?
FieryTransition•33m ago
If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is the transistor density for whatever modern process they use. This will give you a lower bound for the size of the wafers.
dsign•58m ago
This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...
brazzy•30m ago
No, it's ASICs, but for AI.
viftodi•56m ago
I tried the trick question I saw here before, about making 1000 with nine 8s and additions only.

I know it's not a reasoning model, but I kept pushing it and eventually it gave me this as part of its output:

888 + 88 + 88 + 8 + 8 = 1060, too high... 8888 + 8 = 10000, too high... 888 + 8 + 8 +ประก 8 = 1000,ประก

I googled the strange symbol; it seems to mean "set" in Thai?

danpalmer•54m ago
I don't think it's very valuable to talk about the model here, the model is just an old Llama. It's the hardware that matters.
bloggie•55m ago
I wonder if this is the first step towards AI as an appliance rather than a subscription?
dust42•55m ago
This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

Tech summary:

  - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
  - probably no KV cache
  - 880mm^2 die, TSMC 6nm, 53B transistors
  - presumably 200W per chip
  - 20x cheaper to produce
  - 10x less energy per token for inference
  - max context size: flexible
  - mid-sized thinking model upcoming this spring on same hardware
  - next hardware supposed to be FP4 
  - a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25-year careers across AMD, Nvidia and others, and $200M in VC funding so far.

Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

Not exactly a competitor for Nvidia but probably for 5-10% of the market.

With a bit of googling and asking various AIs, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Also no info on how well the speed scales with the model size.
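Reproducing that ballpark from the numbers already in this thread (880mm^2 die for the 8B model, ~$0.20/mm^2 as a rough 6nm wafer cost, not a quote), and ignoring yield loss, packaging and test:

  die_area_mm2 = 880                # per the tech summary above
  cost_per_mm2 = 0.20               # rough wafer cost estimate
  die_cost = die_area_mm2 * cost_per_mm2
  print(die_cost)                   # ~$176 of silicon per chip
  print(die_cost / 8)               # ~$22 of silicon per billion parameters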

oliwary•52m ago
This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.
aurareturn•51m ago
Don’t forget that the 8B model requires 10 of said chips to run.

And it’s a 3bit quant. So 3GB ram requirement.

If they run 8B using native 16bit quant, it will use 60 H100 sized chips.

dust42•43m ago
> Don’t forget that the 8B model requires 10 of said chips to run.

Are you sure about that? If true it would definitely make it look a lot less interesting.

aurareturn•36m ago
Their 2.4 kW is for 10 chips it seems based on the next platform article.

I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

audunw•12m ago
It doesn’t make any sense to think you need the whole server to run one model. It’s much more likely that each server runs 10 instances of the model

1. It doesn’t make sense in terms of architecture. It’s one chip. You can’t split one model over 10 identical hard-wired chips.

2. It doesn’t add up with their claims of better power efficiency. 2.4kW for one model would be really bad.

moralestapia•8m ago
Thanks for having a brain.

Not sure who started that "split into 10 chips" claim, it's just dumb.

This is Llama 3.1 8B hardcoded (literally) on one chip. That's what the startup is about; they emphasize this multiple times.

elternal_love•35m ago
Here we go towards really smart robots. It is interesting what kinds of different model chips they can produce.
varispeed•31m ago
There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.
small_model•17m ago
That's not how they work. Pro tip: maybe don't comment until you have a good understanding?
zozbot234•25m ago
Low-latency inference is a waste of power. If you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.
dust42•13m ago
Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.
vessenes•15m ago
This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing 12k tokens/second on Llama 2 13B FP8. Knowing these architectures, that's likely a 100+-ish batched run, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is at milliseconds.

3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.
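For anyone unfamiliar with the speculative decoding pairing mentioned above, here is the greedy variant in sketch form: the fast draft model proposes k tokens, the big model checks them in one batched pass, and you keep the agreeing prefix. draft_next() and target_next_batch() are stand-ins for whatever inference calls you actually have, not any real API.

  def speculative_step(prefix, k, draft_next, target_next_batch):
      # 1. Draft k tokens autoregressively with the cheap model.
      ctx = list(prefix)
      draft = []
      for _ in range(k):
          t = draft_next(ctx)
          draft.append(t)
          ctx.append(t)

      # 2. One batched pass of the big model yields its greedy choice at each
      #    of the k draft positions, plus one bonus position at the end.
      target = target_next_batch(list(prefix), draft)   # length k + 1

      # 3. Keep draft tokens while they match the big model, then take the big
      #    model's token at the first mismatch (or the bonus token if all k
      #    matched). Output is identical to running the big model alone.
      out = []
      for i in range(k):
          if draft[i] == target[i]:
              out.append(draft[i])
          else:
              out.append(target[i])
              return out
      out.append(target[k])
      return out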

moralestapia•54m ago
Wow, this is great.

To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!

Also, for many tasks I can think of, you don't really need the best of the best of the best, cheap and instant inference is a major selling point in itself.

gozucito•54m ago
Can it scale to an 800 billion param model? 8B parameter models are too far behind the frontier to be useful to me for SWE work.

Or is that the catch? Either way I am sure there will be some niche uses for it.

taneq•52m ago
Spam. :P
Lionga•10m ago
so 90% of the AI market?
Dave3of5•53m ago
Fast, but the output is shit due to the constrained model they used. Doubt we'll ever get something like this for the large-param decent models.
raincole•50m ago
It's crazily fast. But 8B model is pretty much useless.

Anyway VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.

stuxf•47m ago
I totally buy the thesis on specialization here, I think it makes total sense.

Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little high, but someone should try the napkin math comparing the throughput-to-power ratio against the H200 and other chips.
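A first pass at that napkin math, using only figures quoted in this thread plus the H200's ~700W board power, so treat it as order-of-magnitude at best. The answer also hinges on the unresolved question upthread of whether one model needs one card (~200W) or the whole ten-card server (~2.5kW):

  demo_tok_per_s = 15_700                  # from the chatjimmy numbers upthread
  per_card_w, per_server_w = 200, 2_500    # figures quoted in this thread
  print(demo_tok_per_s / per_card_w)       # ~78 tok/J if one card runs the model
  print(demo_tok_per_s / per_server_w)     # ~6.3 tok/J if it takes all ten cards

  # Batched H200 figure cited upthread (~12k tok/s) against ~700 W board power:
  print(12_000 / 700)                      # ~17 tok/J, batched, GPU only, host excluded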

dagi3d•46m ago
wonder if at some point you could swap the model as if you were replacing a cpu in your pc or inserting a game cartridge
mips_avatar•37m ago
I think the thing that makes 8b sized models interesting is the ability to train unique custom domain knowledge intelligence and this is the opposite of that. Like if you could deploy any 8b sized model on it and be this fast that would be super interesting, but being stuck with llama3 8b isn't that interesting.
ACCount37•32m ago
The "small model with unique custom domain knowledge" approach has a very low capability ceiling.

Model intelligence is, in many ways, a function of model size. A small model well fit for a given domain is still crippled by being small.

Some things don't benefit from general intelligence much. Sometimes a dumb narrow specialist really is all you need for your tasks. But building that small specialized model isn't easy or cheap.

Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on principle.

hbbio•35m ago
Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: Plain HTML would do if "perfect", or you would expect something heavily designed. But script-kiddie vibe coded seems off.

The idea is good though and could work.

ACCount37•25m ago
Strange that they raised money at all with an idea like this.

It's a bad idea that can't work well. Not while the field is advancing the way it is.

Manufacturing silicon is a long pipeline - and in the world of AI, one year of capability gap isn't something you can afford. You build a SOTA model into your chips, and by the time you get those chips, it's outperformed at its tasks by open weights models half their size.

Now, if AI advances somehow ground to a screeching halt, with model upgrades coming out every 4 years, not every 4 months? Maybe it'll be viable. As is, it's a waste of silicon.

small_model•14m ago
Poverty of imagination here; there are plenty of uses for this, and it's a prototype at this stage.
fragkakis•35m ago
The article doesn't say anything about the price (it will be expensive), but it doesn't look like something that the average developer would purchase.

An LLM's effective lifespan is a few months (i.e. the amount of time it is considered top-tier), so it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.

An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.

YetAnotherNick•34m ago
17k tokens/sec works out to about $0.18/chip/hr for an H100-sized chip if they want to compete with the market rate[1]. But 17k tokens/sec could lead to some new use cases.

[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...
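One way to reconstruct that figure, assuming a market rate of roughly $0.03 per million output tokens for Llama 3.1 8B and the ten-cards-per-server setup discussed upthread (both assumptions, not quoted prices):

  tok_per_hr = 17_000 * 3_600               # ~61.2M tokens per hour per server
  revenue_per_hr = tok_per_hr / 1e6 * 0.03  # assuming ~$0.03 per 1M output tokens
  print(revenue_per_hr)                     # ~$1.84/hr for the whole box
  print(revenue_per_hr / 10)                # ~$0.18 per chip per hour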

niek_pas•30m ago
> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.

…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.

est31•30m ago
I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.

It could give a boost to the industry of electron microscopy analysis as the frontier model creators could be interested in extracting the weights of their competitors.

The high speed of model evolution has interesting consequences on how often batches and masks are cycled. Probably we'll see some pressure on chip manufacturers to create masks more quickly, which can lead to faster hardware cycles. Probably with some compromises, i.e. all of the util stuff around the chip would be static, only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.

clbrmbr•29m ago
What would it take to put Opus on a chip? Can it be done? What’s the minimum size?
FieryTransition•28m ago
If it's not reprogrammable, it's just expensive glass.

If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is the transistor density for whatever modern process they use. This will give you a lower bound for the size of the wafers.

This can give huge wafers for a very set model which is old by the time it is finalized.

Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.

retrac98•28m ago
Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.
Mizza•25m ago
This is pretty wild! Only Llama3.1-8B, but this is only their first release so you can assume they're working on larger versions.

So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?

stego-tech•22m ago
I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).

Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.

The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.

small_model•22m ago
Scale this then close the loop and have fabs spit out new chips with latest weights every week that get placed in a server using a robot, how long before AGI?
aetherspawn•21m ago
This is what’s gonna be in the brain of the robot that ends the world.

The sheer speed of how fast this thing can “think” is insanity.

Adexintart•21m ago
The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.
kanodiaayush•14m ago
I'm loving summarization of articles using their chatbot! Wow!
trentnix•13m ago
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, and poor reliability as a chatbot.

What type of latency-sensitive applications are appropriate for a solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?

app13•11m ago
Routing in agent pipelines is another use. "Does user prompt A make sense with document type A?" If yes, continue, if no, escalate. That sort of thing
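Something like this, where ask() is whatever wrapper you have around the fast model; the lambda below is only a stand-in so the sketch runs, not a real API.

  def route(user_prompt, document_type, ask):
      # A sub-millisecond yes/no gate in front of the expensive path.
      gate = (f"Does this request make sense for a '{document_type}' document? "
              f"Answer yes or no.\n\nRequest: {user_prompt}")
      answer = ask(gate).strip().lower()
      return "continue" if answer.startswith("yes") else "escalate"

  print(route("Summarize the Q3 figures", "earnings report", lambda _p: "yes"))
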
freeone3000•6m ago
Maybe summarization? I’d still worry about accuracy but smaller models do quite well.
japoneris•9m ago
I am super happy to see people working on hardware for local LLMs. Yet, isn't it premature? The space is still evolving. Today, I refuse to buy a GPU because I do not know what the best model will be tomorrow. Waiting for an off-the-shelf device to run an Opus-like model.
danielovichdk•6m ago
Is this hardware for sale? The site doesn't say.
shevy-java•6m ago
"Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity."

Sounds like people drinking the Kool-Aid now.

I don't reject that AI has use cases. But I do reject that it is promoted as "unprecedented amplifier" of human xyz anything. These folks would even claim how AI improves human creativity. Well, has this been the case?

faeyanpiraat•5m ago
For me, this is entirely true.

I'm progressing with my side projects like I've never before.

Bengalilol•5m ago
Could anyone provide a price range for such a component?