Cerebras achieves 2,500T/s on Llama 4 Maverick (400B)

https://www.cerebras.ai/press-release/maverick

91•ByteAtATime•1d ago

Comments

y2244•1d ago

Investors list include Altman and Ilya

https://www.cerebras.ai/company

ryao•1d ago

Their CEO is a felon who plead guilty to accounting fraud:

https://milled.com/theinformation/cerebras-ceos-past-felony-...

Experienced investors will not touch them:

https://www.nbclosangeles.com/news/business/money-report/cer...

I estimated last year that they can only produce about 300 chips per year and that is unlikely to change because there are far bigger customers for TSMC that are ahead of them in priority for capacity. Their technology is interesting, but it is heavily reliant on SRAM and SRAM scaling is dead. Unless they get a foundry to stack layers for their wafer scale chips or design a round chip, they are unlikely to be able to improve their technology very much past the CSE-3. Compute might somewhat increase in the CSE-4 if there is one, but memory will not increase much if at all.

I doubt the investors will see a return on investment.

impossiblefork•1d ago

While the CEO stuff is a problem, I don't think the other stuff matters.

Per chip area WSE-3 is only a little bit more expensive than H200. While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed you will not be using more area in the WSE-3. In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.

The only effect is that the WSE-3s will have a minimum demand before they make sense, whereas an H200 will make sense even with little demand.

ryao•22h ago

I did the math last year to estimate how many wafers per year Nvidia had, and from my recollection it was >50,000. Cerebras with their ~300 per year is not able to handle the inference needs of the market. It does not help that all of their memory must be inside the wafer, which limits the amount of die area they have for actual logic. They have no prospect for growth unless TSMC decides to bless them or they switch to another foundation.

> While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed you will not be using more area in the WSE-3.

You need ~20 wafers to run the Llama 4 Behemoth model on Cerebras hardware. This is close to a million mm^2. The Nvidia hardware that they used in their comparison should have less than 10,000 mm^2 die area, yet can run it fine thanks to the external DRAM. How is the CSE-3 not using more die area?

> In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.

This makes no sense to me. Inference software loads the model once and then uses it multiple times. This should be the same for both Nvidia and Cerebras.

impossiblefork•21h ago

Yes, on an ordinary GPU it loads the weights to GPU memory, but then these weights must be moved from GPU memory onto the chip. But on these the weights can presumably be kept on chip entirely-- that's basically their whole point, so with the Cerebras there's no need to ever move weights to the chip.

Of course these guys depend on getting chips, but so does everybody. I don't know how difficult it is, but all sorts of entities get TSMC 5nm. Maybe they'll get TSMC 3nm and 2nm later than NVIDIA, but it's also possible that they don't.

ryao•19h ago

The CSE-3 is divided into 900,000 PEs, which each have only 48kB of RAM:

https://hc2024.hotchips.org/assets/program/conference/day2/7...

Similarly, the SMs in Blackwell have up to 228kB of RAM:

https://docs.nvidia.com/cuda/archive/12.8.0/pdf/Blackwell_Tu...

If you need anything else, you need to load it from elsewhere. In the CSE-3, that would be from other PEs. In Blackwell, that would be from on package DRAM. Idle time in Blackwell be mitigated by parallelism, since each SM has SRAM for multiple kernels to run in parallel. I believe the CSE-3 is quick enough that they do not need that trick.

The other guy said “you will not be using more area in the WSE-3”. I do not see this die area efficiency. You need many full wafers (around 20 with Llama 4 Maverick) to do the same thing with the CSE-3 that can be done with a fraction of a wafer with Blackwell. Even if you include the DRAM’s die area, Nvidia’s hardware is still orders of magnitude more efficient in terms of die area.

The only advantage Cerebras has as far as I can see is that they are fast on single queries, but they do not dare advertise figures for their total throughput, while Nvidia will happily advertise those. If they were better than Nvidia at throughput numbers, Cerebras would advertise them, since that is what matters for having mass market appeal, yet they avoid publishing those figures. That is likely because in reality, they are not competitive in throughput.

To give an example of Nvidia advertising throughput numbers:

> In a 1-megawatt AI factory, NVIDIA Hopper generates 180,000 tokens per second (TPS) at max volume, or 225 TPS for one user at the fastest.

https://blogs.nvidia.com/blog/ai-factory-inference-optimizat...

Cerebras strikes me as being like Bugatti, which designs cars that go from start to finish very fast at a price that could buy dozens of conventional vehicles, while Nvidia strikes me as being like Toyota, which designs far lower vehicles, but can manufacture them in a volume that is able to handle a large amount of the world’s demand for transport. Bugatti can make enough vehicles to bring a significant proportion of the world from A to B regularly, while Toyota can. Similarly, Cerebras cannot make enough chips to handle any significant proportion of the world’s demand for inference, while Nvidia can.

impossiblefork•19h ago

I don't really see how NVIDIA shipping so many chips matters. If more people want Cerebras chips they will presumably be manufactured.

I agree that Cerebras manufacture <300 wafers per year. Probably around 250-300, calculated from $1.6-2 million per unit and their 2024 revenue.

I don't really see how that matters though. I don't see how core counts matter, but I assume that Cerebras is some kind of giant VLIW-y thing where you can give different instructions to different subprocessors.

I imagine that the model weights would be stored in little bits on each processor and that it does some calculation and hands it on.

Then you never need to load the the weights, the only thing you're passing around is activations with them going from wafer 1, to wafer 2, etc. to wafer 20. When this is running at full speed, I believe that this can be very efficient, better than a small GPU like those made by NVIDIA.

Yes, a lot of the area will be on-chip memory/SRAM, but a lot of it will also be logic and that logic will be computing things instead of being used to move things from RAM to on-chip memory.

I don't have any deep knowledge of this system, really, nothing beyond what I've explained here, but I believe that Mistral are using these systems because they're completely superb and superior to GPUs for their purposes, and they will made a carefully weighed decision based on actual performance and actual cost.

ryao•19h ago

You replied really quickly when I had thought I could sneak in a revision, which dropped the estimates for production numbers. In any case, the Cerebras CSE-3 is extremely inefficient for what it does. Inference is memory bandwidth bound, such that peak performance for a single query should be close to the memory bandwidth divided by the weights. Despite having. 2600x the memory bandwidth, they can only perform 2.5 times faster. 1000x of their supposed memory bandwidth is wasted. There are extreme inefficiencies in their architecture. Meanwhile, Nvidia is often within >80% of what memory bandwidth divided by weights predict their hardware can do.

Mistral is a small fish in the grander scheme of things. I would assume that using Cerebras is a way to try to differentiate themselves in a market where they are largely ignored, which is the reason Mistral is small enough to be able to have their needs handled by Cerebras. If they grow to OpenAI levels, there is no chance of Cerebras being able to handle the demand for them.

Finally, I had researched this out of curiosity last year. I am posting remarks based on that.

impossiblefork•11h ago

Inference is memory bandwidth bound on a GPU, which has very little on-chip memory.

On WSE-3s however, there's enough memory that the model can actually be stored on-chip provided that you have a sufficient number of them. 20 are enough for some of the largest open models.

This, depending on how it's set up, allows more efficient use of what logic is available, for actually doing computations instead of just loading and unloading the weights. This can potentially make a system like this much more efficient than a GPU.

It doesn't matter whether Mistral are small fish or not. I don't agree that they are small fish, but whether or not they are they are experts. They are very capable people. They haven't chosen Cerebras to be different, they've chosen it because they believe it's the best way to do inference.

ryao•2m ago

You are lecturing someone who actually has worked on code for doing inference:

https://github.com/ryao/llama3.c

Your “more efficient” remarks are nonsensical to me. Your “loading and unloading weights” remark would be slightly less nonsensical if you called it to Von Neumann bottleneck, but unfortunately for you, their hardware is so bottlenecked internally that they they are getting less than 0.1% of the performance that their supposedly high memory bandwidth can give them.

Efficiency typically is discussed on things like energy consumption or cost, not the von Neumann bottleneck. Cerebras claims 23kW per CSE-3 and they need about 20 of them for Llama 4 Maverick, so that is 460kW:

https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-b200-20...

Nvidia claims that the power supplies for the DGX B200 consume 14.3 kW max:

https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-...

Actual power consumption will likely be somewhat lower for both, but there is still a huge difference between the two of them.

Cost wise, you need to pay $40 million for the CSE-3 equipment and only $0.5 million for the DGX B200. Paying 80 times the amount for 2.5 times the performance in a batch 1 configuration that nobody uses is absurdly inefficient as far as use of money is concerned. The KV cache needed for context will consume a significant amount of memory, such that you will be limited in both context and simultaneous queries on the Cerebras hardware while Nvidia hardware will be far less constrained from having far more memory. In specific, the DGX B200 has 1.4TB while 20x CSE-3 has 880GB. If you buy 80 of them, you get two orders of magnitude more memory than the CSE-3 gives for the same price. If you actually did do this, then you could say that the Cerebras machine is using less power, but it would be hopelessly outmatched in terms of parallelism. Let’s say that it can do 16 queries in parallel while maintaining 2500T/s on each (which is generous toward Cerebras) for a total of 40,000T/sec. Without doing any parallel queries at all on the Nvidia hardware, would be doing 80,000T/sec. Let’s say you do 16 parallel queries on the Nvidia hardware too and lets generously (toward Cerebras) assume that each only gives 500 T/sec. Then you are doing 640,000T/sec. Of course, Nvidia has the ability to go higher. The Cerebras hardware on the other hand cannot keep going higher without more $2 million nodes to expand the memory. Each of which could buy 4 more Nvidia DGX B200 nodes that would do even more inferencing.

Calling Cerebras the best way of doing inference is ridiculous. We are talking about doing linear algebra. There is no best way of doing it. Pointing at Mistral to say that Cerebras has the best way is an absurd appeal to authority. None of the major players are using them, since they are incapable of handling their needs. The instant responses are nice and are a way for mistral to differentiate itself, but their models are not as good as those from others and few people use them.

moralestapia•1d ago

>Their CEO is a felon who plead guilty to accounting fraud [...]

Whoa, I didn't know that.

I know he's very close to another guy I know first hand to be a criminal. I won't write the name here for obvious reasons, also not my fight to fight.

I always thought it was a bit weird of them to hang around because I never got that vibe from Feldman, but ... now I came to know about this, 2nd strike I guess ...

canucker2016•1d ago

CNBC lists several other red flags (one customer generating >80% of revenue, non-top-tier investment bank/auditor).

see https://www.cnbc.com/2024/10/11/cerebras-ipo-has-too-much-ha...

IPO was supposed to happen in autumn 2024.

arisAlexis•1d ago

Openai wanted to buy them. G42 the largest player in middle east owne a big chunk. You are simply wrong about big investors not touching them but my guess is they will be bought soon by Meta or Apple.

threeseed•1d ago

> Apple

I can't imagine Apple being interested.

Their priority is figuring out how to optimise Apple Silicon for LLM inference so it can be used in laptops, phones and data centres.

bigyabai•1d ago

I can only imagine Apple being interested. Their NPU hardware is slower than Qualcomm's, their GPUs have been lagging behind Nvidia in all fields since launch, and they refuse to work with any industry leaders to ship a COTS solution. They don't have many options left on the table, "figuring out how to optimize Apple Silicon" has been the plan for 6 years now and no CUDA-killers have come up out of the woodworks since then.

Either Apple entirely forfeits AI to the businesses capable of supplying it, or they change their tactic and do what Apple does best; grossly overpay for a moonshot startup that promises "X for the iPhone". I don't know if that implicates Cerebras, but clearly Apple didn't retain the requisite talent to compete for commercial AI inference capacity.

ryao•22h ago

Cerebras’ technology works by using an entire wafer as a chip and power draw is 23kW if I recall correctly. Their technology cannot be scaled down and only works when scaling up. They could not be more useless for Apple’s purposes. Acquiring them would only give them a bunch of chip design engineers that might or might not be able to make a decent NPU that uses DRAM.

That said, Apple has some talented people already and they likely just need to iterate to make their designs better. Bringing new people on board would just slow progress (see the mythical man month).

ryao•1d ago

> At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.

This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

As for the speed record, it seems important to keep it in context. That comparison is only for performance on 1 query, but it is well known that people run potentially hundreds of queries in parallel to get their money out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput for comparison, I wonder if it will still look so competitive in absolute performance.

Also, Cerebras is the company that not only was saying that their hardware was not useful for inference until some time last year, but even partnered with Qualcomm with the claim that Qualcomm’s accelerators had a 10x price performance improvement over their things:

https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

Their hardware does inference with FP16, so they need ~20 of their CSE-3 chips to run this model. Each one costs ~$2 million, so that is $40 million. The DGX B200 that they used for their comparison costs ~$500,000:

https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-...

You only need 1 DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price it costs to buy enough Cerebras hardware to run Llama 4 Maverick.

Their latencies are impressive, but beyond a certain point, throughput is what counts and they don’t really talk about their throughput numbers. I suspect the cost to performance ratio is terrible for throughput numbers. It certainly is terrible for latency numbers. That is what they are not telling people.

Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D stack their wafer scale chips, during fabrication at TSMC, or designing round chips, they have a dead end product since it relies on using an entire wafer to be able to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can use more silicon for compute, which is still shrinking.

littlestymaar•1d ago

> This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

Emphasis mine.

Behemoth may become the largest and most powerful llama model, but right now it's nothing but vaporware. Maverick is currently the largest and more powerful llama model today (and if I had to bet, my money would be on Meta discarding Llama4 Behemoth entirely it eventually without having released it, and moving on to the next version number).

timschmidt•1d ago

> I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D stack their wafer scale chips

AMD and TSMC are stacking SRAM on the chip scale. I imagine they could accomplish it at the wafer scale. It'll be neat if we can get hundreds of layers in time, like flash.

Your analysis seems spot on to me.

nsteel•1d ago

Assume you meant Intel, rather than AMD?

timschmidt•1d ago

https://www.amd.com/en/products/processors/technologies/3d-v... and future developments.

nsteel•20h ago

Yes, and it's TSMC enabling this. Lots of TSMC's customers going this route, not just AMD. Seemed odd to call out AMD as if they've got any special sauce here.

timschmidt•20h ago

My choices can seem odd to you, that's fine. Have a nice day!

latchkey•19h ago

More on the CPU side than the GPU side. GPU is still dominated by HBM.

attentive•1d ago

> Also, Cerebras is the company that not only was saying that their hardware was not useful for inference until some time last year, but even partnered with Qualcomm with the claim that Qualcomm’s accelerators had a 10x price performance improvement over their things

Mistral says they run Le Chat on Cerebras

arisAlexis•1d ago

Also perplexity

ryao•21h ago

How is that related to the claim that Cerebras themselves made about their hardware’s price performance ratio?

https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

bubblethink•1d ago

>Each one costs ~$2 million, so that is $40 million.

Pricing for exotic hardware that is not manufactured at scale is quite meaningless. They are selling tokens over an API. The token pricing is competitive with other token APIs.

jenny91•1d ago

I agree on the first. On the second: I would bet a lot of money that they aren't actually breaking even on their API (or even close to). They don't have a "pay as you go" per-token tier, it's all geared up to demonstrate use of their API as a novelty. They're probably burning cash on every single token. But their valuation and hype has surely gone way up since they got onto LLMs.

bubblethink•1d ago

They seem to have dev tier pricing (https://inference-docs.cerebras.ai/support/pricing). It's likely that they don't make much money on this and only make money on large enterprise contracts.

ryao•22h ago

Last year, I took the time to read through public documents and estimated that their annual production was limited to ~300 wafers per year from TSMC. That is not Nvidia level scale, but it is scale.

There are many companies that sell tokens from an API and many more that need hardware to compute tokens. Cerebras posted a comparison of hardware options for these companies, so evaluating it as such is meaningful. It is perhaps less meaningful to the average person who cannot afford the barrier to entry to afford this hardware, but there are plenty of people curious what the options are for the companies that sell tokens through APIs, as those impact available capacity.

latchkey•19h ago

> There are many companies that sell tokens from an API

I was just at Dell Tech World and they proudly displayed a slide during the CTO keynote that said:

"Cost per token decreased 4 orders of magnitude"

Personally speaking, not a business I'd want to get into.

ryao•18h ago

Some context is needed for this. The only way to get a 4 orders of magnitude difference would be to compare incomparable things, like OpenAI’s most expensive model versus llama 3.1 8B.

moralestapia•1d ago

>Their hardware does inference with FP16, so they need ~20 of their CSE-3 chips to run this model.

Care to explain? I don't see it.

acchow•1d ago

CSE-3 chip has 44GB, which can hold 22B parameters in FP16.

400B parameters would need 18 chips. Then you need a bit more ram for other stuff

moralestapia•1d ago

That's on-chip SRAM, comparable to a GPU's L1 cache, of which it typically has megabytes.

CSE systems also come with off-chip memory, comparable to a GPU's memory, but usually in the TB range.

acchow•23h ago

If you want the titled 2500 tokens/second, you need to use the on-chip SRAM

moralestapia•22h ago

What?

Of course they're using the on-chip SRAM, why wouldn't they?

This is a press release from Cerebras about a Cerebras chip, ... of course they are using a Cerebras chip!

Is that not obvious?

ryao•21h ago

They also support external DRAM over their 150GB/sec system IO link. They call it MemoryX and talk about it on these blog posts:

https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-b200-20...

https://www.cerebras.ai/blog/announcing-the-cerebras-archite...

It is useless for inference, but it is great for training. It used to be more prominent on their website, but it is harder to find references to it now that they are mimicking Groq’s business model.

ryao•23h ago

The memory bandwidth for that is 150GB/sec. Inference speed is memory bandwidth bound, so that memory is useless for inference. Discrete GPUs will run circles around the CSE-3 at inference if they tried using the external DRAM.

moralestapia•22h ago

Where do you get those 150GB/sec from?

Here [1] they imply they can reach 1.2Tbps (allegedly, I know), and that's the previous generation ...

1: https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...

rkomorn•22h ago

Doesn't 1.2Tbps / 8 = 150 GBps because 8b = 1B ?

moralestapia•22h ago

That's ... right! Huh, missed that (assuming all units were written properly and mean what they mean).

Edit: yeah, double checked their site and everything. Dang, their IO is indeed "slow". They claim 1 microsecond latencies, but still, an H100 can move much more data than that.

ryao•21h ago

The other comment already clarified that 150GB/sec = 1.2Tbps. That said, the CSE-3 did not change this figure. It is buried in their specification sheets somewhere if you care to search for it. I did last year, which is how I know.

x-complexity•1d ago

Pretty much no disagreements IMO.

By the time the CSE-5 is rolled out, it *needs* at least 500GB of SRAM to make it worthwhile. Multi-layer wafer stacking's the only path to advance this chip.

skryl•1d ago

Performance per watt is better than h100 and b200, performance per watt per $ is worse than B200, and it does fp8 just fine

https://arxiv.org/pdf/2503.11698

lern_too_spel•22h ago

Performance per watt per dollar is a useless metric as calculated. You can't spend more money on B200s to get more performance per watt.

ryao•22h ago

Thanks for the correction. They are currently using FP16 for inference according to OpenRouter. I had thought that implied that they could not use FP8 given the pressure that they have to use as little memory as possible from being solely reliant on SRAM. I wonder why they opted to use FP16 instead of FP8.

skryl•19h ago

One caveat is that this paper only covers training, which can be done on a single CS-3 using external memory (swapping weights in and out of SRAM). There is no way that a single CS-3 will hit this record inference performance with external memory so this was likely done with 10-20 CS-3 chips and the full model in SRAM. Definitely can’t compare token/$ with that kind of setup vs a DGX.

addaon•1d ago

> SRAM scaling is dead

I'm /way/ outside my expertise here, so possibly-silly question. My understanding (any of which can be wrong, please correct me!) is that (a) the memory used for LLMs is dominantly parameters, which are read-only during inference; (b) SRAM scaling may be dead, but NVM scaling doesn't seem to be; (c) NVM read bandwidth scales well locally, within an order of magnitude or two of SRAM bandwidth, for wide reads; (d) although NVM isn't currently on leading-edge processes, market forces are generally pushing NVM to smaller and smaller processes for the usual cost/density/performance reasons.

Assuming that cluster of assumptions is true, does that suggest that there's a time down the road where something like a chip-scale-integrated inference chip using NVM for parameter storage solves?

ryao•22h ago

The processes used for logic chips, and the processes used for NVM are typically different. The only case I know of the industry combining them onto a single chip would be Texas Instruments’ MSP430 microcontrollers with FeRAM, but the quantities of FeRAM are incredibly small there and the process technology is ancient. It seems unlikely to me that the rest of the industry will combine the processes such that you can have both on a single wafer, but you would have better luck asking a chip designer.

That said, NVM often has a wear-out problem. This is a major disincentive for using it in place of SRAM, which is frequently written. Different types of NVM have different endurance limits, but if they did build such a chip, it is only a matter of time before it stops working.

addaon•20h ago

> The only case I know of the industry combining them onto a single chip would be Texas Instruments’ MSP430 microcontrollers with FeRAM

Every microcontroller with on-chip NVM would count. Down to 45 nm, this is mostly Flash, with the exception of the MSP430's FeRAM. Below that... we have TI pushing Flash, ST pushing PCM, NXP pushing MRAM, and Infineon pushing (TSMC's) RRAM. All on processes in the 22 nm (planar) range, either today or in the near future.

> This is a major disincentive for using it in place of SRAM, which is frequently written.

But isn't parameter memory written once per model update, for silicon used for inferencing on a specific model? Even with daily writes the typical 10k - 1M allowable writes for most of the technologies above would last decades.

ryao•18h ago

I had been unaware of the others. Anyway, you need writes to the KV cache for every token generated. You are going to hit that fast.

turblety•1d ago

Maybe one day they’ll have an actual api that you can pay per token. Right now it’s the standard “talk to us” if you want to use it.

twothreeone•1d ago

Huh? Just make an account, get your API key, and try out the free tier.. works for me.

https://cloud.cerebras.ai

M4v3R•1d ago

Yep, can confirm, I used their API just fine for Llama 4 Scout for weeks now.

bn-l•1d ago

> that you can pay per token

iansinnott•1d ago

Although not obvious, you _can_ pay them per token. You have to use OpenRouter or Huggingface as the inference API provider.

https://cerebras-inference.help.usepylon.com/articles/192554...

kristianp•1d ago

Interestingly, llama 4 maverick isn't available on that page, only scout.

bn-l•1d ago

Yeap looks like it’s just scout and lower.

turblety•1d ago

Oh, this is cool. Didn’t know they are on openrouter. Thanks.

lordofgibbons•1d ago

Very nice. Now for their next trick they should offer inference on actually useful models like DeepSeek R1 (not the distills).

thawab•1d ago

are the Llama 4 issues fixed? what is it good at? coding is out of the window after the updated R1.

NitpickLawyer•1d ago

Yes, the issues were fixed ~1-2 weeks after release. It's a good "all-rounder" model, best compared to 4o. Good multilingual capabilities, even in languages not specifically highlighted. Fast to run inference on it. Code is not one of its strong suits at all.

bob1029•1d ago

I think it is too risky to build a company around the premise that someone won't soon solve the quadratic scaling issue. Especially, when that company involves creating ASICs.

E.g.: https://arxiv.org/abs/2312.00752

qeternity•1d ago

Attention is not the primary inference bottleneck. For each token you have to load all of the weights (or activated weights) from memory. This is why Cerebras is fast: they have huge memory bandwidth.

Havoc•1d ago

Yeah also strikes me as quite risky. Their gear seems very focused on llama family specifically.

Just takes one breakthrough and it's all different. See the recent diffusion style LLMs for example

tryauuum•1d ago

yes, was not obvious it's not terabytes per second

Alifatisk•1d ago

In the context of LLMs, the unit is token and to measure the output it's tokens per second (T/s)

diggan•1d ago

> The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency

Is this really true today? I don't work in enterprise, so don't know how things look like, but I'm sure lots of people here do, and it feels unlikely that inference latency is the top bottleneck, even above humans or waiting for human input? Maybe I'm just using LLMs very differently from how they're deployed in a enterprise, but I'm by far the biggest bottleneck in my setup currently.

baq•1d ago

It is if you want good results. I’ve been giving Gemini pro prompts for 200+ seconds multiple times per day this week and for such tasks I really like to make it double/triple check and sometimes give the results to Claude for review, too (and vice versa).

Ideally I can just run the prompt 100x and have it pick the best solution later. That’s prohibitively expensive and a waste of time today.

diggan•1d ago

> That’s prohibitively expensive

Assuming you experience is working within enterprise, you're then saying that cost is the biggest bottleneck currently?

Also surprising to me that enterprises would use out-of-the-box models like that, I was expecting at least fine-tuned models be used most of the time, for very specific tasks/contexts, but maybe that's way optimistic.

threeseed•1d ago

Cost is irrelevant when compared to the salaries of the people using them so they will do basic cost controls but nothing too onerous. And cost is never a reason to prevent solutions being built and deployed.

And most enterprises aren't even doing anything advanced with AI. Just doing POCs with chat bots (again) which will likely fail (again). Or trying to do enterprise search engines which are pointless because most content is isolated per team. Or a few OCR projects which is pretty boring and underwhelming.

baq•20h ago

Cost would be the biggest factor if price per token was the same but tokens were arriving 100x faster. (Not particularly unexpected I’d say.)

tiffanyh•1d ago

How do you create a prompt for Gemini to spend 200 seconds and review multiple times.

Is it as simple as stating in the prompt:

  Spend 200+ seconds and review multiple times <question/task>

baq•20h ago

You give it a task from hell which the devil himself outsources, like ‘figure out how these fifty repositories of yaml blobs, jinja templates and code generating code generating hcl generating yaml interact to define the infrastructure, then add something to it with correct iams, then make a matching blob of yaml pipelines to work with that infrastructure’

threeseed•1d ago

Only an insignificant minority of companies are running their own AI LLM models.

Everyone else is perfectly fine using whatever Azure, GCP etc provide. Enterprise companies don't need to be the fastest or have the best user experience. They need to be secure, trusted and reliable. And you get that by using cloud offerings by default and only going third party when there is a serious need.

aktuel•1d ago

If you think that cloud offerings are secure and trustworthy by default you truly must be living under a rock.

threeseed•1d ago

I have worked for a dozen companies all earnt more than $20b a year in revenue. That includes two banks and a hedge fund. All use the cloud.

You must be living under a rock if you think the cloud isn't secure enough for the enterprise.

aktuel•1d ago

Some quant-heads endorsing the latest fad doesn't prove anything. Also they don't care if chinese hackers are vacuuming data cause ballstreet doesn't care about sustainability. But I grant you that secure and trust are just words that don't mean anything anymore anyhow.

subscribed•1d ago

LOL, all fintech are using or entering the "cloud" very heavily. Cloud is here for long enough that claiming it's insecure shows only the immense ignorance.

aktuel•1d ago

https://www.bleepingcomputer.com/news/security/oracle-custom...

Just one of the later examples of a very long list of cloud data breaches affecting millions of users. But hey who cares as long as it does not affect your own bottom line.

subscribed•1d ago

This has affected login data and yeah, it's famously oracle.

Any fintech (and these can afford smart people) is building with defense in depth, encrypting everything with their own keys, using ephemeral credentials (eg issued by hashicorp vault), etc, etc.

You're seemingly applying your own experience with cloud-based storage, like Dropbox, to the enterprise cloud-based infrastructure.

I don't feel like I should spend any time laying out my professional experience with these environments, I guess you could just skim through one of the books and watch a couple hours long video explaining layers of the leading "cloud" offerings.

And yes, eventually the breach will happen. Like it happens on premise all the time. 2014 Sony and 2020 Solar Winds are good examples.

Let's agree to disagree, I really don't want to spend any more time on this, I know how a good solution (passing multiple audits and pentests) looks like, you however have your opinion. I'm not going to fight you :)

Take care!

K0balt•1d ago

Any business using commercial inference providers is potentially risking their value proposition. Everything you send to cloud inference will eventually be gleaned for training data.

Empirically we know that the data is the most valuable input to cloud services, and eventually it will be used, regardless of the user agreement. When the stored data becomes worth more than the company, it will be eaten and stripped by vulture capital. Law of the jungle, baby.

snypher•1d ago

>Cloud is here for long enough that claiming it's insecure shows only the immense ignorance

Such a bizarre interpretation considering we still use SMS

K0balt•1d ago

I think the key here is twofold. First “the cloud” as commonly understood isn’t what anyone here is talking about. The subject is commercial inference providers.

The “cloud”, or Commercial offerings in storage, VMs, etc are reasonably “secure” in a very general context these days, that is generally true.

OTOH “cloud” AI (commercial inference) is going to use your data for training, incorporating your business processes and domain specific competencies into its innate capabilities, which could eventually impact your value proposition. Empirically, this will happen, eventually, regardless of the user agreement that you signed.

Leakage of proprietary competencies is what is meant by being insecure, in this context.

Second, “cloud isn't secure enough for the enterprise” should be replaced with “enterprise actually cares about security except as a cost/benefit analysis”.

Sending your data to someone else’s data center is a really good way for your data to potentially end up on someone else’s computer. In fact, it’s pretty much the point. If security was the priority, they wouldn’t do that.

Toritori12•1d ago

I feel a lot companies do it to reduce liability. It may not be more secure, but it is not their problem.

UltraSane•11h ago

AWS is in fact extremely secure.

qu0b•1d ago

True, the biggest bottleneck is formulating the right task list and ensuring the LLM is directed to find the relevant context it needs. I feel LLMs in their instruction following are often to eager to output rather than using tools (read files) in their reasoning step.

bravesoul2•1d ago

I tried some Llama 4s on Cerebras and they were hallucinating like they were on drugs. I gave it a URL to analyse a post for style and it made it all up and didn't look at the url (or realize that it hadn't looked at it).

geor9e•16h ago

I love Cerebrus. 10-100x faster than the other options. I really wish the other companies realized that some of us prefer our computer be instant. I use their API (with Qwen3 reasoning model) for ~99% of my questions, and the whole answer finishes in under 0.1 seconds. Keeps me in a flow state. Latency is jarring. Especially the 5-10 seconds most AIs take these days, where it's just enough to make switching tasks not worth it. You just have to sit there in statis. If I'm willing to accept any latency, might as well make it a couple minutes in the background, and use a full agent mode or deep research AI at that point. Otherwise I want instant.

How I got a Root Shell on a Credit Card Terminal

M8.2 solar flare, Strong G4 geomagnetic storm watch

Atari Means Business with the Mega ST

Cinematography of "Andor"

Figma Slides Is a Beautiful Disaster

Why DeepSeek is cheap at scale but expensive to run locally

Ask HN: How Are Parents Who Program Teaching Their Kids Today?

A new generation of Tailscale access controls

When Fine-Tuning Makes Sense: A Developer's Guide

Dear diary, today the user asked me if I'm alive

Learning from the Amiga API/ABI

RenderFormer: Neural rendering of triangle meshes with global illumination

Progressive JSON

RSC for Lisp Developers

How I like to install NixOS (declaratively)

Google AI Edge – on-device cross-platform AI deployment

Father Ted Kilnettle Shrine Tape Dispenser

Structured Errors in Go (2022)

Browser extension (Firefox, Chrome, Opera, Edge) to redirect URLs based on regex

Show HN: Patio – Rent tools, learn DIY, reduce waste

A Beautiful Technique for Some XOR Related Problems

New adaptive optics shows details of our star's atmosphere

Ovld – Efficient and featureful multiple dispatch for Python

A Pokémon battle simulation engine

Show HN: A Implementation of Alpha Zero for Chess in MLX

Stepping Back

Roborock Saros Z70 Review: This Robovac's Robotic Arm Is a Swing and a Miss

Why Use Structured Errors in Rust Applications?

A Lean companion to Analysis I

CCD co-inventor George E. Smith dies at 95

Cerebras achieves 2,500T/s on Llama 4 Maverick (400B)

Comments

How I got a Root Shell on a Credit Card Terminal

M8.2 solar flare, Strong G4 geomagnetic storm watch

Atari Means Business with the Mega ST

Cinematography of "Andor"

Figma Slides Is a Beautiful Disaster

Why DeepSeek is cheap at scale but expensive to run locally

Ask HN: How Are Parents Who Program Teaching Their Kids Today?

A new generation of Tailscale access controls

When Fine-Tuning Makes Sense: A Developer's Guide

Dear diary, today the user asked me if I'm alive

Learning from the Amiga API/ABI

RenderFormer: Neural rendering of triangle meshes with global illumination

Progressive JSON

RSC for Lisp Developers

How I like to install NixOS (declaratively)

Google AI Edge – on-device cross-platform AI deployment

Father Ted Kilnettle Shrine Tape Dispenser

Structured Errors in Go (2022)

Browser extension (Firefox, Chrome, Opera, Edge) to redirect URLs based on regex

Show HN: Patio – Rent tools, learn DIY, reduce waste

A Beautiful Technique for Some XOR Related Problems

New adaptive optics shows details of our star's atmosphere

Ovld – Efficient and featureful multiple dispatch for Python

A Pokémon battle simulation engine

Show HN: A Implementation of Alpha Zero for Chess in MLX

Stepping Back

Roborock Saros Z70 Review: This Robovac's Robotic Arm Is a Swing and a Miss

Why Use Structured Errors in Rust Applications?

A Lean companion to Analysis I

CCD co-inventor George E. Smith dies at 95