Trinity large: An open 400B sparse MoE model

https://www.arcee.ai/blog/trinity-large

231•linolevan•1w ago

Comments

linolevan•1w ago

I'm particularly excited to see a "true base" model to do research off of (https://huggingface.co/arcee-ai/Trinity-Large-TrueBase).

hahahahhaah•1w ago

I'd love to "chat" with that model to see how it behaves

Grimblewald•1w ago

I highly recommend it. As a tip, you can quite easily get into a chat-like state by simply using in-context learning. Have a few turns of conversation pre-written and generate from that. It'll continue the conversation (for both parties), so you just stop it from generating when it starts generating on your behalf.

That said, it's useful for so much more beyond that. Outline the premise of a book, then "what follows is that book\n #Chapter 1:" and watch it rip. Base models are my preferred way of using LLMs by a long margin.

peepee1982•1w ago

I've done this out of curiosity with the base model of Llama 3.1 405B. I vibe coded a little chat harness with the system prompt being a few short conversations between "system" and "user", with "user:" being the stop word so I could enter my message. It worked surprisingly well and I didn't get any sycophancy or cliched AI responses.
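A minimal sketch of the kind of harness described above, assuming a plain text-completion function. `fake_complete` is a hypothetical stand-in for a real base-model call, and the few-shot turns are made up for illustration. The two ideas it shows are the ones from the comments: seed the prompt with pre-written turns, then cut the completion at the "user:" stop word.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)

# Hypothetical few-shot turns that put the base model into a chat-like state.
FEW_SHOT: List[Turn] = [
    ("user", "What's a good name for a cat?"),
    ("system", "Maybe Miso, or Biscuit if it's orange."),
    ("user", "How do I keep basil alive?"),
    ("system", "Lots of light, and water it when the soil feels dry."),
]


def build_prompt(history: List[Turn], user_message: str) -> str:
    """Render the transcript and leave an open 'system:' line for the model."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"user: {user_message}")
    lines.append("system:")
    return "\n".join(lines)


def truncate_at_stop(completion: str, stop: str = "user:") -> str:
    """Keep only the text before the model starts speaking on the user's behalf."""
    idx = completion.find(stop)
    if idx != -1:
        completion = completion[:idx]
    return completion.strip()


def chat_turn(complete: Callable[[str], str], history: List[Turn],
              user_message: str, stop: str = "user:") -> Tuple[str, List[Turn]]:
    """Run one turn and return the reply plus the extended history."""
    reply = truncate_at_stop(complete(build_prompt(history, user_message)), stop)
    return reply, history + [("user", user_message), ("system", reply)]


def fake_complete(prompt: str) -> str:
    # Stand-in for a base model: it happily continues both sides of the chat.
    return " Sure, here's one idea.\nuser: thanks\nsystem: you're welcome"


reply, history = chat_turn(fake_complete, FEW_SHOT, "Any ideas for dinner?")
print(reply)
```

Most completion APIs accept a stop sequence directly, in which case `truncate_at_stop` becomes a safety net rather than the main mechanism.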

mynti•1w ago

They trained it in 33 days for ~$20M (which apparently includes not only the infrastructure but also the salaries over a 6-month period). And the model is coming close to Qwen and DeepSeek. Pretty impressive.

zamadatix•1w ago

The price/scaling of training another same class model always seems to be dropping through the floor but training models which score much better seems to be hitting a brick wall.

E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.
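As a quick check of the 70% figure, here is the standard Elo expected-score formula applied to the two ratings quoted above. This assumes the leaderboard scores behave like classic Elo ratings on a 400-point scale, which is an assumption about the leaderboard and not something stated in the thread.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the comment above.
p = elo_expected_score(1488, 1346)
print(f"Expected win rate: {p:.1%}")  # roughly 69%, consistent with the ~70% claim
```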

The exception seems to be net new benchmarks/benchmark versions. These start out low and then either quickly get saturated or hit a similar wall after a while.

gwern•1w ago

> E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

Why do you care about LM Arena? It has so many problems, and the fact that no one would suggest using GPT-4o for doing math or coding right now, or much of anything, should tell you that a 'win rate of 70%' does not mean whatever it looks like it means. (Does GPT-4o solve roughly as many Erdos questions as gemini-3-pro...? Can you write roughly as good poetry?)

zamadatix•1w ago

It'd certainly be odd if people were recommending old LLMs which score worse, even if marginally. That said, 4o is really a lot more usable than you're making it out to be.

The particular benchmark in the example is fungible but you have to pick something to make a representative example. No matter which you pick someone always has a reason "oh, it's not THAT benchmark you should look at". The benchmarks from the charts in the post exhibit the same as described above.

If someone were making new LLMs which were consistently solving Erdos problems at rapidly increasing rates then they'd be showing how they do that rather than showing how they score the same or slightly better on benchmarks. Instead, the progress looks more like this: it has taken years, from when we were first surprised LLMs were writing poetry, to massage out an answer to one such problem, once. Maybe by the end of the year there will be a few. The progress has definitely become very linear and relatively flat compared to the period up to roughly the initial 4o release. I'm just hoping that's a temporary thing rather than a sign it'll get even flatter.

refulgentis•1w ago

Frankly, this reads as a lot of words that amount to an excuse for using only LMArena, and the rationale is quite clear: it’s for an unrelated argument that isn’t going to ring true to people, especially an audience of programmers who just spent the last year watching the AI go from being able to make coherent file edits to multi hour work.

LMArena is, de facto, a sycophancy and Markdown usage detector.

Two others you can trust, off the top of my head, are LiveBench.ai and Artificial Analysis. Or even Humanity’s Last Exam results. (Though, frankly, I’m a bit suspicious of them. Can’t put my finger on why. It was just a rather rapid hill climb for a private benchmark over the last year.)

FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.

zamadatix•1w ago

I've always found LiveBench a bit confusing to try to compare over time as the dataset isn't meant to be compared over time. It also currently claims GPT-5 Mini High from last summer is within ~15% of Claude 4.5 Opus Thinking High Effort in the average, but I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up (or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either). Artificial Analysis at least has the same at 20% from the top, so maybe that's the one we all agree to use for now since it implies faster growth.

> FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.

Certainly not, unless you're about to tell me I can pop into ChatGPT and pop out Erdos proofs regularly since #728 was massaged out with multiple prompts and external tooling a few weeks ago - which is what I was writing about. It was great, it was exciting, but it's exactly the slow growth I'm talking about.

I like using LLMs, I use them regularly, and I'm hoping they continue to get better for a long time... but this is in no way the GPT 3 -> 3.5 -> 4 era of mind-boggling growth of frontier models anymore. At best, people are finding out how to attach various tooling to the models to eke more out as the models themselves very slowly improve.

nl•1w ago

> I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up

Appstore releases were roughly linear until July 25 and are up 60% since then:

https://www.coatue.com/c/takes/chart-of-the-day-2026-01-22

refulgentis•1w ago

One of the best surgically executed nukes on HN in my 16 years here.

zamadatix•1w ago

I never claimed people don't make apps with AI. Of course they do - I can do that in a few clicks and some time with most any provider. You've been able to do that for a few years now, and that (linear) trend line starts over a year ago.

I can guarantee if you restricted yourself to just that 60% you wouldn't be responding to me doubting AI apps are already amazing things people are actually supposed to be so excited about using though.

refulgentis•1w ago

See peer reply re: yes, your self-chosen benchmark has been reached.

Generally, I've learned to warn myself off of a take when I start writing emotionally charged stuff like [1]. Without any prompting (who mentioned apps? and why would you without checking?), also, when reading minds, and assigning weak arguments, now and in my imagination of the future. [2]

At the very least, [2] is a signal to let the keyboard have a rest, and ideally my mind.

Bailey: > "If [there were] new LLMs...consistently solving Erdos problems at rapidly increasing rates then they'd be showing...that"

Motte: > "I can['t] pop into ChatGPT and pop out Erdos proofs regularly"

No less than Terence Tao, a month ago, pointing out your bailey was newly happening with the latest generation: https://mathstodon.xyz/@tao/115788262274999408. Not sure how you only saw one Erdos problem.

[1] "I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up"

[2] "...or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either"

zamadatix•1w ago

I'm going to stick to the stuff around Tao, as even well tempered discussion about the rest would be against the guidelines anyways.

I had a very different read of Tao's post last month. To me, he opens that there have been many claims of novel solutions which turn out to be known solutions from publications buried for years, but nothing about rapid increase in the rates or even claims mathematicians using LLMs are having most of the work done by them yet.

He speculates, and I assume correctly, that contamination is not the only reason. Indeed, we've seen at least 1 novel solution which couldn't have come solely from a low-interest publication being in the training data. How many of the 3 examples at the top end up actually falling that way is not really something anyone can know, but I agree it should be safe to assume the answer will not be 0, and even if it were, it would seem unreasonable to think it would stay that way. These solutions are coming out of systems of which the LLM is a part, very often with a mathematician still actually orchestrating.

None of these are just popping in a prompt and hoping for the answer, nor will you get an unknown solution from an LLM by going to ChatGPT 5.2 Pro and asking it without the rest of the story (and even then, you still will not get such a solution regularly, consistently, or at a massively higher rate than several months ago). They are multishot from experts with tools. Tao makes a very balanced note of this in reply to his main message:

> The nature of these contributions is rather nuanced; individually and collectively, they do not meet the hyped up goal of AI autonomously solving major mathematical open problems, but they also cannot all be dismissed as inconsequential trickery.

It's exciting, and helpful, but it's slow and he doesn't even think we're truly actually at "AI solves some Erdos problems" yet, let alone "AI solves Erdos problems regularly and at a rapidly increasing rate".

refulgentis•1w ago

"...as even well tempered discussion about the rest would be against the guidelines anyways."

Didn't bother reading after that. I deeply respect you have the self-awareness to notice and spare us, that's rare. But it also means we all have to have conversations purely on your terms, and because it's async, the rules constantly change post-hoc.

And that's on top of the post-hoc motte / bailey instances, of which we have multiple. I was stunned (stunned!!) by the attempted retcon of the app claim once there were numbers.

Anyways, all your bêtes noires aside, all your Red Team vs. Blue Team signalling aside, using LMArena alone as a benchmark is a bad idea.

zamadatix•6d ago

The conversation is certainly not on "my terms" as I didn't write the guidelines (nor do they benefit me more than anyone else). If you are genuinely concerned with the conversation, please flag it and/or email hn@ycombinator.com and they will (genuinely) handle it appropriately. Otherwise there is not much else which can be said around this here.

If not, continuing to have a conversation can only happen if we want to discuss the recent growth rate of AI and take the time to read what each other write. Similarly, async conversation can be as clear and consistent as we want it to be - we just have to take the time to ask for clarification before writing a response on something we feel could be a movable understanding. Nothing is meant to be unclear as a "gotcha" and I'll always be glad to clarify before moving on.

I also agree nobody should rely solely on LM Arena for benchmarks, which is not what starting a conversation by using it in an example was meant to imply we need to do. I'd love to continue chatting more about other benchmarks and how you see Tao's comments, as you seem to have walked away from reading them with a very different understanding than I did.

nl•1w ago

Progress has not become linear. We've just hit the limits of what we can measure and explain easily.

One year ago coding agents could barely do decent auto-complete.

Now they can write whole applications.

That's much more difficult to show than an Elo score based on how much people like emojis and bold text in their chat responses.

Don't forget Llama4 led Lmarena and turned out to be very weak.

dajonker•1w ago

You are equally understating past performance as you are overstating current performance.

One year ago I already ran qwen2.5-coder 7B locally for pretty decent autocomplete. And I still use it today as I haven't found anything better, having tried plenty of alternatives.

Today I let LLM agents write probably 60-80% of the code, but I frequently have to steer and correct it and that final 20% still takes 80% of the time.

anon373839•1w ago

Much of these gains can be attributed to better tooling and harnesses around the models. Yes, the models also had to be retrained to work with the new tooling, but that doesn’t mean there was a step change in their general “intelligence” or capabilities. And sure enough, I’m seeing the same old flaws as always: frontier models fabricating info not present in the context, having blindness to what is present, getting into loops, failing to follow simple instructions…

nl•1w ago

> Much of these gains can be attributed to better tooling and harnesses around the models.

This isn't the case.

Take Claude Code and use it with Haiku, Sonnet and Opus. There's a huge difference in the capabilities of the models.

> And sure enough, I’m seeing the same old flaws as always: frontier models fabricating info not present in the context, having blindness to what is present, getting into loops, failing to follow simple instructions…

I don't know what frontier models you are using but Opus and Codex 5.2 don't ever do these things for me.

DoctorOetker•1w ago

It's very sad there is so much gaming of metrics with LLMs.

If we wish to avoid everyone creating benchmarks for themselves, then instead of predetermined benchmarks (public ones allow gaming, while publicly scored private ones require blind trust) we could use gradient descent on sentences to find disagreements between models, and then present them to human domain experts.

At least it could be public without the possibility of leaking (since the model creators don't yet know, of all possible disagreements between LLMs, which ones will be selected for review by human experts).

YetAnotherNick•1w ago

> The exception seems to be net new benchmarks/benchmark versions.

How is this an exception? If a genius and a kindergarten student take a test on adding two single-digit numbers, how is that result relevant at all? Even though adding single-digit numbers is in the class of possible tests.

We can only look at non-saturated tests.

Zababa•1w ago

>E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

I think in that specific case that says more about LMArena than about the newer models. Remember that GPT 4o was so specifically loved by people that when GPT 5 replaced it there was lots of backlash against OpenAI.

One of the popular benchmarks right now is METR, which shows some real improvement with newer models, like Opus 4.5. Another way of getting data is anecdotes: lots of people are really impressed with Opus 4.5 and Codex 5.2 (but those are hard to disentangle from people getting better with those tools, the scaffolding (Claude Code, Codex) getting better, and lots of other stuff). SWE-bench is still not saturated (less than 75% I think).

lumost•1w ago

It’s becoming clear that training a frontier model is a capex/infra problem. This problem involves data acquisition, compute, and salaries for the researchers familiar with the little nuances of training at this scale.

For the same class model, you can train on more or less the same commodity datasets. Over time these datasets become more efficient to train on as errata are removed and the data is cleaner. The cost of dataset acquisition can be amortized and sometimes drops to 0 as the dataset is open sourced.

Frontier models mean acquiring fresh datasets at unknown costs.

esskay•1w ago

Training costs might be coming down, but the cost of hardware that can run these models is still obscenely high and rising. We're still nowhere near a point where it's realistically feasible to run a home LLM that doesn't feel like it's suffering from severe brain damage.

jychang•1w ago

They didn't do something stupid like Llama 4 "one active expert", but 4 of 256 is very sparse. It's not going to get close to Deepseek or GLM level performance unless they trained on the benchmarks.

I don't think that was a good move. No other models do this.

Der_Einzige•1w ago

I'll straight up accuse them of on purpose muddying the waters. To get to the point of executing a successful training run like that, you have to count every failed experiment and experiment that gets you to the final training run. They spent well over 100 Million to train this model by that definition, and all definitions which don't include the failed runs up to the successful one at the end are at best disingenuous and at worst outright lies designed to trick investors into dumping Nvidia.

No, deepseek did not spend only 5.5 million for Deepseek V3. No Gemini was not "entirely trained on TPUs". They did hundreds of experiments on GPUs to get to the final training run done entirely on TPUs. GCP literally has millions of GPUs and you bet your ass that the gemini team has access to them and uses them daily. Deepseek total cost to make Deepseek V3 is also in the 100-400 million range when you count all of what's needed to get to the final training run.

Edit: (Can't post cus this site's "posting too fast" thing is really stupid/bad)

The only way I can get reliable information out of folks like you is to loudly proclaim something wrong on the internet. I'm just going to even more aggressively do that from now on to goad people like you to set the record straight.

Even if they only used TPUs, they sure as shit spent orders of magnitude more than they claim due to "count the failed runs too"

querez•1w ago

> No Gemini was not "entirely trained on TPUs". They did hundreds of experiments on GPUs to get to the final training run done entirely on TPUs. GCP literally has millions of GPUs and you bet your ass that the gemini team has access to them and uses them daily.

You are wrong. Gemini was definitely trained entirely on TPUs. Of course, your point that "you need to count failed experiments, too" is correct. But you seem to have misconceptions about how DeepMind operates and what infra it possesses. DeepMind (like basically all of Google's internal stuff) runs on Borg, an internal cloud system, which is completely separate (and different) from GCP. DeepMind does not have access to any meaningful GCP resources. And Borg barely has any GPUs. At the time I left DeepMind, the amount of TPU compute available was probably 1000x to 10000x larger than the amount of GPU compute. You would never even think of seriously using GPUs for neural net training: it's too limited (in terms of available compute) and expensive (in terms of internal resource allocation units), and frankly less well supported by internal tooling than TPUs. Even for small, short experiments, you would always use TPUs.

hansvm•1w ago

On less blessed teams, we used GPUs when we were allowed, else CPUs. TPUs were basically banned in YT since they were reserved for higher priority purposes. Gemini was almost certainly trained with them, but I guarantee an ungodly amount of compute has gone into training neural nets with CPUs and GPUs.

YetAnotherNick•1w ago

Using TPU has the same opportunity cost as GPU. Just because they built something doesn't mean it's cheaper. If it is they can rent it cheaper to save money on paying billions of dollars to Nvidia.

A big segment of the market just uses GPU/TPU to train LLMs, so they don't exactly need flexibility if some tool is well supported.

querez•1w ago

I assume TPU TCO is significantly cheaper than GPU TCO. At the same time, I also assume that market demand for GPUs is higher than TPUs (external tooling is just more suited to GPU -- e.g. I'm not sure what the Pytorch-on-TPU story is these days, but I'd be astounded if it's on par with their GPU support). So moving all your internal teams to TPUs means that all the GPUs can be allocated to GCP.

YetAnotherNick•1w ago

That just doesn't make sense. If you make significantly more money renting TPUs, why not rent them out cheaper to shift the customers (and save the billions that you are giving to Nvidia)? TPUs right now aren't significantly cheaper for external customers.

Again I am talking about LLM training/inference which if I were to guess is more than half of the workload currently for which the switching cost is close to 0.

Zababa•1w ago

>To get to the point of executing a successful training run like that, you have to count every failed experiment and experiment that gets you to the final training run.

I get the sentiment, but then, do you count all the other experiments that were done by that company before specifically trying to train this model? All the experiments done by people in that company at other companies? Since they rely on that experience to train models.

You could say "count everything that has been done since the last model release", but then for the same amount of effort/GPU, if you release 3 models does that divide each model cost by 3?

Genuinely curious about how you think about this. I think saying "the model cost is the final training run" is fine, as it seems to have been standard ever since DeepSeek V3, but I'd be interested if you have alternatives. Possibly "actually don't even talk about model cost as it will always be misleading and you can never really spend the same amount of money to get the same model"?

maziyar•1w ago

i think it's very flattering to have done something with $20m that is so good people think it must have been a $100m!

iberator•1w ago

Why even do such a thing if there is free Google, ChatGPT and a dozen more models? It's a waste of money towards the ultimate goal: global loss of jobs and destroying the earth.

tgrowazay•1w ago

> 2048 Nvidia B300 GPU

With average price of $6/hour that is $12,288/hour for whole cluster.

Times 33 days times 24 hours it comes out to be $9.7MM , assuming no discounts.

That leaves $10.3MM/6 months for salaries, which is 103 employees at $200k/year or 51 employees at $400k/year.
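The arithmetic above can be reproduced in a few lines. The GPU count, hourly rate, and run length come from the comment, and the $20M total is the figure quoted earlier in the thread. The hourly rate is a parameter so other assumed prices can be plugged in.

```python
GPUS = 2048            # cluster size quoted above
RUN_HOURS = 33 * 24    # 33-day training run
TOTAL_BUDGET = 20_000_000  # ~$20M figure quoted earlier in the thread


def cluster_cost(rate_per_gpu_hour: float) -> float:
    """Total rental cost of the cluster for the whole run, with no discounts."""
    return GPUS * rate_per_gpu_hour * RUN_HOURS


compute = cluster_cost(6.0)
remaining = TOTAL_BUDGET - compute   # what is left for 6 months of salaries
annualized = remaining * 2           # scale the 6-month budget to a yearly rate

print(f"Cluster: ${GPUS * 6.0:,.0f}/hour, ${compute:,.0f} for the run")
print(f"Left for salaries: ${remaining:,.0f} over 6 months")
print(f"Headcount at $200k/year: {annualized / 200_000:.1f}")
print(f"Headcount at $400k/year: {annualized / 400_000:.1f}")
```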

YetAnotherNick•1w ago

It would likely be something like $4.5/hour for a cluster this big [1].

[1]: https://verda.com/products#B300

zamadatix•1w ago

It mentions it took 4 models to get there, so would that mean there were additional runs (and other steps/overheads) which were part of the cost separate from just the salaries in that time?

esafak•1w ago

I tried it a bit yesterday and it was pretty dumb: it failed to understand the order of jobs in a Github Action; i.e., a DAG. And that concluded my testing.

observationist•1w ago

This is a wonderful release.

frogperson•1w ago

What exactly does "open" mean in this case? Is it weights and data or just weights?

someotherperson•1w ago

It's always open weights.

jetpackjoe•1w ago

It's never open data

jacquesm•1w ago

Well, it is, it's your data to begin with after all but admitting that would create some problems.

linolevan•1w ago

This model is sort of interesting since it seems to be using a lot of synthetic training data – but your point stands

cyanydeez•1w ago

So it's a rip off of a rip off, is that what's interesting?

freakynit•1w ago

reminds me of this recent news https://www.medianama.com/2026/01/223-nvidia-high-speed-acce...

tucnak•1w ago

unless you're Ai2

mwcampbell•1w ago

Given that it's a 400B-parameter model, but it's a sparse MoE model with 13B active parameters per token, would it run well on an NVIDIA DGX Spark with 128 GB of unified RAM, or do you practically need to hold the full model in RAM even with sparse MoE?

timschmidt•1w ago

Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token. So RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bandwidth bottleneck gets pushed to the SSD interface. At least it's read-only, and not read-write, but even the fastest of SSDs will be significantly slower than RAM.
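To make the bandwidth argument concrete, here is a rough upper bound on single-stream decoding speed when the active weights have to be streamed over a slow link for each token. The 13B active-parameter figure comes from the question above; the 4-bit quantization and both bandwidth numbers are illustrative assumptions, not measurements, and the model ignores compute time and any weights (attention, shared layers) that could stay resident.

```python
def tokens_per_second(active_params: float, bits_per_weight: float,
                      bandwidth_gb_s: float, cache_hit_rate: float = 0.0) -> float:
    """Upper bound on tokens/s when active weights must cross a link each token.

    cache_hit_rate is the fraction of the needed weights already resident
    on the fast side of the link (0.0 means everything is streamed).
    """
    bytes_per_token = active_params * bits_per_weight / 8
    bytes_streamed = bytes_per_token * (1.0 - cache_hit_rate)
    if bytes_streamed == 0:
        return float("inf")
    return bandwidth_gb_s * 1e9 / bytes_streamed


ACTIVE_PARAMS = 13e9  # active parameters per token, as quoted in the thread

# Hypothetical round-number bandwidths for illustration only.
for name, gb_s in [("RAM -> VRAM link", 32.0), ("NVMe SSD", 7.0)]:
    tps = tokens_per_second(ACTIVE_PARAMS, bits_per_weight=4, bandwidth_gb_s=gb_s)
    print(f"{name} at {gb_s} GB/s: at most {tps:.1f} tokens/s")
```

The `cache_hit_rate` parameter is there because the real number depends heavily on how often consecutive tokens reuse experts that are already loaded.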

That said, there are folks out there doing it. https://github.com/lyogavin/airllm is one example.

nick49488171•1w ago

With a non-sequential generative approach, perhaps the RAM cache misses could be grouped together and swapped on a when-available/when-needed prioritized basis.

radarsat1•1w ago

> Experts can be swapped in and out of VRAM for each token.

I've often wondered how much it happens in practice. What does the per-token distribution of expert selection actually look like during inference? For example does it act like uniform random variable, or does it stick with the same 2 or 3 experts for 10 tokens in a row? I haven't been able to find much info on this.

Obviously it depends on what model you are talking about, so some kind of survey would be interesting. I'm sure this must be something that the big inference labs are knowledgeable about.

Although, I guess if you are batching things, then even if a subset of experts is selected for a single query, maybe over the batch it appears completely random, that would destroy any efficiency gains. Perhaps it's possible to intelligently batch queries that are "similar" somehow? It's quite an interesting research problem when you think about it.

Come to think of it, how does it work then for the "prompt ingestion" stage, where it likely runs all experts in parallel to generate the KV cache? I guess that would destroy any efficiency gains due to MoE too, so the prompt ingestion and AR generation stages will have quite different execution profiles.

yorwba•1w ago

The model is explicitly trained to produce as uniform a distribution as possible, because it's designed for batched inference with a batch size much larger than the expert count, so that all experts are constantly activated and latency is determined by the highest-loaded expert, so you want to distribute the load evenly to maximize utilization.

Prompt ingestion is still fairly similar to that setting, so you can first compute the expert routing for all tokens, load the first set of expert weights and process only those tokens that selected the first expert, then load the second expert and so on.
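A toy sketch of the prompt-ingestion scheme described above: route every token first, then visit experts one at a time so each expert's weights are loaded once for all of the tokens that selected it. The router scores here are uniform random numbers and `run_expert` is a placeholder, so this illustrates only the scheduling, not a real MoE layer. The 256-expert, top-4 configuration is the one discussed in this thread.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def route_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def group_tokens_by_expert(all_scores: List[List[float]], k: int) -> Dict[int, List[int]]:
    """Map each expert to the list of token indices routed to it."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for token_idx, scores in enumerate(all_scores):
        for expert in route_top_k(scores, k):
            groups[expert].append(token_idx)
    return groups


def process_prompt(all_scores: List[List[float]], k: int,
                   run_expert: Callable[[int, int], object]) -> Tuple[dict, int]:
    """Process tokens expert by expert, counting one weight load per expert."""
    groups = group_tokens_by_expert(all_scores, k)
    outputs = defaultdict(list)
    loads = 0
    for expert in sorted(groups):
        loads += 1  # this expert's weights are loaded once for its whole group
        for token_idx in groups[expert]:
            outputs[token_idx].append(run_expert(expert, token_idx))
    return outputs, loads


NUM_EXPERTS, TOP_K, NUM_TOKENS = 256, 4, 1000
rng = random.Random(0)
scores = [[rng.random() for _ in range(NUM_EXPERTS)] for _ in range(NUM_TOKENS)]

outputs, loads = process_prompt(scores, TOP_K, run_expert=lambda e, t: (e, t))
print(f"expert loads: {loads} (vs {NUM_TOKENS * TOP_K} if loaded per token)")
```

The same grouping is why a well-balanced router matters for batched serving: the slowest step is bounded by the largest group.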

But if you want to optimize for single-stream token generation, you need a completely different model design. E.g. PowerInfer's SmallThinker moved expert routing to a previous layer, so that the expert weights can be prefetched asynchronously while another layer is still executing: https://arxiv.org/abs/2507.20984

radarsat1•1w ago

Thanks, really interesting to think about these trade-offs.

Gracana•1w ago

I thought paging was so inefficient that it wasn't worth doing vs using CPU inference for the parts of the model that are in system memory. Maybe if you have a good GPU and a turtle of a CPU, but still somehow have the memory bandwidth to make shuffling data in and out of the GPU worthwhile? I'm curious to know who is doing this and why.

antirez•1w ago

It can run with mmap(), but it is slower. With 4-bit quantization there is a decent ratio between the model size and the RAM, so with a fast SSD one could try it and see how it works. However, with a 4-bit quantized model there is often the doubt of whether it is any better than an 8-bit quantized model of 200B parameters; it depends on the model, on the use case, ... Unfortunately the road to local inference of SOTA models is being blocked by RAM prices and the companies' demand for GPUs, leaving us with little. Probably the best bet today is to buy Mac Studio systems and run distributed inference (MLX supports this, for instance), or a 512 GB Mac Studio M4 that costs about $13k.
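For readers unfamiliar with the mmap() approach mentioned above, here is a small standard-library illustration of the idea: the file is mapped into the address space and only the pages that are actually touched get read from disk. The tiny float file is a stand-in for a weights file; this is not how any particular inference engine lays out its data.

```python
import mmap
import os
import struct
import tempfile

NUM_WEIGHTS = 1024

# Write a small stand-in "weights" file of little-endian float32 values.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"<{NUM_WEIGHTS}f", *(float(i) for i in range(NUM_WEIGHTS))))
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only; nothing is copied into memory up front.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Touch only four weights starting at index 100; the OS pages in
        # just the region of the file that backs this slice.
        chunk = struct.unpack_from("<4f", mm, 100 * 4)
    finally:
        mm.close()

os.remove(path)
print(chunk)
```

When the mapped file is larger than RAM, pages are evicted and re-read from the SSD as needed, which is where the slowdown comes from.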

notpublic•1w ago

Talking about RAM prices, you can still get a framework Max+ 395 with 128GB RAM for ~$2,459 USD. They have not increased the price for it yet.

https://frame.work/products/desktop-diy-amd-aimax300/configu...

Scipio_Afri•1w ago

Pretty sure those used to be $1999 ... but not entirely sure

notpublic•1w ago

Yep. You be right. Looks like they increased it earlier this month. Bummer!

vardump•1w ago

I think 512 GB Mac Studio was M3 Ultra.

Anyways, isn't a new Mac Studio due in a few months? It should be significantly faster as well.

I just hope RAM prices don't ruin this...

jychang•1w ago

No.

128GB vram gets you enough space for 256B sized models. But 400B is too big for the DGX Spark, unless you connect 2 of them together and use tensor parallel.
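The sizing claim can be checked with a back-of-the-envelope calculation. The 4-bit quantization is an assumption about how the "128GB for 256B" figure was derived, and the estimate counts weights only, ignoring KV cache, activations, and OS overhead.

```python
import math


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def devices_needed(params: float, bits_per_weight: float, device_gb: float) -> int:
    """Minimum number of devices whose combined memory holds the weights."""
    return math.ceil(weight_gb(params, bits_per_weight) / device_gb)


SPARK_GB = 128  # unified memory of one DGX Spark, as quoted in the thread

for params in (256e9, 400e9):
    gb = weight_gb(params, bits_per_weight=4)
    n = devices_needed(params, 4, SPARK_GB)
    print(f"{params / 1e9:.0f}B at 4-bit: {gb:.0f} GB of weights, {n} device(s)")
```

In practice the usable budget is lower than the headline memory size, so a model that exactly fills it on paper would still not fit.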

greggh•1w ago

The only thing I question is the use of Maverick in their comparison charts. That's like comparing a pile of rocks to an LLM.

eldenring•1w ago

There aren't too many base models out there to compare against.

jychang•1w ago

It's because they're doing 4 of 256 sparsity, which was a bad decision caused by financial limitations.

Training cost (FLOPs) = 6 * active params * total tokens. By keeping the MoE experts param count low, it reduces total training costs.

I don't think this was a good move. They should have just trained way past Chinchilla like the other major labs, and kept sparsity above 2%. Even Kimi K2 is above 2%. GLM is at 5%, which makes it very expensive (and high performing) for its small size.

Arcee went the other way. They trained a massive 400b model (bigger than GLM-4.5/4.6/4.7, bigger than Qwen3 235b A23b), but only have 17b active params, which is smaller than Qwen and GLM. It's also only trained on 17T tokens, vs 20-30T+ tokens for the other models. It's just undertrained and undersized (in terms of active parameters), and they got much worse performance than those models:

https://45777467.fs1.hubspotusercontent-na1.net/hubfs/457774...

It's not a bad showing considering the limitations they were working with, but yeah they definitely need double the active experts (8 out of 256 instead of 4 out of 256) to be competitive. That would roughly double the compute cost for them, though.
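A short sketch of the numbers behind this argument, using the 6 * active params * tokens rule of thumb from the comment. The thread quotes both 13B and 17B for the active parameter count, so both are shown; the 17T token count and the 4-of-256 expert configuration are taken from the comments as well.

```python
import math


def training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: 6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


def expert_ratio(active_experts: int, total_experts: int) -> float:
    """Fraction of routed experts that are active for each token."""
    return active_experts / total_experts


TOKENS = 17e12  # training tokens quoted in the thread

flops_13b = training_flops(13e9, TOKENS)  # 13B active, as quoted earlier in the thread
flops_17b = training_flops(17e9, TOKENS)  # 17B active, as quoted in this comment

print(f"13B active: {flops_13b:.3e} FLOPs")
print(f"17B active: {flops_17b:.3e} FLOPs")
print(f"4 of 256 experts active: {expert_ratio(4, 256):.2%}")
print(f"8 of 256 experts active: {expert_ratio(8, 256):.2%}")
```

Doubling the active experts doubles the expert share of the active parameters, so total compute roughly doubles only to the extent that experts dominate the active parameter count.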

Their market strategy right now is to have less active params so it's cheaper for inference, more total params so it's smarter for the amount of active params they have, but not too big to fit into a H200 cluster. I... guess this is a valid niche strategy? The target audience is basically "people who don't need all the intelligence of GLM/Qwen/Deepseek, but want to serve more customers on the H200 cluster they already have sitting around". It's a valid niche, but a pretty small one.

Alifatisk•1w ago

What did they do to make the loss drop so much in phase 3?

Also, why are they comparing with Llama 4 Maverick? Wasn’t it a flop?

QuadmasterXLII•1w ago

you can’t directly compare losses because they changed the data distribution for each phase (I think; it's 100% guaranteed they change the data distribution after the 10 trillion token mark, since that’s when they start adding in instruction following data, but I don’t know for sure if the other phase changes also include data distribution changes.)

observationist•1w ago

> During development of the RSDB, we noted significant enough performance gains from it that we decided to integrate it during phase 3 of the Trinity Large training run instead of waiting for a later training run. While the data distributions between phase 2 and phase 3 make direct comparison difficult, the overall effect was notable: BatchHet reduced by a factor of 4.23x, and step-to-step variance reduced by a factor of 2.4x (see Figure 1), a significant improvement when compared to the default packing strategy. We note that training runs without the RSDB exhibit much higher values in the higher-order moments of the running loss distribution, which we believe to correlate with network instability during training.

Page 9 of the technical report has more details, but it looks like they found some data prep methods as well as some other optimizations that overall worked out really well. I don't think it was any one particular thing.

As far as Llama 4 goes, it was only referenced as a similarly sized model; they called it one of their model "peers". I don't think they intended any sort of quality comparison. Llama 4 was notable for sparsity, and despite its poor performance and reception, some of the things they achieved technically were solid, useful research.
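
For anyone who wants to look at their own training logs the same way, here is a rough sketch of the kind of statistics the quote mentions: step-to-step variance and higher-order moments (skewness, excess kurtosis) of a loss series. The report's exact definitions aren't reproduced in this thread, so the formulas below are one plausible reading and not their method. `loss_stability_stats` is a hypothetical helper, and the two loss curves are synthetic.

```python
import numpy as np

def loss_stability_stats(losses):
    """Summarize how jumpy a loss series is (assumed definitions, see lead-in)."""
    losses = np.asarray(losses, dtype=float)
    steps = np.diff(losses)              # change in loss from one step to the next
    centered = losses - losses.mean()
    std = losses.std()
    return {
        "step_variance": steps.var(),
        "skewness": (centered ** 3).mean() / std ** 3,
        "excess_kurtosis": (centered ** 4).mean() / std ** 4 - 3.0,
    }

rng = np.random.default_rng(0)

# Synthetic "stable" run: flat loss with small Gaussian noise.
stable = 2.0 + 0.01 * rng.normal(size=2000)

# Synthetic "unstable" run: same noise plus a loss spike every 100 steps.
spiky = stable.copy()
spiky[::100] += 0.5

stable_stats = loss_stability_stats(stable)
spiky_stats = loss_stability_stats(spiky)
print(stable_stats)
print(spiky_stats)
```

On a real run the loss trends downward, so you would want to compute these over a sliding window or on the residual from a smoothed curve. Otherwise the trend itself dominates the moments.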

bartowski•1w ago

comparing to Maverick is probably largely around comparing to the only other north american model that comes close to its size

considering this is a preview of the instruct and it's within spitting distance of maverick, it's likely to showcase "look what we can do with limited funds, imagine what we can do with more"

syntaxing•1w ago

So refreshing to see open source models like this come from the US. I would love a 100B-ish one that can compete against OSS-120B and GLM air 4.5.

0xdeadbeefbabe•1w ago

Is anyone excited to do ablative testing on it?

manbitesdog•1w ago

With such a high throughput because of sparsity, I'm particularly interested in distilling it into other architectures. I'd like to try a recurrent transformer when I have the time.

fuddle•1w ago

> We optimize for performance per parameter and release weights under Apache-2.0

How do they plan to monetize?

lambda•1w ago

I'm guessing by selling fine-tuning, consulting on hosting, and other services? They also seem to be offering their own inference service with their model, obviously as an open weight model that will be commoditized but I'm sure there are some people who'd prefer to buy from the originating lab. But yeah, when you're offering open weights models, your customers are going to be people who want to self-host, fine tune, etc, so they might be offering services for that.

tcdent•1w ago

It's super exciting to see another American lab get in the ring. Even if they're not at SOTA on the first release, the fact that they're trying is incredible for open source AI.

khimaros•1w ago

unsloth quants are up https://huggingface.co/unsloth/Trinity-Large-Preview-GGUF

LoganDark•1w ago

According to the article, nearly 50% of the dataset is synthetic (8T out of 17T tokens). I don't know what constitutes "a breadth of state-of-the-art rephrasing approaches", but I lack some confidence in models trained on LLM output, so I hope it wasn't that.

NitpickLawyer•1w ago

> but I lack some confidence in models trained on LLM output, so I hope it wasn't that.

That's misguided. Models have been trained on synthetic data for ~2+ years already. The "model collapse" myth is based on a very poor paper that got waaaay more attention than it deserved (because negativity sells, I guess). In practice every lab out there is doing this, because it works.

LoganDark•1w ago

When ChatGPT first released and jailbreaks were pretty easy, I was able to easily get some extremely good/detailed output from it, with very few errors or weirdness. Now even when I can get jailbreaks to work with their newer models, it's just not the same, and no open-source model or even commercial model has seemed to come close to the quality of that very first release. They're all just weird, dumb, random or incoherent. I keep trying even the very large open-source or open-weights models, and new versions of OpenAI's models and Claude and Gemini and so on, but it just all sucks. It all feels like slop!

I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again. Every model feels so artificial and synthetic. I do not know for sure why this is, but I bet it has something to do with people thinking it's possible to programmatically generate almost half the dataset?! I feel like OpenAI's moat could have been the quality and authenticity of their dataset, since they scraped practically most of the internet before LLMs became widespread, but even they've probably lost it by now.

I haven't really internalized anything about "model collapse", other than that if you train an LLM on outputs from other LLMs, you will be training to emulate an imprecise version of an imprecise version of writing, which will be measurably and perceptibly worse than merely one layer of imprecise version of actual writing.

wuschel•1w ago

> I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again.

Interesting statement. But wouldn’t that mean that Google is in an even better position in regard to primary, or at least pristine data?

kristianp•1w ago

There's a free preview on openrouter: https://openrouter.ai/arcee-ai/trinity-large-preview:free

trilogic•1w ago

Testing it now in HugstonOne. Running smooth at 5.8 T/S : Loaded Trinity-Large-Preview-UD-Q4_K_XL-00001-of-00005.gguf.

The T/S speed is acceptable, also stable 60 degrees Celsius for the GPU temperature. Accuracy and precision in math problems. So far so good. Results: https://www.reddit.com/r/Hugston/comments/1qq9d5i/testing_tr...
