TransMLA: Multi-head latent attention is all you need

67•ocean_moist•6h ago

Comments

olq_plo•3h ago

Very cool idea. Can't wait for converted models on HF.

kavalg•3h ago

My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to further fine-tune it. It should make inference faster.

freeqaz•2h ago

Also makes models smarter ("expressive")

yorwba•2h ago

It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.
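
To make the "equivalent model, same cache size" point concrete, here is a minimal NumPy sketch. It is not the paper's code, only an illustration of the general observation that repeating key heads in GQA is a linear map, so it can be written as an up-projection from the cached keys. All sizes and names (`W_k`, `W_up`, `k_cache`) are made up for the example, and RoPE and the value path are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
seq_len, d_model = 5, 32
n_q_heads, n_kv_heads, d_head = 8, 2, 4
group = n_q_heads // n_kv_heads  # query heads that share one key head

X = rng.standard_normal((seq_len, d_model))
W_k = rng.standard_normal((d_model, n_kv_heads * d_head))  # GQA key projection

# GQA: cache the small keys, then repeat each key head for its query group.
k_cache = X @ W_k  # (seq_len, n_kv_heads * d_head), this is what gets cached
k_gqa = np.repeat(k_cache.reshape(seq_len, n_kv_heads, d_head), group, axis=1)
k_gqa = k_gqa.reshape(seq_len, n_q_heads * d_head)

# Latent view: treat the same cache as the latent and expand it with a matrix.
# Initialised as a 0/1 replication matrix, it reproduces the GQA keys.
W_up = np.kron(np.repeat(np.eye(n_kv_heads), group, axis=1), np.eye(d_head))
k_mla = k_cache @ W_up

print(np.allclose(k_gqa, k_mla))  # the two key tensors match
print(k_cache.shape[1])           # cached floats per token, same in both views
print(W_up.size)                  # entries in the up-projection
```

At initialisation `W_up` only copies heads, so nothing changes. Once it is allowed to train, each query head can get its own keys from the same cached latent, which is where the extra parameters and the "larger effective key/value vectors" come from.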

kristel100•1h ago

Still wrapping my head around this architecture, but the idea of reducing headcount while maintaining performance is compelling. Would love to see a benchmark against something like FlashAttention.

octocop•1h ago

These titles need to stop. We've seen that it is, in fact, not all you need.

tankenmate•1h ago

The title of this paper is a reference to a previous paper titled "Attention Is All You Need"[0][1]. This seminal work described the transformer model that is the basis for almost all LLMs, and is almost certainly the most cited paper on AI even though it was only published in 2017.

[0] https://arxiv.org/abs/1706.03762

[1] https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

kristopolous•27m ago

Right, it's an 8-year-old reference that's been made hundreds of times.

People seem to love going to the references graveyard, digging up tired and dead ones, and dragging them around town hoping everyone thinks they're clever.

Also this was from 3 months ago.

seeknotfind•36m ago

All you need titles stopping is all you need.

EGreg•22m ago

We need more than that, and all you need to stop saying that!!

Etheryte•6m ago

All you need is love, and for these titles to stop. (But they won't do that.)

wiz21c•42m ago

Not quite related, but are the Mamba models gaining ground?

Answering my own question: https://www.reddit.com/r/MachineLearning/comments/1hpg91o/d_...

EGreg•21m ago

All you need is to stop posting titles like that!
