Mappa – Fine-tune ANY multi-agent LLM system end-to-end with AI coaches

3•junyuren•1h ago

Blog: https://ltjed.github.io/MAPPA/

Paper: https://arxiv.org/abs/2601.23228

Code: https://github.com/ltjed/multiagent-coaching

Twitter: https://x.com/t_ed_li/status/2019114121250370021

Comments

junyuren•1h ago

Author here. Happy to answer questions.

The problem: when you have multiple LLM agents working together and something fails, which agent is responsible? Traditional RL gives you one reward at the end, so all agents share the blame equally.
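To make the credit-assignment problem concrete, here is a minimal Python sketch of a single end-of-episode reward being broadcast to every step. The names (`Step`, `broadcast_terminal_reward`) are hypothetical illustrations, not taken from the MAPPA code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    agent: str   # which agent acted
    action: str  # free-text description of what it did


def broadcast_terminal_reward(
    trajectory: List[Step], terminal_reward: float
) -> List[Tuple[str, float]]:
    """Give every step the same end-of-episode reward (no per-agent credit)."""
    return [(step.agent, terminal_reward) for step in trajectory]


# Hypothetical failed episode: agent_1 made the mistake, agent_3 hit the crash.
trajectory = [
    Step("agent_1", "run analysis but forget to save the output file"),
    Step("agent_2", "plan the next stage"),
    Step("agent_3", "try to load the file and crash"),
]

rewards = broadcast_terminal_reward(trajectory, terminal_reward=0.0)
for agent, reward in rewards:
    print(agent, reward)
```

Every step receives the same number, so nothing in the signal distinguishes the agent that caused the failure from the ones that merely ran into it.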

Our approach: an external LLM (we used Gemini) watches each agent's actions and tool outputs, then assigns per-action scores. When agent 3 crashes because agent 1 forgot to save a file, the coach traces back through the tool outputs and blames agent 1, not agent 3.

This gives you dense training signal without needing ground truth labels. The coach provides the supervision.
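The per-action scoring idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_coach` is a hand-written rule standing in for the external LLM coach, and the `Step` fields, the string markers in the tool outputs, and the 0-to-1 score range are all hypothetical, not the interface from the MAPPA repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    agent: str
    action: str
    tool_output: str


# A coach maps (earlier steps, current step) to a score.
# In a real setup this would be an LLM API call; here it is just a callable.
Coach = Callable[[List[Step], Step], float]


def toy_coach(history: List[Step], step: Step) -> float:
    """Hypothetical stand-in for an LLM coach: a fixed rule, not a model."""
    if "file not saved" in step.tool_output:
        return 0.0  # root cause of the later failure
    if "FileNotFoundError" in step.tool_output:
        # Downstream crash: look back through earlier tool outputs
        # to see whether an upstream step caused it.
        upstream_fault = any("file not saved" in s.tool_output for s in history)
        return 0.5 if upstream_fault else 0.0
    return 1.0


def score_trajectory(trajectory: List[Step], coach: Coach) -> List[Tuple[str, float]]:
    """Assign one score per action, giving the coach the history so far."""
    return [
        (step.agent, coach(trajectory[:i], step))
        for i, step in enumerate(trajectory)
    ]


trajectory = [
    Step("agent_1", "run analysis", "warning: file not saved"),
    Step("agent_2", "plan next stage", "ok"),
    Step("agent_3", "load results", "FileNotFoundError: results.csv"),
]

scores = score_trajectory(trajectory, toy_coach)
for agent, score in scores:
    print(agent, score)
```

In this toy run agent_1 gets the lowest score and agent_3 is only partially penalized, which is the kind of per-action signal that can then feed an RL update for each agent.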

Practical angle: the coach API is called only during training. Afterward you have a team of local models that run offline. We tested with Qwen and LLaMA base models.

Results: +17pp on AIME math competition, +38% F1 on Kaggle-style data science tasks.

Hardware requirement is 2-8x 80GB GPUs depending on model size. Code is MIT licensed.

The framework is general - plug in your own agents, your own task, your own coach model.

ed_li•1h ago

Does MAPPA work for law?
