fp.

How it works: Agents execute tasks, reflect on what worked/failed, and curate a "playbook" of strategies. All from execution feedback - no training data needed.
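A minimal sketch of that execute / reflect / curate loop, for illustration only. Everything here is an assumption rather than the project's actual API: `Playbook`, `learn_from_task`, and the toy `toy_execute` / `toy_reflect` functions are hypothetical stand-ins for what would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Playbook:
    """Accumulated strategies, stored as plain-text bullets (hypothetical structure)."""
    strategies: List[str] = field(default_factory=list)


def learn_from_task(
    task: str,
    playbook: Playbook,
    execute: Callable[[str, List[str]], Tuple[str, bool]],
    reflect: Callable[[str, str, bool], str],
) -> bool:
    """Run one execute -> reflect -> curate step; return whether the task succeeded."""
    output, succeeded = execute(task, playbook.strategies)  # execute the task
    lesson = reflect(task, output, succeeded)               # reflect on the outcome
    if lesson and lesson not in playbook.strategies:        # curate: append new lessons only
        playbook.strategies.append(lesson)
    return succeeded


# Toy stand-ins for LLM calls so the sketch runs offline.
def toy_execute(task: str, strategies: List[str]) -> Tuple[str, bool]:
    knows_trick = "validate input before parsing" in strategies
    return ("parsed" if knows_trick else "crashed"), knows_trick


def toy_reflect(task: str, output: str, succeeded: bool) -> str:
    # An empty string means "nothing new to learn from this run".
    return "" if succeeded else "validate input before parsing"


playbook = Playbook()
first = learn_from_task("parse user JSON", playbook, toy_execute, toy_reflect)
second = learn_from_task("parse user JSON", playbook, toy_execute, toy_reflect)
print(first, second, playbook.strategies)
```

The first attempt fails and leaves a lesson in the playbook; the second attempt reads that lesson and succeeds, with no labeled data involved.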

Happy to answer questions about the implementation or the research!

Comments

vebgen•3mo ago

This is fascinating! The "evolving playbook" approach resonates with challenges we've been tackling building an AI agent for Django development.

A few questions about your implementation:

1. How do you handle the balance between delta updates and full context rewrites when the playbook grows large? We've found that keeping detailed history helps with debugging but can bloat context quickly.

2. The Generator/Reflector/Curator separation is elegant. Did you implement these as separate LLM calls or different prompting strategies on the same model? We use a similar dual-agent pattern (planner + executor) and the coordination overhead is non-trivial.

3. Most interesting part: "natural execution feedback without labeled supervision." How do you define success/failure signals for the Reflector in ambiguous cases? For code generation, it's easy (tests pass/fail), but for other domains it seems trickier.

The +10.6% improvement on agent tasks is impressive - definitely checking out the paper. The brevity bias problem you mention is real - we've noticed agents dropping important context details when trying to "summarize efficiently."

kayba•3mo ago

Thanks for the great questions! Here's how we're tackling these:

1. Context growth management:

We avoid full context rewrites entirely: they cause context collapse, where the LLM compresses away important details. Instead, we use delta updates as the foundation and are exploring:

- Semantic de-duplication to remove redundancy

- Keeping deltas as the source of truth with optional summarization layers on top

- Pre-filtering the playbook to feed the model a more focused version, with tooling to let it explore further when needed

Delta updates remain our core principle, but we're actively working on preventing context bloat as playbooks scale.
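To make the delta idea concrete, here is a rough sketch under stated assumptions. It is not our library's code: `apply_delta` and `is_near_duplicate` are hypothetical names, and plain string similarity from the standard library's `difflib` stands in for real semantic (embedding-based) de-duplication.

```python
from difflib import SequenceMatcher
from typing import List


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Crude stand-in for semantic similarity: character-level match ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def apply_delta(playbook: List[str], delta: List[str], threshold: float = 0.85) -> List[str]:
    """Append only the delta entries that are not near-duplicates of existing ones.

    Existing entries are never rewritten or summarized, so details already
    in the playbook cannot be compressed away by an update.
    """
    updated = list(playbook)
    for entry in delta:
        if not any(is_near_duplicate(entry, existing, threshold) for existing in updated):
            updated.append(entry)
    return updated


playbook = ["Run migrations before running tests."]
delta = [
    "Run migrations before running tests",           # near-duplicate, skipped
    "Pin dependency versions in requirements.txt.",  # new, appended
]
playbook = apply_delta(playbook, delta)
print(playbook)
```

The threshold of 0.85 is an arbitrary choice for the example; a real setup would tune it, or replace the similarity function altogether.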

2. Role separation:

Our library lets you select different models for each role, with prompts specifically tailored to each function. So far we've mostly used the same model for all three roles, but we're actively exploring model mixing as a promising direction.

3. Success signals:

The system shows strong self-assessment capabilities using execution feedback (code pass/fail, API responses, and model interactions with the environment). However, you're right that ambiguous domains are trickier; this is still an open challenge for us. Our vision is to pre-seed domain knowledge through curated playbooks or training samples, then let models self-explore and discover their own success patterns over time.
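For the easy case (code pass/fail), the signal can be as simple as running a check and recording what happened. This is a hedged sketch, not our implementation: `Feedback`, `collect_feedback`, and the `slugify` example are all hypothetical names made up for illustration.

```python
import traceback
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feedback:
    """What a reflection step would receive: an outcome plus supporting detail."""
    succeeded: bool
    detail: str


def collect_feedback(check: Callable[[], None]) -> Feedback:
    """Run a check and turn its outcome into an unlabeled success/failure signal."""
    try:
        check()
    except Exception:
        # The traceback text is the "natural" feedback; nobody labeled it by hand.
        return Feedback(False, traceback.format_exc(limit=1))
    return Feedback(True, "check passed")


# Hypothetical code under test, standing in for agent-generated code.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")


def passing_check() -> None:
    assert slugify("Hello World") == "hello-world"


def failing_check() -> None:
    # A double space yields "hello--world", so this assertion fails.
    assert slugify("Hello  World") == "hello-world"


good = collect_feedback(passing_check)
bad = collect_feedback(failing_check)
print(good.succeeded, bad.succeeded)
```

Ambiguous domains have no equivalent of `check`, which is exactly where this gets hard.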

What I'm curious about:

- What feedback signals work for your Django agent?

- How do you handle planner-executor coordination overhead?

- Have you hit similar brevity bias issues?

Would love to continue this conversation on Discord if you're interested: https://discord.com/invite/mqCqH7sTyK

jimmySixDOF•3mo ago

This kind of DSPy/GEPA self-improvement loop keeps popping up and adding a few points, but the cost (API and wall clock) also means you use it where a repeatable task/prompt/context needs optimizing and you can afford to search for better templates.

kayba•3mo ago

You're right that cost and latency are important considerations. However, the research shows this isn't just about finding better templates; it's about enabling agentic systems to learn and improve from their previous attempts and failures.

We believe in-context learning is one of the missing pieces to make agentic systems feasible in production. The key is that systems can adapt without expensive fine-tuning or retraining. The paper shows *86.9% lower adaptation latency* and significant reductions in rollout costs compared to existing methods, making this approach more practical than previous optimization techniques.

The real value is in systems that progressively get better at their tasks through experience, not just one-time prompt optimization.

If you want to continue this conversation just hit me up on Discord: https://discord.com/invite/mqCqH7sTyK

jimmySixDOF•3mo ago

I did look into DataRobot's Syftr, which points at the same problem but is a lot heavier. I definitely like that with your approach it's at least easy to get a basic version up and start checking the results right away!