I've realized that a 3,000-token system prompt isn't "logic"; it's legacy code that no one wants to touch. It's brittle, hard to test, and expensive to run. It is Technical Debt.
My thesis is that we need to stop treating prompts as the "program" and start treating them as temporary specs that eventually get compiled into the model weights via fine-tuning.
I built Steer (open source) to automate this "refactoring" process. It helps you climb the "Deliberation Ladder":
1. The Floor (Validity): Use Steer's deterministic verifiers (regex, AST, JSON Schema) to block objective failures in real time. Don't ask an LLM if JSON is valid; check it with code (see the first sketch below).
2. The Ceiling (Quality): Use `steer export` to turn those captured failures into a fine-tuning dataset, training the model to handle nuance and "vibes" without a massive prompt (see the second sketch below).
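To make the floor concrete, here's a minimal sketch of what a deterministic verifier can look like in plain Python. This is not Steer's actual API; the schema, function name, and rules are illustrative, using the off-the-shelf `jsonschema` library:

```python
import json
import re
import jsonschema  # pip install jsonschema

# Illustrative output contract: the model must return
# {"action": ..., "confidence": ...} and nothing else.
SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["approve", "reject", "escalate"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["action", "confidence"],
}

def verify(raw: str) -> list[str]:
    """Deterministic checks: no LLM judge, just code. Returns failure reasons."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]

    failures = []
    try:
        jsonschema.validate(payload, SCHEMA)
    except jsonschema.ValidationError as e:
        failures.append(f"schema violation: {e.message}")

    # Example regex rule: block markdown fences leaking into the output.
    if re.search(r"`{3}", raw):
        failures.append("output contains markdown fences")
    return failures
```

Every check here is binary and repeatable, so a failure can be blocked at request time and logged for later.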
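And for the ceiling, a hedged sketch of the export idea: turning logged failures plus their human corrections into chat-format JSONL for fine-tuning. The record fields (`input`, `corrected_output`) are assumptions for illustration, not Steer's actual export schema:

```python
import json

def export_finetune_dataset(failure_log: str, out_path: str) -> None:
    """Convert captured failures + corrections into chat-format JSONL.

    Assumes each logged record is a JSON line with 'input' and
    'corrected_output' fields -- an illustrative layout, not Steer's.
    """
    with open(failure_log) as f, open(out_path, "w") as out:
        for line in f:
            rec = json.loads(line)
            example = {
                "messages": [
                    {"role": "user", "content": rec["input"]},
                    # Train on the correction, not the original failure.
                    {"role": "assistant", "content": rec["corrected_output"]},
                ]
            }
            out.write(json.dumps(example) + "\n")
```

Once the model has been tuned on enough of these corrections, the prompt instructions that guarded against them can be deleted: the spec has been compiled into the weights.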
Curious if others are seeing this "Prompt Bloat" in production?
Repo: https://github.com/imtt-dev/steer