I also came across this pragmatic and grounded paper on building reliable, multi-step LLM programs. Instead of just chaining prompts, it treats the entire workflow as a single, typed probabilistic program where each step is a small, trainable PEFT module. The goal is to enforce correctness through gradient-based adaptation rather than prompt optimization alone.
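To make that concrete, here is a minimal sketch of what "each step is a typed, trainable PEFT module" could look like. This is my own illustration, not the paper's API: it assumes Hugging Face PEFT with LoRA adapters and uses Mistral-7B purely as a stand-in for "a 7B base model"; the names `Question`, `SympyProgram`, and `run_solve_step` are invented.

```python
# Illustrative sketch (not the paper's code): one workflow step = a typed
# signature bound to its own small LoRA adapter on a shared, frozen base model.
from dataclasses import dataclass

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model


@dataclass
class Question:          # input type of this step
    text: str


@dataclass
class SympyProgram:      # declared output type: a SymPy snippet the harness can execute
    source: str


BASE = "mistralai/Mistral-7B-v0.1"   # stand-in for whatever 7B checkpoint the paper uses
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# One lightweight trainable adapter for this step; the base weights stay frozen.
solve_step = get_peft_model(
    base_model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]),
)


def run_solve_step(q: Question) -> SympyProgram:
    """Typed wrapper: call the adapted model, then coerce the raw text into the declared type."""
    prompt = f"Translate to a SymPy program:\n{q.text}\nProgram:\n"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = solve_step.generate(input_ids=ids, max_new_tokens=256)
    text = tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    return SympyProgram(source=text)  # a real system would parse/validate here, not just wrap
```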
The paper highlights a few key results that make this seem particularly practical:
* Significant gains over prompt optimization: On a structured symbolic math generation task (MGSM-SymPy), their method achieved 75.9% accuracy, while a strong DSPy baseline with constrained decoding scored 57.1%. The paper reports that TACS (their method) consistently outperforms prompt-optimization baselines, especially on smaller models and highly structured tasks.
* Makes smaller models viable: It shows how a 7B model that initially produced invalid, unparsable outputs 83% of the time was trained to be type-compliant. After just one epoch, the parsing failure rate dropped to 1%. This suggests adaptation can enforce correctness where prompting alone fails.
* A more principled approach: The core idea is to move away from brittle "prompt-hacking". You define a workflow graph with explicit input/output types, and the framework trains the lightweight adapters to respect those types (a rough sketch of the wiring is below). This allows for principled training on latent variables (like chain-of-thought steps) without needing direct supervision for them.
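On the workflow-graph side, here is a rough sketch of how explicit types at each edge might be checked, with the chain-of-thought as a latent intermediate value that never gets direct labels. Everything here (`Reasoning`, `Answer`, `typed_edge`, `pipeline`) is my own naming, not the framework's, and in the actual system a parse or type failure would presumably feed the adaptation objective rather than simply raise an error.

```python
# Illustrative only: a two-step typed workflow graph where the intermediate
# chain-of-thought is latent (no direct supervision on it).
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reasoning:   # latent intermediate type (chain-of-thought)
    steps: str


@dataclass
class Answer:      # final, supervised output type
    value: int


def typed_edge(fn: Callable, out_type: type) -> Callable:
    """Wrap a step so the graph rejects outputs that don't match the declared type."""
    def run(x):
        y = fn(x)
        if not isinstance(y, out_type):
            # In the paper's setup this kind of failure would drive adapter training;
            # here it just surfaces the type violation.
            raise TypeError(f"{fn.__name__} produced {type(y).__name__}, expected {out_type.__name__}")
        return y
    return run


def pipeline(question, reason_step: Callable, answer_step: Callable) -> Answer:
    # reason_step and answer_step would each be a PEFT-adapted model call (as sketched earlier);
    # they are passed in as placeholders so only the graph wiring is shown.
    r = typed_edge(reason_step, Reasoning)(question)   # latent: checked for type, not content
    a = typed_edge(answer_step, Answer)(r)             # only the final answer is compared to labels
    return a
```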
It seems like a solid step towards making complex LLM composition more of an engineering discipline: less about finding the "magic prompt" and more about training small, specialized modules to be verifiably correct components in a larger system.