Does the 'user' agent get fed a specific chunk of text to formulate its questions, and does the 'assistant' agent get that exact same chunk to reply? If they're both looking at the identical text, have you thought about injecting some noise or unrelated distractor chunks into the assistant's context? Might be a solid way to make the resulting SFT data more robust against hallucinations.
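The distractor idea above is easy to prototype outside the library. Here's a minimal sketch, assuming chunks are plain strings; `build_assistant_context` is a hypothetical helper, not part of AfterImage:

```python
import random

def build_assistant_context(gold_chunk, corpus_chunks, n_distractors=2, seed=0):
    """Mix the grounding chunk with unrelated distractor chunks and shuffle,
    so the assistant must locate the relevant evidence rather than assume
    everything in context is on-topic."""
    rng = random.Random(seed)
    candidates = [c for c in corpus_chunks if c != gold_chunk]
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    context = distractors + [gold_chunk]
    rng.shuffle(context)
    return "\n\n---\n\n".join(context)

corpus = ["Chunk about refunds.", "Chunk about shipping.", "Chunk about warranties."]
ctx = build_assistant_context("Chunk about refunds.", corpus, n_distractors=2)
```

The user agent would still see only the gold chunk; only the assistant's context gets padded.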
monatis•2h ago
If you’re working with internal docs, regulatory text, or technical manuals, there’s plenty of material but zero multi-turn chat logs. Flattening that material into standard single-turn instruction/response pairs produces models that sound like templates and never learn how users actually ask for clarification or push back.
So we open-sourced a small, opinionated library called AfterImage.
It generates synthetic multi-turn conversations grounded in a corpus you provide. The architecture is straightforward:

- A simulated user ("Correspondent") with optional persona variation
- A simulated assistant ("Respondent")
- Both strictly grounded via sampled source material
- Output written directly to JSONL for your SFT (Supervised Fine-Tuning) / eval pipelines
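In generic terms, that architecture is an alternating two-agent loop over a sampled chunk, emitting one JSONL line per conversation. A minimal sketch, assuming stub model calls (`fake_user_llm` and `fake_assistant_llm` are hypothetical stand-ins, not AfterImage's actual API):

```python
import json

def fake_user_llm(chunk, history):
    # Stand-in for the Correspondent model call.
    return f"Question {len(history) // 2 + 1} about: {chunk[:20]}"

def fake_assistant_llm(chunk, history):
    # Stand-in for the Respondent model call, grounded in the same chunk.
    return f"Answer grounded in: {chunk[:20]}"

def generate_conversation(chunk, turns=2,
                          user_llm=fake_user_llm,
                          assistant_llm=fake_assistant_llm):
    """Alternate user/assistant turns, both grounded in the sampled chunk."""
    messages = []
    for _ in range(turns):
        messages.append({"role": "user",
                         "content": user_llm(chunk, messages)})
        messages.append({"role": "assistant",
                         "content": assistant_llm(chunk, messages)})
    return {"messages": messages}

record = generate_conversation("Section 4.2: refunds are issued within 14 days.")
line = json.dumps(record)  # one JSONL line per conversation
```

Swapping the stubs for real model calls (with persona text prepended to the user prompt) gives the basic Correspondent/Respondent shape.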
*Why build this?* The narrow bet here is that multi-turn dialogue is its own distinct data problem. There are already great general synthetic data tools (distilabel, synthetic-data-kit), and we aren't competing with them. AfterImage prioritizes a composable design in which generation can be customized via callbacks. For example, you can connect it to various data sources such as local files or Qdrant collections, or choose retriever strategies for RAG or aggregation methods for composite evaluation.
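To illustrate the callback-composability idea in generic terms (all names here are hypothetical, not the library's real API), a config could simply hold swappable functions for loading and retrieval:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GenerationConfig:
    # Callbacks let you swap data sources and retrieval strategies
    # without touching the generation loop itself.
    load_chunks: Callable[[], List[str]]
    retrieve: Callable[[str, List[str]], List[str]]
    personas: List[str] = field(default_factory=lambda: ["concise", "skeptical"])

def load_from_files():
    # Could equally be a Qdrant collection reader, etc.
    return ["doc chunk A", "doc chunk B"]

def keyword_retrieve(query, chunks):
    # Toy retriever: keyword overlap, falling back to the first chunk.
    hits = [c for c in chunks if any(w in c for w in query.split())]
    return hits or chunks[:1]

cfg = GenerationConfig(load_chunks=load_from_files, retrieve=keyword_retrieve)
chunks = cfg.load_chunks()
hits = cfg.retrieve("A", chunks)
```

The generation loop only ever calls `cfg.load_chunks()` and `cfg.retrieve(...)`, so each piece can be replaced independently.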
*A few honest caveats:*

- We don’t have a strong published benchmark yet (semantic similarity only so far).
- Quality noticeably degrades and loops in longer conversations (more than about 5 turns). Fortunately, one to three turns covers most SFT use cases.