Does the 'user' agent get fed a specific chunk of text to formulate its questions, and does the 'assistant' agent get that exact same chunk to reply? If they're both looking at the identical text, have you thought about injecting some noise or unrelated distractor chunks into the assistant's context? Might be a solid way to make the resulting SFT data more robust against hallucinations.
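The distractor idea above is easy to prototype outside the library. Here's a minimal sketch, assuming chunks are plain strings; `build_assistant_context` is a hypothetical helper, not part of AfterImage:

```python
import random

def build_assistant_context(gold_chunk, corpus_chunks, n_distractors=2, seed=0):
    """Mix the grounding chunk with unrelated distractor chunks and shuffle,
    so the assistant must locate the relevant evidence rather than assume
    everything in context is on-topic."""
    rng = random.Random(seed)
    candidates = [c for c in corpus_chunks if c != gold_chunk]
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    context = distractors + [gold_chunk]
    rng.shuffle(context)
    return "\n\n---\n\n".join(context)

corpus = ["Chunk about refunds.", "Chunk about shipping.", "Chunk about warranties."]
ctx = build_assistant_context("Chunk about refunds.", corpus, n_distractors=2)
```

The user agent would still see only the gold chunk; only the assistant's context gets padded.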
monatis•2h ago
If you’re working with internal docs, regulatory text, or technical manuals, there’s plenty of material but zero multi-turn chat logs. Flattening that material into standard single-turn instruction/response pairs produces models that sound like templates and never learn how users actually ask for clarification or push back.
So we open-sourced a small, opinionated library called AfterImage.
It generates synthetic multi-turn conversations grounded in a corpus you provide. The architecture is straightforward:

- A simulated user ("Correspondent") with optional persona variation
- A simulated assistant ("Respondent")
- Both strictly grounded via sampled source material
- Output written directly to JSONL for your SFT (Supervised Fine-Tuning) / eval pipelines
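In generic terms, that architecture is an alternating two-agent loop over a sampled chunk, emitting one JSONL line per conversation. A minimal sketch, assuming stub model calls (`fake_user_llm` and `fake_assistant_llm` are hypothetical stand-ins, not AfterImage's actual API):

```python
import json

def fake_user_llm(chunk, history):
    # Stand-in for the Correspondent model call.
    return f"Question {len(history) // 2 + 1} about: {chunk[:20]}"

def fake_assistant_llm(chunk, history):
    # Stand-in for the Respondent model call, grounded in the same chunk.
    return f"Answer grounded in: {chunk[:20]}"

def generate_conversation(chunk, turns=2,
                          user_llm=fake_user_llm,
                          assistant_llm=fake_assistant_llm):
    """Alternate user/assistant turns, both grounded in the sampled chunk."""
    messages = []
    for _ in range(turns):
        messages.append({"role": "user",
                         "content": user_llm(chunk, messages)})
        messages.append({"role": "assistant",
                         "content": assistant_llm(chunk, messages)})
    return {"messages": messages}

record = generate_conversation("Section 4.2: refunds are issued within 14 days.")
line = json.dumps(record)  # one JSONL line per conversation
```

Swapping the stubs for real model calls (with persona text prepended to the user prompt) gives the basic Correspondent/Respondent shape.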
*Why build this?* The narrow bet here is that multi-turn dialogue is its own distinct data problem. There are already great general synthetic data tools (distilabel, synthetic-data-kit), and we aren't competing with them. AfterImage prioritizes a composable design in which generation can be customized via callbacks. For example, you can connect it to various data sources such as local files or Qdrant collections, or choose retriever strategies for RAG or aggregation methods for composite evaluation.
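To illustrate the callback-composability idea in generic terms (all names here are hypothetical, not the library's real API), a config could simply hold swappable functions for loading and retrieval:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GenerationConfig:
    # Callbacks let you swap data sources and retrieval strategies
    # without touching the generation loop itself.
    load_chunks: Callable[[], List[str]]
    retrieve: Callable[[str, List[str]], List[str]]
    personas: List[str] = field(default_factory=lambda: ["concise", "skeptical"])

def load_from_files():
    # Could equally be a Qdrant collection reader, etc.
    return ["doc chunk A", "doc chunk B"]

def keyword_retrieve(query, chunks):
    # Toy retriever: keyword overlap, falling back to the first chunk.
    hits = [c for c in chunks if any(w in c for w in query.split())]
    return hits or chunks[:1]

cfg = GenerationConfig(load_chunks=load_from_files, retrieve=keyword_retrieve)
chunks = cfg.load_chunks()
hits = cfg.retrieve("A", chunks)
```

The generation loop only ever calls `cfg.load_chunks()` and `cfg.retrieve(...)`, so each piece can be replaced independently.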
*A few honest caveats:*

- We don’t have a strong published benchmark yet (semantic similarity only so far).
- Quality noticeably degrades and loops in longer conversations (more than about 5 turns). Fortunately, one to three turns covers most SFT use cases.