There's not enough information out there on creating commercially viable datasets for LLMs. So here you go. It's the exact e2e pipeline I used for my last production model, which outputs LinkedIn posts that capture a specific author's writing style.
You could just as easily copy its approach to build a dataset for generating SVGs, Kubernetes deployment files, etc.
What's valuable is that this example guides you through the following steps (rough code sketches of each one follow the list):
1. Generating the “golden dataset” from raw data
2. Labeling obvious categorical features (tone, bullets, etc.)
3. Extracting non-deterministic features (topic, opinions)
4. Encoding tacit human style features (pacing, vocabulary richness, punctuation patterns, narrative flow, topic transitions)
5. Assembling a prompt-completion template an LLM can actually learn from
6. Running ablation studies and permutation/correlation analyses to validate feature impact
7. Training with SFT and GRPO, using custom reward functions that mirror the original features so the model learns why a feature matters, not just that it exists
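Step 1 in practice is mostly filtering and deduplication. A minimal sketch (the file names and JSONL shape here are made up, adapt to your scrape format):

```python
import json, hashlib

# Hypothetical input: one raw scraped post per line, {"text": ..., "author": ...}
def build_golden_dataset(raw_path: str, out_path: str, min_words: int = 40) -> None:
    seen = set()
    with open(raw_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            post = json.loads(line)
            text = post.get("text", "").strip()
            # Drop short posts and trivial reshares; keep only substantive original writing
            if len(text.split()) < min_words:
                continue
            # Dedupe on a content hash so repeated crossposts don't skew the style signal
            digest = hashlib.sha256(text.lower().encode()).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            f_out.write(json.dumps({"text": text, "author": post.get("author")}) + "\n")
```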
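Step 2, the obvious categorical features, can be pure regex because they're deterministic. Something like (the specific feature set is illustrative):

```python
import re

def label_categorical(text: str) -> dict:
    lines = text.splitlines()
    return {
        # Bullets: lines opening with -, *, •, or "1."-style numbering
        "has_bullets": any(re.match(r"\s*([-*•]|\d+\.)\s", l) for l in lines),
        "hashtag_count": len(re.findall(r"(?<!\w)#\w+", text)),
        "uses_emoji": bool(re.search(r"[\U0001F300-\U0001FAFF]", text)),
        "opens_with_question": lines[0].rstrip().endswith("?") if lines else False,
    }
```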
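Step 3 needs an LLM judge because topic and opinions aren't regex-able. A sketch using an OpenAI-compatible client (the model name, prompt, and output schema are my assumptions, not a fixed recipe):

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()

EXTRACT_PROMPT = (
    "Return JSON with keys 'topic' (short phrase) and 'opinions' "
    "(list of claims the author asserts as their own view).\n\nPost:\n{post}"
)

def extract_soft_features(text: str, model: str = "gpt-4o-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(post=text)}],
        response_format={"type": "json_object"},  # force parseable output
        temperature=0,  # reduce run-to-run drift on a non-deterministic task
    )
    return json.loads(resp.choices[0].message.content)
```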
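Step 4 is where most people stop too early. The tacit style stuff is computable if you pick concrete proxies; here's a stdlib-only sketch (which proxies you pick is a judgment call):

```python
import re
from statistics import mean, pstdev

def encode_style(text: str) -> dict:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    lengths = [len(s.split()) for s in sentences] or [0]
    return {
        # Pacing: short punchy sentences vs. long winding ones
        "avg_sentence_len": round(mean(lengths), 1),
        "sentence_len_std": round(pstdev(lengths), 1),
        # Vocabulary richness: type-token ratio (crude but cheap)
        "type_token_ratio": round(len(set(words)) / max(len(words), 1), 3),
        # Punctuation habits
        "ellipsis_count": text.count("..."),
        "exclamations_per_sentence": round(text.count("!") / max(len(sentences), 1), 2),
    }
```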
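Step 5 just merges everything into a prompt the model can condition on, with the untouched golden post as the completion. A sketch (the field names assume the three extractors above):

```python
import json

def to_training_example(post: dict) -> dict:
    feats = {**post["categorical"], **post["soft"], **post["style"]}
    prompt = (
        "Write a LinkedIn post in the author's voice.\n"
        f"Topic: {feats['topic']}\n"
        f"Opinions to assert: {'; '.join(feats['opinions'])}\n"
        f"Style constraints: {json.dumps({k: v for k, v in feats.items() if k not in ('topic', 'opinions')})}"
    )
    # Completion is the raw golden post, so the model learns the mapping
    # from explicit features to the writing they describe
    return {"prompt": prompt, "completion": post["text"]}
```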
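Step 6, permutation analysis, is the cheapest sanity check: shuffle one feature's values across prompts and see how much a held-out metric degrades. A sketch where `eval_loss` is a hypothetical callable you supply (it rebuilds prompts and scores them against a held-out model):

```python
import random

def permutation_impact(examples: list[dict], feature: str, eval_loss, seed: int = 0) -> float:
    """Shuffle one feature's values across examples; a big loss jump
    means the model was actually relying on that feature."""
    rng = random.Random(seed)
    baseline = eval_loss(examples)
    values = [ex["features"][feature] for ex in examples]
    rng.shuffle(values)
    shuffled = [
        {**ex, "features": {**ex["features"], feature: v}}
        for ex, v in zip(examples, values)
    ]
    return eval_loss(shuffled) - baseline
```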
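Step 7's key trick is that the GRPO reward reuses the exact feature extractors from dataset construction, so reward and labels can't drift apart. A sketch assuming TRL's GRPOTrainer convention, where reward functions receive completions plus extra dataset columns as keyword arguments (the `target_styles` column is my assumption):

```python
def style_reward(completions: list[str], target_styles: list[dict], **kwargs) -> list[float]:
    """Score each completion by how closely its measured style features
    match the targets that were encoded into the prompt."""
    rewards = []
    for text, target in zip(completions, target_styles):
        got = encode_style(text)  # same extractor used to build the dataset
        errs = [abs(got[k] - target[k]) for k in target if k in got]
        rewards.append(-sum(errs) / max(len(errs), 1))
    return rewards
```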
This approach has been used in a few VC-backed AI-first startups I've consulted with. Have fun.