
Ask HN: Is synthetic data generation practical outside academia?

4•cpard•12h ago
I keep seeing synthetic data pipelines powering the latest LLM “breakthroughs”:

• TinyZero’s $30 fine-tuning workflow
• Sky-T1’s $450 reasoning-model build
• Meta AI’s Llama 3 herd (2024 paper detailing their synthetic-data training)
• Berkeley OpenThoughts (“Data Recipes for Reasoning Models”), published yesterday

There are also open-source toolkits you can experiment with:

https://github.com/meta-llama/synthetic-data-kit

https://github.com/bespokelabsai/curator

But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.

I’m curious:

1. Who is using synthetic-data pipelines in production today?

2. What tasks does it actually improve, e.g. fine-tuning smaller models for specific tasks?

Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!

Comments

sargstuff•10h ago
Non-AI specific 'synthetic data generation':

1) Historically used for processes that rely on time series, simulations and modeling, or forecasting, e.g. weather forecasting. See related points in [0].

2) a) Testing with actual 'sensitive' data may not be possible for security reasons (e.g. payroll information, stock market price influences) [1]. b) Insufficient or incomplete information, i.e. figuring out how well what's known matches 'reality', which may also suggest areas to look for 'missing' pieces in the model.

-----

[0] : https://www.oreilly.com/library/view/practical-time-series/9...

[1] : https://www.k2view.com/what-is-synthetic-data-generation/

cpard•10h ago
This is great. Synthetic data has been around for a long time. I think the difference with LLM-related cases is that in the past it was primarily structured data, which was a bit easier to approximate with some distribution or some grammar.
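To make the "approximate with some distribution" idea concrete, here is a minimal sketch that samples fake payroll-style rows from simple distributions. The function name, field names, department list, and log-normal parameters are all assumptions chosen for illustration; they are not taken from any of the tools linked in this thread.

```python
import random


def synthetic_payroll(n, seed=0):
    """Sample n fake payroll rows from simple, hand-picked distributions.

    Everything here is hypothetical: the schema, the department list, and
    the log-normal parameters are illustrative assumptions only.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    departments = ["engineering", "sales", "support"]
    rows = []
    for i in range(n):
        rows.append({
            "employee_id": f"E{i:05d}",
            "department": rng.choice(departments),
            # Salaries are often modeled as right-skewed, so a log-normal
            # draw is a common first approximation.
            "salary": round(rng.lognormvariate(11.0, 0.35), 2),
        })
    return rows


if __name__ == "__main__":
    for row in synthetic_payroll(3):
        print(row)
```

Rows like these can stand in for sensitive records in tests, though a real pipeline would fit the distributions to the actual data rather than hard-code them.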

With synthetic data for large language models, it’s more about QA pairs and reasoning trails for solving complicated problems.
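As a rough sketch of what such records can look like, the snippet below builds question/reasoning/answer examples for toy arithmetic problems. In an actual LLM pipeline a model would typically write the question and the reasoning steps; a template stands in for the model here so the example runs on its own. The record fields and function names are assumptions, not a format used by any specific toolkit.

```python
import json
import random


def make_example(rng):
    """Build one QA record with a step-by-step reasoning list.

    The answer is computed in code, so every record can be checked.
    """
    a, b, c = rng.randint(2, 9), rng.randint(2, 9), rng.randint(2, 9)
    product = a * b
    answer = product + c
    return {
        "question": f"What is {a} * {b} + {c}?",
        "reasoning": [
            f"{a} * {b} = {product}",
            f"{product} + {c} = {answer}",
        ],
        "answer": str(answer),
    }


def make_dataset(n, seed=0):
    """Generate n records from a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


if __name__ == "__main__":
    # One JSON object per line is a common layout for fine-tuning data.
    for example in make_dataset(3):
        print(json.dumps(example))
```

Because the answers are verifiable, bad records can be filtered out automatically, which is one reason math and code tasks show up so often in synthetic reasoning datasets.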

sargstuff•2h ago
Non-physics Much Ado about Schrödinger's Cat. Just tool(s) for quickly building higher-order associations/abstractions from 'base term information' [1][2][3]. E.g. dynamically generate unique Catalan number(s) for a given Tromp lambda calculus as a way of reducing tree height / Lisp parentheses down to a single pair, while dynamically computing/recomputing the determinant (appropriate base / number symbols ratio) to minimize the length between parentheses.

----------------

[1] : I told AI to make me a protein. Here's what it came up with : https://www.nature.com/articles/d41586-025-01586-y

[2] : AI Models for Protein Structure Prediction : https://frontlinegenomics.com/ai-models-for-protein-structur...

[3] : AI model deciphers the code in proteins that tells them where to go : https://news.mit.edu/2025/ai-model-deciphers-code-proteins-t...
