
Ask HN: Is synthetic data generation practical outside academia?

4•cpard•11h ago
I keep seeing synthetic data pipelines powering the latest LLM “breakthroughs”:

• TinyZero’s $30 fine-tuning workflow
• Sky-T1’s $450 reasoning-model build
• Meta AI’s Llama 3 herd (2024 paper detailing their synthetic-data training)
• Berkeley OpenThoughts (“Data Recipes for Reasoning Models”), published yesterday

There are also open-source toolkits you can experiment with:

https://github.com/meta-llama/synthetic-data-kit

https://github.com/bespokelabsai/curator

But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.

I’m curious:

1. Who is using synthetic-data pipelines in production today?

2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?

Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!

Comments

sargstuff•9h ago
Non-AI specific 'synthetic data generation':

Historically used for processes that rely on time series, simulation and modeling, or forecasting, e.g. weather forecasting; see the related points in [0].

2) a) Testing with actual 'sensitive' data may not be possible for security reasons (e.g. payroll information, stock market price influences) [1]. b) Insufficient or incomplete information, i.e. figuring out how well what's known matches 'reality', which may also suggest where to look for 'missing' pieces in the model.

-----

[0] : https://www.oreilly.com/library/view/practical-time-series/9...

[1] : https://www.k2view.com/what-is-synthetic-data-generation/

cpard•9h ago
This is great. Synthetic data has been around for a long time. I think the difference with LLM-related cases is that in the past it was primarily structured data, which was a bit easier to approximate with some distribution or some grammar.

With synthetic data for large language models, it’s more about QA pairs and reasoning trails for solving complicated problems.
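To make the contrast concrete, here is a minimal sketch of the older, distribution-based approach: fit a simple distribution to a handful of real values, then sample from it. The Gaussian assumption, the function names, and the salary figures are all hypothetical choices for illustration, not taken from any of the tools mentioned in this thread.

```python
import random
import statistics


def fit_gaussian(samples):
    """Estimate the mean and sample standard deviation of the observations."""
    return statistics.mean(samples), statistics.stdev(samples)


def synthesize(samples, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the samples.

    Assumes the real data is roughly normal, which is a simplification.
    """
    mu, sigma = fit_gaussian(samples)
    rng = random.Random(seed)  # seeded so the output is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" salaries standing in for sensitive payroll data.
real_salaries = [52_000, 61_500, 58_200, 75_000, 49_900, 66_300]
synthetic_salaries = synthesize(real_salaries, n=1000)
```

The synthetic values follow the fitted mean and spread but match no real record, which is roughly why this style works for structured data. QA pairs and reasoning chains have no comparably simple distribution to sample from, so LLM pipelines typically generate them with another model instead.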

sargstuff•1h ago
Non-physics Much Ado about Schrödinger's Cat. Just tool(s) for quickly building higher-order associations/abstractions from 'base term information' [1][2][3]. E.g. dynamically generating unique Catalan number(s) for given Tromp lambda calculi as a way of reducing tree height/Lisp parentheses down to a single pair, while dynamically computing/recomputing the determinant (appropriate base / number symbols ratio) to minimize the length between parentheses.

----------------

[1] : I told AI to make me a protein. Here's what it came up with : https://www.nature.com/articles/d41586-025-01586-y

[2] : AI Models for Protein Structure Prediction : https://frontlinegenomics.com/ai-models-for-protein-structur...

[3] : AI model deciphers the code in proteins that tells them where to go : https://news.mit.edu/2025/ai-model-deciphers-code-proteins-t...
