Ask HN: Is synthetic data generation practical outside academia?

5•cpard•8mo ago
I keep seeing synthetic data pipelines powering the latest LLM “breakthroughs”:

• TinyZero’s $30 fine-tuning workflow
• Sky-T1’s $450 reasoning-model build
• Meta AI’s Llama 3 herd (2024 paper detailing their synthetic-data training)
• Berkeley OpenThoughts (“Data Recipes for Reasoning Models”), published yesterday

There are also open-source toolkits you can experiment with:

https://github.com/meta-llama/synthetic-data-kit
https://github.com/bespokelabsai/curator
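For concreteness, here is a minimal, self-contained sketch of what such a QA-pair pipeline boils down to. `fake_llm` is a deterministic stand-in for the model call that toolkits like synthetic-data-kit or curator would make, and the question templates are invented for illustration:

```python
import json
import random

def fake_llm(prompt: str) -> str:
    # Deterministic stand-in for a real LLM call.
    return f"Answer derived from: {prompt}"

def make_qa_pairs(documents, n_per_doc=2, seed=0):
    """Generate (question, answer) records from source documents."""
    rng = random.Random(seed)
    templates = [
        "What is the main point of: {snippet}",
        "Summarize: {snippet}",
    ]
    records = []
    for doc in documents:
        for _ in range(n_per_doc):
            question = rng.choice(templates).format(snippet=doc[:40])
            records.append({"question": question, "answer": fake_llm(question)})
    return records

docs = ["Synthetic data can bootstrap fine-tuning when real labels are scarce."]
pairs = make_qa_pairs(docs)
print(json.dumps(pairs, indent=2))
```

The real pipelines add the hard parts this sketch omits: prompt engineering, deduplication, and quality filtering of the generated pairs.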

But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.

I’m curious:

1. Who is using synthetic-data pipelines in production today?

2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?

Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!

Comments

sargstuff•8mo ago
Non-AI-specific 'synthetic data generation':

1) historically used for processes that rely on time series / simulation & modeling / forecasting, e.g. weather forecasting; related points in [0]

2) a) Testing with actual 'sensitive' data may not be possible for security reasons (e.g. payroll information, stock market price influences) [1]. b) Insufficient/incomplete information, i.e. figuring out how well what's known matches 'reality', which may also suggest where to look for 'missing' pieces in the model.
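The sensitive-data case in 2a can be sketched as fabricating records that preserve the schema and rough ranges of a real table without containing any real person. The field names and value ranges below are invented for illustration:

```python
import random

def fake_payroll(n, seed=0):
    """Fabricate payroll-shaped test records with no real employees."""
    rng = random.Random(seed)
    first = ["Alex", "Sam", "Jordan", "Riley"]
    last = ["Lee", "Garcia", "Chen", "Okafor"]
    return [
        {
            "employee_id": 1000 + i,
            "name": f"{rng.choice(first)} {rng.choice(last)}",
            "salary": rng.randrange(40_000, 160_000, 500),
        }
        for i in range(n)
    ]

print(fake_payroll(3))
```

Real tools in this space (e.g. for test-data management) additionally preserve cross-table referential integrity and column correlations, which this sketch ignores.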

-----

[0] : https://www.oreilly.com/library/view/practical-time-series/9...

[1] : https://www.k2view.com/what-is-synthetic-data-generation/

cpard•8mo ago
This is great. Synthetic data has been around for a long time; I think the difference in the LLM-related cases is that in the past it was primarily structured data, which was a bit easier to approximate with some distribution or grammar.

With synthetic data for large language models it’s more about QA pairs and reasoning trails for solving complicated problems.
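The "approximate with some distribution" approach for structured data can be shown in a few lines: fit simple per-column distributions to real rows, then sample new rows. The example table (age, salary) and the independent-normal assumption are purely illustrative:

```python
import random
import statistics

def fit_and_sample(real_rows, n, seed=0):
    """Fit a normal distribution per column, then sample n synthetic rows."""
    rng = random.Random(seed)
    columns = list(zip(*real_rows))  # column-major view of the table
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Invented example table: (age, salary) rows.
real = [[30, 52000], [41, 61000], [37, 58000], [29, 49000]]
synthetic = fit_and_sample(real, n=3)
print(synthetic)
```

Treating columns as independent normals is the crudest possible model; it is exactly this kind of approximation that works tolerably for tabular data but has no analogue for free-form QA pairs.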

sargstuff•8mo ago
Non-physics Much Ado About Schrödinger's Cat. Just tool(s) for quickly building higher-order associations/abstractions from 'base term information' [1][2][3]. E.g. dynamically generate unique Catalan number(s) for a given Tromp lambda calculus as a way of reducing tree height / Lisp parentheses down to a single pair, while dynamically computing/recomputing the determinant (appropriate base / number-symbols ratio) to minimize the length between parentheses.

----------------

[1] : I told AI to make me a protein. Here's what it came up with : https://www.nature.com/articles/d41586-025-01586-y

[2] : AI Models for Protein Structure Prediction : https://frontlinegenomics.com/ai-models-for-protein-structur...

[3] : AI model deciphers the code in proteins that tells them where to go : https://news.mit.edu/2025/ai-model-deciphers-code-proteins-t...

cpard•8mo ago
Love the references, but I’m having a hard time deciphering your comment. Quantum physics was always fascinating to me, but not always easy to comprehend, I guess.
sargstuff•7mo ago
yes/no. You don't know what Schrödinger's cat's state is until it is put into context. E.g. the fundamental theorem of calculus provides a symbolic evaluation context; but without expressions/'numbers', the theorem is just a 'Schrödinger's cat' in different clothing.

One can escape the original NIL/NULL/None issue by using boolean logic, but that implies rules.

The 'strange thing' about Schrödinger's cat: one can never be certain that one didn't pick the context before the cat existed, and/or the references after the deceased remains were no longer visible. So the exercise is arguably statistically skewed toward 3/4 deceased, 1/4 alive. Add statistical sampling, and one can get an approximation of where things might be relative to the cat's life span. This only works if one finds at least one instance of an 'alive cat' first. Much easier to just start with Boole's cat to avoid Schrödinger's cat issues (i.e. a lambda term). LLMs will happily supply the relevant Boole's cat expression with or without Schrödinger's cat input.

Pauli might consider that a half-baked cracker.

publicdaniel•8mo ago
It’s really useful for generating synthetic data for search and recommendations that you can use to train a smaller/faster model. This is especially useful if you don’t have lots of click-through data, or in cold-start scenarios. There are some good articles that cover this; if you’re interested, I’ll try to find them and share.
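A toy version of the cold-start idea: derive synthetic (query, document) training pairs from a catalog that has no click logs yet. In practice an LLM would write the queries; here, simple word dropout on titles keeps the sketch self-contained, and the catalog entry is invented:

```python
import random

def synthesize_queries(titles, n_per_title=2, seed=0):
    """Turn document titles into plausible search queries via word dropout."""
    rng = random.Random(seed)
    pairs = []
    for title in titles:
        words = title.lower().split()
        for _ in range(n_per_title):
            # Drop each word with 30% probability; keep at least one word.
            kept = [w for w in words if rng.random() > 0.3] or words[:1]
            pairs.append((" ".join(kept), title))
    return pairs

catalog = ["Wireless Noise Cancelling Headphones"]
print(synthesize_queries(catalog))
```

The resulting pairs can serve as positives for training a small retrieval or ranking model before any real user behavior exists.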
cpard•8mo ago
That would be amazing if you could share some references. Thank you!
publicdaniel•8mo ago
- https://scale.com/blog/synthetic-data-fine-tuning-llms

- https://eugeneyan.com/writing/recsys-llm/

- https://cookbook.openai.com/examples/sdg1

cpard•8mo ago
thank you Sir!
publicdaniel•8mo ago
I’m currently working on a document parsing engine for a specific type of document. The inputs are usually PDFs. I’m able to get great structured output from both the latest Gemini Flash models and the latest Llama Scout models. The best latency I get with Gemini is about 5 seconds end to end. With Llama hosted on Groq it’s about 3 seconds.

My use case is latency constrained, so I’m exploring fine tuning / distilling to see if I can get latency sub second. I imagine these are the kinds of scenarios where it’s still worth it to fine-tune and distill.

My plan is to generate a lot of synthetic training data using more capable slower foundation models and use that to train the smaller model.
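That distillation step is essentially: record the slow teacher model's structured outputs as fine-tuning examples for the smaller student. A hedged sketch, where `teacher` is a placeholder for the actual foundation-model extraction call and the record shape is assumed, not taken from the project:

```python
import json

def teacher(pdf_text: str) -> dict:
    # Placeholder for a slow, capable model's structured extraction.
    return {"title": pdf_text.split("\n")[0],
            "n_lines": len(pdf_text.splitlines())}

def build_distillation_set(documents, path="train.jsonl"):
    """Write (input, teacher output) pairs as JSONL fine-tuning data."""
    with open(path, "w") as f:
        for doc in documents:
            record = {"input": doc, "output": teacher(doc)}
            f.write(json.dumps(record) + "\n")
    return path
```

Each JSONL line then becomes one supervised example for the student model, which learns to reproduce the teacher's structured output at much lower latency.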

cpard•8mo ago
Do you use any framework to generate the data and how do you evaluate the quality of the generated data?
Jugurtha•8mo ago
When I was in EE at university, I worked on heart anomaly detection and multi-phase flow classification for oil & gas. The papers I was reading used synthetic data with a nice noise dust sprinkled on it. Meanwhile, I worked on data from hospitals acquired on restless, sweaty, hairy dudes with rusty, banged-up electrodes and abused probes.

Needless to say, the data I saw on these papers looked nothing like the data I worked with, whether from hospitals or what I saw at Schlumberger in the Sahara.

The real world tends to be ... interesting.
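The gap Jugurtha describes can be illustrated directly: paper-style synthetic signals add tidy Gaussian noise to a clean waveform, while field recordings carry baseline wander, motion-artifact spikes, and dropouts. All parameters below are invented for illustration, not taken from any real ECG model:

```python
import math
import random

def paper_style(n=200, seed=0):
    """Clean sinusoid plus small Gaussian noise, as in many papers."""
    rng = random.Random(seed)
    return [math.sin(0.1 * i) + rng.gauss(0, 0.05) for i in range(n)]

def field_style(n=200, seed=0):
    """Same signal with field-data pathologies layered on top."""
    rng = random.Random(seed)
    sig = []
    for i in range(n):
        x = math.sin(0.1 * i)
        x += 0.5 * math.sin(0.003 * i)      # slow baseline wander (contact drift)
        if rng.random() < 0.02:
            x += rng.uniform(-2, 2)         # motion-artifact spike
        if rng.random() < 0.01:
            x = 0.0                         # electrode dropout
        sig.append(x + rng.gauss(0, 0.05))
    return sig
```

A model trained only on `paper_style`-like data has never seen the spikes and dropouts that dominate `field_style`, which is one concrete way the papers' results failed to transfer.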

cpard•8mo ago
That makes sense. Do you think LLMs have changed, or could potentially change, that, ending up producing more realistic synthetic data than what you've seen in the past? I guess the data you were working with was more like time-series data, but still, if a large language model can be perceived as a universal approximator of some sort, it might be able to generate more realistic synthetic data than the approach you described, with noise dust sprinkled on the data.