
Real-world dataset creation, SFT fine-tuning, and GRPO alignment pipeline

https://github.com/jacobwarren/social-media-ai-engineering-etl
1•jwarren92•2h ago

Comments

jwarren92•2h ago
There isn't enough information out there on creating commercially viable datasets for LLMs, so here you go. It's the exact end-to-end pipeline I used for my last production model, which outputs LinkedIn posts that capture a unique writing style.

You could just as easily copy its approach to build a dataset for generating SVGs, Kubernetes deployment files, etc.

What's valuable is that this example guides you through:

1. Generating the “golden dataset” from raw data

2. Labeling obvious categorical features (tone, bullets, etc.)

3. Extracting non-deterministic features (topic, opinions)

4. Encoding tacit human style features (pacing, vocabulary richness, punctuation patterns, narrative flow, topic transitions)

5. Assembling a prompt-completion template an LLM can actually learn from

6. Running ablation studies and permutation/correlation analyses to validate feature impact

7. Training with SFT and GRPO, using custom reward functions that mirror the original features so the model learns why a feature matters, not just that it exists
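
To make steps 2, 4, 5, and 7 a bit more concrete, here is a minimal sketch of the idea, not the code from the repo. Every name in it (`extract_features`, `build_prompt`, `style_reward`), the choice of features, and the 0.5/0.5 weighting are assumptions made for illustration. It computes a few deterministic style features from a post, writes them into a prompt, and then scores a completion with a reward that recomputes the same features.

```python
import re


def extract_features(post: str) -> dict:
    """Compute a few simple, deterministic style features for one post."""
    lines = [ln for ln in post.splitlines() if ln.strip()]
    words = re.findall(r"[A-Za-z']+", post.lower())
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    return {
        # Categorical feature: does any line start with a bullet marker?
        "has_bullets": any(ln.lstrip().startswith(("-", "*", "•")) for ln in lines),
        # Vocabulary richness as a type/token ratio (unique words / all words).
        "vocab_richness": round(len(set(words)) / len(words), 2) if words else 0.0,
        # Rough pacing signal: average words per sentence.
        "avg_sentence_words": round(len(words) / len(sentences), 1) if sentences else 0.0,
        # Punctuation pattern: number of exclamation marks.
        "exclamations": post.count("!"),
    }


def build_prompt(topic: str, features: dict) -> str:
    """Turn a topic plus extracted features into the prompt half of a training pair."""
    bullets = "use bullet points" if features["has_bullets"] else "no bullet points"
    return (
        f"Write a LinkedIn post about: {topic}\n"
        f"Style: {bullets}; about {features['avg_sentence_words']} words per sentence."
    )


def style_reward(completion: str, target: dict) -> float:
    """Score a completion in [0, 1] by re-extracting the features the prompt asked for."""
    got = extract_features(completion)
    score = 0.5 if got["has_bullets"] == target["has_bullets"] else 0.0
    # Partial credit that shrinks as sentence length drifts from the target.
    gap = abs(got["avg_sentence_words"] - target["avg_sentence_words"])
    score += 0.5 * max(0.0, 1.0 - gap / max(target["avg_sentence_words"], 1.0))
    return score


post = "Shipping is hard.\n- Plan early\n- Test often\nIt pays off!"
target = extract_features(post)
prompt = build_prompt("shipping software", target)
reward = style_reward(post, target)
```

The point of the sketch is that the reward reuses the same extractor that produced the prompt's style instructions, so the training signal lines up with what the prompt asked for. A real pipeline would need many more features and a reward wrapper that matches whatever signature the GRPO trainer expects.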

This approach has been used in a few VC-backed AI-first startups I've consulted with. Have fun.
