Show HN: DataFlow – Open Tool for LLM data prep 10k synthetic > 1M generic data

https://github.com/OpenDCAI/DataFlow
1•Mey0320•1h ago
Hi HN, we are the OpenDCAI team from Peking University. We just released DataFlow, an open-source framework designed to make LLM data preparation as programmable and modular as model training.

The Problem: While model architectures are standardized (PyTorch/JAX), data preparation is still dominated by ad-hoc scripts and loosely defined workflows. Most existing tools focus on "cleaning" or "filtering" existing large datasets, but modern LLM training increasingly relies on complex synthetic data generation and iterative refinement.

Our Solution: DataFlow treats data processing like constructing a neural network. It provides a PyTorch-like programming interface where you compose Operators into Pipelines.

Key Technical Features:
- Modular Abstraction: Just like torch.nn.Module, we provide standard interfaces for Operators, Prompt Templates, and Pipelines.
- Rich Operator Zoo: Nearly 200 pre-built operators covering Text, Math, Code, Text-to-SQL, and RAG.
- DataFlow-Agent: An agentic layer (built on LangGraph) that translates natural language requirements directly into executable pipelines.
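To make the "compose Operators into Pipelines" idea concrete, here is a minimal plain-Python sketch of the pattern. It is not DataFlow's actual API: the names `Operator`, `Pipeline`, `StripWhitespace`, `MinLength`, and `Deduplicate` are hypothetical and chosen only for illustration. See the repo and docs for the real interfaces.

```python
from typing import Dict, List

Record = Dict[str, str]


class Operator:
    """Hypothetical base class: one processing step over a list of records."""

    def run(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.run(records)


class StripWhitespace(Operator):
    """Trim leading and trailing whitespace from each record's text."""

    def run(self, records: List[Record]) -> List[Record]:
        return [{**r, "text": r["text"].strip()} for r in records]


class MinLength(Operator):
    """Keep only records whose text has at least `min_chars` characters."""

    def __init__(self, min_chars: int):
        self.min_chars = min_chars

    def run(self, records: List[Record]) -> List[Record]:
        return [r for r in records if len(r["text"]) >= self.min_chars]


class Deduplicate(Operator):
    """Drop records whose text was already seen, keeping first occurrences."""

    def run(self, records: List[Record]) -> List[Record]:
        seen = set()
        unique = []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                unique.append(r)
        return unique


class Pipeline(Operator):
    """Run operators in order; a Pipeline is itself an Operator, so it nests."""

    def __init__(self, *operators: Operator):
        self.operators = list(operators)

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op(records)
        return records


pipeline = Pipeline(StripWhitespace(), MinLength(10), Deduplicate())
raw = [
    {"text": "  Explain gradient descent.  "},
    {"text": "Explain gradient descent."},
    {"text": "hi"},
]
cleaned = pipeline(raw)
print(cleaned)
```

Because `Pipeline` subclasses `Operator` in this sketch, a pipeline can be dropped into another pipeline as a single step, which is the same composition property that `torch.nn.Module` gives to model layers.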

Results: We found that data quality matters more than scale.
- 10k > 1M: A unified 10k-sample dataset produced by DataFlow enables base models (Qwen2/2.5) to surpass counterparts trained on 1M generic instruction samples (Infinity-Instruct).
- Code & SQL: Our pipelines achieved +7% improvement on code benchmarks and +3% execution accuracy in Text-to-SQL using significantly less data.

Links:
- Paper: https://arxiv.org/abs/2512.16676
- Repo: https://github.com/OpenDCAI/DataFlow
- Docs: https://opendcai.github.io/DataFlow-Doc/

We believe data engineering deserves the same level of rigorous abstraction as model architecture. We hope DataFlow can serve as a foundational substrate for future data-centric AI development.

Yann LeCun Is Raising Half a Billion Dollars to Build Nothing (Yet)

https://medium.com/@anwarzaid76/yann-lecun-is-raising-half-a-billion-dollars-to-build-nothing-yet...
1•MindBreaker2605•20s ago•0 comments

Show HN: No Fun Allowed

https://josevalerio.com/no-fun-allowed
1•josevalerio•3m ago•0 comments

Show HN: Zimage2.online – An AI image tool built on Alibaba's Z-Image model

https://zimage2.online/
1•chenliang001•5m ago•0 comments

The ML Trench

https://deep-ml-trench.vercel.app/
1•hexhowells•6m ago•0 comments

The iPhone 16e Is Good

https://manualdousuario.net/en/iphone-16e-is-good-actually/
1•rpgbr•7m ago•0 comments

AI in 2026 and beyond ⊗ Bioregionalism's tech-driven revival

https://sentiers.media/ai-in-2026-and-beyond-bioregionalisms-tech-driven-revival-no-384/
1•speckx•8m ago•0 comments

Show HN: Dots: a bullet journal I built to understand my migraines

https://dotsjournal.app/
1•tubignaaso•9m ago•0 comments

US submarines are outnumbered in the Pacific. South Korea has a plan to help

https://www.cnn.com/2025/12/20/asia/south-korea-nuclear-powered-submarines-intl-hnk-ml-dst
1•breve•10m ago•0 comments

Construct in 2025: Year in Review

https://www.construct.net/en/blogs/construct-official-blog-1/construct-2025-year-review-1898
1•AshleysBrain•14m ago•0 comments

Teardown of the Gigaset CL660HX DECT phone and how to disable annoying flash LED

https://github.com/hn/gigaset-cl660hx
1•hn___•16m ago•0 comments

New mathematical framework reshapes debate over simulation hypothesis

https://www.santafe.edu/news-center/news/new-mathematical-framework-reshapes-debate-over-simulati...
2•Gooblebrai•16m ago•0 comments

Show HN: Pilotbook.pro – born from spending more time on paperwork than flying

https://pilotbook.pro/
1•j4nitor•17m ago•0 comments

A Knapsack Public Key Cryptosystem Based on Arithmetic in Finite Fields (1988) [pdf]

https://people.csail.mit.edu/rivest/pubs/CR88.pdf
1•keepamovin•19m ago•0 comments

Google Cloud Infrastructure 2025: The Year Kubernetes Got Boring

https://www.aimeemarieknight.com/Google-Cloud-Infrastructure-2025-The-Year-Kubernetes-Got-Boring/
1•speckx•20m ago•0 comments

Use Claude Code with OpenRouter

https://openrouter.ai/docs/guides/guides/claude-code-integration
2•Topfi•21m ago•0 comments

Show HN: Mntn v2.0 – CLI for system maintenance, backups, and dotfile management

https://github.com/alexandretrotel/mntn
1•alexandretrotel•21m ago•0 comments

Grappling with its worst drought in a century, Iraq bets on oil-for-water deal

https://www.cnn.com/2025/12/21/climate/iraqs-oil-water-turkey-intl-latam
1•breve•24m ago•0 comments

ISBN Visualization Showing 99_959_000 books

https://annas-archive.li/isbn-visualization/
10•simon04•27m ago•1 comment

How to not end up in a Louis Rossmann video

https://sschueller.github.io/posts/how-to-not-end-up-in-a-louis-rossmann-video/
1•sschueller•30m ago•0 comments

Org Social surpassed twtxt in activity

https://preview.org-social.org/?post=https%3A%2F%2Fhost.org-social.org%2Fandros%2Fsocial.org%2320...
1•andros•32m ago•0 comments

Practical Tips for Cheating at Design (2018)

https://medium.com/refactoring-ui/7-practical-tips-for-cheating-at-design-40c736799886
1•Tomte•34m ago•0 comments

Techniques in Persuasion from Antiquity (2023)

https://www.thecollector.com/persuasive-technques-antiquity/
1•Tomte•34m ago•0 comments

The Infinite Software Crisis – Jake Nations, Netflix [video]

https://www.youtube.com/watch?v=eIoohUmYpGI
1•ceyhunkazel•36m ago•0 comments

Darktable 5.4 Released

https://www.darktable.org/2025/12/darktable-5.4.0-released/
3•Derbasti•37m ago•0 comments

Standard Chartered halves BTC USD 2025 target and pushes $500K goal to 2030

https://economictimes.indiatimes.com/news/international/us/bitcoin-price-forecast-cut-to-100k-why...
1•janandonly•40m ago•0 comments

Show HN: I built an LLM agent that finds you online and roasts you

https://santa.veris.ai
2•_josh_meyer_•40m ago•0 comments

The QuickShot II Joystick Returns [video]

https://www.youtube.com/watch?v=IaUU4Es4dGU
1•doener•41m ago•0 comments

Unix Fourth Edition

http://squoze.net/UNIX/v4/README
2•naves•42m ago•0 comments

Show HN: OpenHands-AAAA

https://codeberg.org/erkinalp/OpenHands-AAAA
1•anticensor•43m ago•1 comment

Show HN: GoRay – Ray Core for Golang

https://github.com/ray4go/go-ray
2•Wang0618•44m ago•0 comments