The Problem: While model architectures are standardized (PyTorch/JAX), data preparation is still dominated by ad-hoc scripts and loosely defined workflows. Most existing tools focus on "cleaning" or "filtering" large, pre-existing datasets, but modern LLM training increasingly relies on complex synthetic data generation and iterative refinement.
Our Solution: DataFlow treats data processing like constructing a neural network. It provides a PyTorch-like programming interface where you compose Operators into Pipelines.
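To make the analogy concrete, here is a minimal sketch of the idea, assuming a hypothetical API: Operator plays the role of torch.nn.Module (subclasses implement run() the way layers implement forward()), and Pipeline chains operators the way nn.Sequential chains layers. None of these names are DataFlow's actual interface.

```python
# Minimal sketch of the PyTorch-like composition idea. All names here
# (Operator, Pipeline, Dedup, run) are illustrative, not DataFlow's API.
from abc import ABC, abstractmethod

Record = dict  # one data sample, e.g. {"prompt": "...", "response": "..."}

class Operator(ABC):
    """Analogue of torch.nn.Module: subclasses implement run() like forward()."""
    @abstractmethod
    def run(self, records: list[Record]) -> list[Record]: ...

class Dedup(Operator):
    """Toy operator: drop records whose prompt was already seen."""
    def run(self, records: list[Record]) -> list[Record]:
        seen, out = set(), []
        for r in records:
            if r["prompt"] not in seen:
                seen.add(r["prompt"])
                out.append(r)
        return out

class Pipeline:
    """Analogue of nn.Sequential: applies operators in order."""
    def __init__(self, *operators: Operator):
        self.operators = operators

    def run(self, records: list[Record]) -> list[Record]:
        for op in self.operators:
            records = op.run(records)
        return records

pipe = Pipeline(Dedup())
print(pipe.run([{"prompt": "hi"}, {"prompt": "hi"}]))  # -> [{'prompt': 'hi'}]
```

The payoff is the same as in PyTorch: each operator is unit-testable in isolation, and a pipeline is a declarative composition you can version and reuse.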
Key Technical Features:
- Modular Abstraction: Just like torch.nn.Module, we provide standard interfaces for Operators, Prompt Templates, and Pipelines.
- Rich Operator Zoo: Nearly 200 pre-built operators covering Text, Math, Code, Text-to-SQL, and RAG.
- DataFlow-Agent: An agentic layer (built on LangGraph) that translates natural-language requirements directly into executable pipelines (see the skeleton after this list).
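For the DataFlow-Agent, the skeleton below shows the general shape of a requirements-to-pipeline graph built on LangGraph. The two-node layout, the state fields, and the keyword-based planner are placeholder assumptions for illustration (a real agent would use LLM calls and an operator registry); this is not the actual DataFlow-Agent implementation.

```python
# Skeleton of a natural-language -> pipeline agent expressed as a LangGraph
# graph. Node names, state fields, and the keyword "planner" are placeholder
# assumptions; the real DataFlow-Agent is more involved.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    requirement: str  # the user's natural-language request
    plan: list[str]   # chosen operator names
    pipeline: str     # stand-in for the assembled, executable pipeline

def plan(state: AgentState) -> dict:
    # Stand-in for an LLM call that maps the requirement to operator names.
    ops = ["Dedup"] if "duplicate" in state["requirement"].lower() else []
    return {"plan": ops}

def assemble(state: AgentState) -> dict:
    # Stand-in: a real agent would instantiate Operator objects and compose
    # them into a Pipeline; here we just record the composition order.
    return {"pipeline": " -> ".join(state["plan"]) or "identity"}

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("assemble", assemble)
graph.set_entry_point("plan")
graph.add_edge("plan", "assemble")
graph.add_edge("assemble", END)
agent = graph.compile()

out = agent.invoke({"requirement": "Remove duplicate prompts", "plan": [], "pipeline": ""})
print(out["pipeline"])  # -> Dedup
```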
Results: We found that data quality matters more than scale.
- 10k > 1M: A unified 10k-sample dataset produced by DataFlow enables base models (Qwen2/2.5) to surpass counterparts trained on 1M generic instruction samples (Infinity-Instruct).
- Code & SQL: Our pipelines achieved a +7% improvement on code benchmarks and +3% execution accuracy on Text-to-SQL using significantly less data.
Links:
- Paper: https://arxiv.org/abs/2512.16676
- Repo: https://github.com/OpenDCAI/DataFlow
- Docs: https://opendcai.github.io/DataFlow-Doc/
We believe data engineering deserves the same level of rigorous abstraction as model architecture. We hope DataFlow can serve as a foundational substrate for future data-centric AI development.