The Problem: While model architectures are standardized (PyTorch/JAX), data preparation is still dominated by ad-hoc scripts and loosely defined workflows. Most existing tools focus on "cleaning" or "filtering" large, pre-existing datasets, but modern LLM training increasingly relies on complex synthetic data generation and iterative refinement.
Our Solution: DataFlow treats data processing like constructing a neural network. It provides a PyTorch-like programming interface where you compose Operators into Pipelines.
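To make the analogy concrete, here is a minimal sketch of the idea, assuming a hypothetical API: Operator plays the role of torch.nn.Module (subclasses implement run() the way layers implement forward()), and Pipeline chains operators the way nn.Sequential chains layers. None of these names are DataFlow's actual interface.

```python
# Minimal sketch of the PyTorch-like composition idea. All names here
# (Operator, Pipeline, Dedup, run) are illustrative, not DataFlow's API.
from abc import ABC, abstractmethod

Record = dict  # one data sample, e.g. {"prompt": "...", "response": "..."}

class Operator(ABC):
    """Analogue of torch.nn.Module: subclasses implement run() like forward()."""
    @abstractmethod
    def run(self, records: list[Record]) -> list[Record]: ...

class Dedup(Operator):
    """Toy operator: drop records whose prompt was already seen."""
    def run(self, records: list[Record]) -> list[Record]:
        seen, out = set(), []
        for r in records:
            if r["prompt"] not in seen:
                seen.add(r["prompt"])
                out.append(r)
        return out

class Pipeline:
    """Analogue of nn.Sequential: applies operators in order."""
    def __init__(self, *operators: Operator):
        self.operators = operators

    def run(self, records: list[Record]) -> list[Record]:
        for op in self.operators:
            records = op.run(records)
        return records

pipe = Pipeline(Dedup())
print(pipe.run([{"prompt": "hi"}, {"prompt": "hi"}]))  # -> [{'prompt': 'hi'}]
```

The payoff is the same as in PyTorch: each operator is unit-testable in isolation, and a pipeline is a declarative composition you can version and reuse.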
Key Technical Features:
- Modular Abstraction: Just like torch.nn.Module, we provide standard interfaces for Operators, Prompt Templates, and Pipelines.
- Rich Operator Zoo: Nearly 200 pre-built operators covering Text, Math, Code, Text-to-SQL, and RAG.
- DataFlow-Agent: An agentic layer (built on LangGraph) that translates natural-language requirements directly into executable pipelines (see the skeleton after this list).
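For the DataFlow-Agent, the skeleton below shows the general shape of a requirements-to-pipeline graph built on LangGraph. The two-node layout, the state fields, and the keyword-based planner are placeholder assumptions for illustration (a real agent would use LLM calls and an operator registry); this is not the actual DataFlow-Agent implementation.

```python
# Skeleton of a natural-language -> pipeline agent expressed as a LangGraph
# graph. Node names, state fields, and the keyword "planner" are placeholder
# assumptions; the real DataFlow-Agent is more involved.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    requirement: str  # the user's natural-language request
    plan: list[str]   # chosen operator names
    pipeline: str     # stand-in for the assembled, executable pipeline

def plan(state: AgentState) -> dict:
    # Stand-in for an LLM call that maps the requirement to operator names.
    ops = ["Dedup"] if "duplicate" in state["requirement"].lower() else []
    return {"plan": ops}

def assemble(state: AgentState) -> dict:
    # Stand-in: a real agent would instantiate Operator objects and compose
    # them into a Pipeline; here we just record the composition order.
    return {"pipeline": " -> ".join(state["plan"]) or "identity"}

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("assemble", assemble)
graph.set_entry_point("plan")
graph.add_edge("plan", "assemble")
graph.add_edge("assemble", END)
agent = graph.compile()

out = agent.invoke({"requirement": "Remove duplicate prompts", "plan": [], "pipeline": ""})
print(out["pipeline"])  # -> Dedup
```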
Results: We found that data quality matters more than scale.
- 10k > 1M: A unified 10k-sample dataset produced by DataFlow enables base models (Qwen2/2.5) to surpass counterparts trained on 1M generic instruction samples (Infinity-Instruct).
- Code & SQL: Our pipelines achieved a +7% improvement on code benchmarks and +3% execution accuracy on Text-to-SQL using significantly less data.
Links:
- Paper: https://arxiv.org/abs/2512.16676
- Repo: https://github.com/OpenDCAI/DataFlow
- Docs: https://opendcai.github.io/DataFlow-Doc/
We believe data engineering deserves the same level of rigorous abstraction as model architecture. We hope DataFlow can serve as a foundational substrate for future data-centric AI development.