Problem:
High-quality datasets are scarce, expensive, and often sensitive. When teams turn to synthetic data, the hard part isn't writing individual prompts; it's the end-to-end system: designing branching/looping workflows, coordinating multiple inference backends/APIs and tool calls, enforcing validation, schema compliance, and quality tagging at scale, and running fault-tolerant jobs with resumability, sharding, and streaming. Ad-hoc notebooks and scripts don't capture that lifecycle.
What SyGra is:
A graph-oriented framework: you define nodes (LLM calls, samplers, transforms, agents, subgraphs) and connect them with edges that support conditional branching, parallel fan-out, and loops. Author pipelines in low-code YAML (runnable from the CLI) or compose them in Python. Emphasis on structured outputs and reproducibility.
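To make the node/edge idea concrete, here's a rough sketch of what a branching pipeline could look like in YAML. The key names below are illustrative only, not SyGra's actual schema; see the README for the real format:

```yaml
# Hypothetical sketch of a graph pipeline. Key names are illustrative,
# not SyGra's actual YAML schema.
graph:
  nodes:
    sample_topic:        # sampler node: draws a topic from a seed list
      type: sampler
      source: topics.json
    generate:            # LLM node: produces a candidate record
      type: llm
      model: my-vllm-endpoint
      prompt: "Write a Q&A pair about {topic}"
    validate:            # transform node: checks schema compliance
      type: transform
      fn: validate_schema
  edges:
    - from: sample_topic
      to: generate
    - from: generate
      to: validate
    - from: validate     # conditional edge: loop back on failure
      to: generate
      condition: "not valid"
```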
Key capabilities:
- Graph model: reusable subgraphs; conditional/parallel edges; loops
- Quality: dual-stage quality tagging (heuristics + LLM-based scoring); OASST-style conversation formatting
- Backends: vLLM, Hugging Face TGI, Azure OpenAI, Ollama (Triton-compatible)
- Data I/O: Hugging Face datasets (read/write, streaming) + local files; schema + metadata tracking
- Execution: async runtime; checkpointing/resume; sharding support; multimodal inputs (image/audio/text); agent/tool nodes via LangGraph
- Reproducibility: deterministic configs, seeds, artifact paths, and provenance logs
- Modes: CLI (execute YAML graphs) or Python APIs (embed in notebooks/apps)
- License: Apache-2.0
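Independent of SyGra's own API (which the sketch below does not use), the core execution model behind the capabilities above can be shown in plain Python: nodes are callables over a shared state, and conditional edges pick the next node, which naturally expresses branching and validation/retry loops:

```python
# Minimal sketch of the graph-execution idea in plain Python (NOT SyGra's API):
# nodes transform a shared state dict; conditional edge functions choose the
# next node, which expresses both branching and retry loops.
from typing import Callable, Optional

Node = Callable[[dict], dict]
Edge = Callable[[dict], Optional[str]]  # returns next node name, or None to stop

def run_graph(nodes: dict[str, Node], edges: dict[str, Edge],
              start: str, state: dict, max_steps: int = 20) -> dict:
    """Execute nodes until an edge returns None or the step budget runs out."""
    current: Optional[str] = start
    for _ in range(max_steps):
        if current is None:
            break
        state = nodes[current](state)
        current = edges[current](state)
    return state

# Toy pipeline: generate -> validate, looping back until the record is valid.
def generate(state: dict) -> dict:
    state["attempts"] = state.get("attempts", 0) + 1
    # Stand-in for an LLM call; succeeds on the second attempt.
    state["record"] = {"text": "sample", "ok": state["attempts"] >= 2}
    return state

def validate(state: dict) -> dict:
    state["valid"] = state["record"]["ok"]
    return state

nodes = {"generate": generate, "validate": validate}
edges = {
    "generate": lambda s: "validate",
    "validate": lambda s: None if s["valid"] else "generate",  # loop on failure
}

final = run_graph(nodes, edges, "generate", {})
```

This is only the control-flow skeleton; the real framework layers the backends, checkpointing, and quality tagging listed above on top of it.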
Links:
- Repo & README: https://github.com/ServiceNow/SyGra
- PyPI: https://pypi.org/project/sygra/
- Paper (design rationale): https://arxiv.org/abs/2508.15432
Disclosure: I'm part of the team behind SyGra.