I built Misata because existing tools (Faker, Mimesis) are great for random rows but terrible for relational or temporal integrity. I needed to generate data for a dashboard where "Timesheets" must happen after "Project Start Date," and I wanted to define these rules via natural language.
How it works:
LLM Layer: Uses Groq/Llama-3.3 to parse a "story" into a JSON schema constraint config.
Simulation Layer: Uses vectorized NumPy (no per-row Python loops) to generate data. It builds a DAG of tables and generates them in dependency order, so parent rows always exist before the child rows that reference them (referential integrity).
Performance: Generates ~250k rows/sec on my M1 Air.
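To make the Simulation Layer idea concrete, here is a minimal sketch of the approach, not the actual simulator.py code. The table config, column names (start_date, work_date, project_id) and the helper functions generation_order and generate are all hypothetical, made up for illustration. It orders tables with the standard library's graphlib, then fills columns with NumPy so that every timesheet points at a real project and falls on or after that project's start date.

```python
from graphlib import TopologicalSorter

import numpy as np

# Hypothetical config: row count and parent table for each table.
TABLES = {
    "projects": {"rows": 1_000, "parent": None},
    "timesheets": {"rows": 50_000, "parent": "projects"},
}


def generation_order(tables):
    """Return table names so that every parent comes before its children."""
    graph = {
        name: ({spec["parent"]} if spec["parent"] else set())
        for name, spec in tables.items()
    }
    return list(TopologicalSorter(graph).static_order())


def generate(tables, seed=0):
    """Generate columns per table as NumPy arrays, parents first."""
    rng = np.random.default_rng(seed)
    data = {}
    for name in generation_order(tables):
        spec = tables[name]
        n = spec["rows"]
        cols = {"id": np.arange(n)}
        if spec["parent"] is None:
            # Parent table: start dates spread across 2023.
            days = rng.integers(0, 365, n).astype("timedelta64[D]")
            cols["start_date"] = np.datetime64("2023-01-01") + days
        else:
            parent = data[spec["parent"]]
            # Referential integrity: sample foreign keys from existing parent rows.
            fk = rng.integers(0, len(parent["id"]), n)
            cols["project_id"] = parent["id"][fk]
            # Temporal integrity: child date = parent start + non-negative offset.
            offset = rng.integers(0, 180, n).astype("timedelta64[D]")
            cols["work_date"] = parent["start_date"][fk] + offset
        data[name] = cols
    return data


data = generate(TABLES)
```

The only Python-level loop is over tables, not rows. Foreign keys and dates are produced by array indexing, which is what keeps generation fast as row counts grow.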
It’s early alpha. The "Graph Reverse Engineering" (describe a chart -> get data) is experimental but working for simple curves.
pip install misata
I’d love feedback on the simulator.py architecture. I’m currently keeping all data in memory (pandas), which hits a ceiling at ~10M rows. I’m thinking of moving to DuckDB for out-of-core generation next. Thoughts?