Show HN: I built an AI dataset generator

https://github.com/metabase/dataset-generator

99•matthewhefferon•6h ago

Comments

matthewhefferon•6h ago

I was tired of digging through Kaggle and writing prompts over and over just to get fake data for dashboards and demos. So I built a little tool to help me out.

It uses GPT-4o to generate a detailed schema and business rules based on a few dropdowns (like business type, schema structure, and row count). Then Faker fills in the rows using those rules, which keeps it fast and cheap.

You can preview the data, export as CSV or SQL, or spin up Metabase with one click to explore the data. It’s open-source, still in early stages, but wanted to share, get feedback and see how you'd improve it.

thenaturalist•4h ago

Congrats, thanks for shipping and open sourcing this!

Cool to see Metabase is enabling contributions to the ecosystem this way! :)

matthewhefferon•2h ago

No problem, thanks for taking a look!

margotli•5h ago

Feels like a useful tool for anyone learning analytics or just needing sample data to test with.

mritchie712•5h ago

I use this prompt to spin up demos for customers at https://www.definite.app/:

    @Web Do some research on https://somecompany.com and write up a detailed overview of what the company does. What might their database schema look like?

    I need you to build a mock database for them in duckdb for a demo

Then:

    Create a uv project and write a python script to add demo data. Use Faker.

    @Web research how many customers they have. Make the database to appropriate scale.

Only takes a few minutes in Cursor, should work just as well in Claude Code. It works really well for the companies core business, but I still need to create one to populate 3rd party sources (e.g. Stripe, Salesforce, Hubspot, etc.).

matthewhefferon•5h ago

Cool, I don’t do customer-specific demos, but I like this idea. I might add this use case as an option. Thanks for sharing!

b0a04gl•5h ago

seen this pattern a before too. faker holds shape without flow. real tables come from actions : retry, decline, manual review, all that. you just set col types, you might miss why the row even happened. gen needs to simulate behavior, not format

matthewhefferon•4h ago

That’s a solid callout, appreciate you pointing it out. I’ll definitely dig into that more.

ajd555•4h ago

Was looking for this exact comment. I completely agree with this method, especially if you're testing an entire flow, and not just a UI tool. You want to test the service that interfaces between the API and the dabatase.

I've been writing custom simulation agents (just simple go programs) that simulate different users of my system. I can scale appropriately and see test data flow in. If metabase could generate these simulation agents based on a schema and some instructions, now that would be quite neat! Good job on this first version of the tool, though!

tomrod•4h ago

The best synthetic data are those that capture ingestion and action, instead of just relationship.

Relationship is important, but your data structure might capture a virtually infinite number of unexpected behaviors that you would preferably call errors or bugs.

jasonthorsness•4h ago

AI is really good at this sort of thing; I've been using an LLM with Faker for some time to load data for demos into SingleStore: https://github.com/jasonthorsness/loadit

matthewhefferon•1h ago

Nice, I like the challenge video!

paxys•4h ago

Feature request - make the URL for the OpenAI API configurable. That way one can swap it out with Anthropic or any other LLM provider of their choice that provides an OpenAI-compatible API.

matthewhefferon•3h ago

I was actually thinking about this very feature in the shower this morning :)

wiradikusuma•4h ago

"Stack: OpenAI API (GPT-4o for data generation)" -- I wonder if someday we'll have a generic API like how it's done in Java (e.g., Servlet API implemented by Tomcat, JBoss etc), so everyone can use their favorite LLM instead of having to register each provider like streaming services e.g. Disney+, Netflix, etc.

matthewhefferon•1h ago

I hope so. I'm already subscribed to every streaming service, and my wallet can't handle all these LLMs too.

MattSayar•3h ago

I used Anthropic's new Claude API integration with artifacts to make a probably-worse version that you can play with (after logging in of course).

https://claude.ai/public/artifacts/eb7d8256-6d21-4c85-af9b-c...

I used this GitHub repo as context and Claude Opus 4 to create this artifact

jmsdnns•2h ago

depending on what you're using the synthetic data for, it is sometimes called distillation. here is a robust example from some upenn students: https://datadreamer.dev/

reedlaw•21m ago

"Dataset" connotes training data, but this seems to generate sample data, maybe for testing an application. Is there any use for synthetic datasets in ML?

smcleod•9m ago

This is a bit confusing, I sort of expected it to be a bit like Kiln https://github.com/Kiln-AI/Kiln to generate datasets for AI, but it looks like the outputs are more just data / files than datasets?

A.I. Is Homogenizing Our Thoughts

Launch HN: Issen (YC F24) – Personal AI language tutor

Google DeepMind Releases AlphaGenome

Memory Safety Is Merely Table Stakes

Starcloud says 1 launch, $8M but ISS tech says 17 launches, $850M+

Robots that learn

Kea 3.0, our first LTS version

A Review of Aerospike Nozzles: Current Trends in Aerospace Applications

"Why is the Rust compiler so slow?"

Matrix v1.15

SigNoz (YC W21, Open Source Datadog) Is Hiring DevRel Engineers (Remote)(US)

Introducing Gemma 3n

The time is right for a DOM templating API

Apple announces sweeping App Store changes in the EU

Snow - Classic Macintosh emulator

Show HN: I built an AI dataset generator

Shifts in diatom and dinoflagellate biomass in the North Atlantic over 6 decades

Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine

A new pyramid-like shape always lands the same side up

Typr – TUI typing test with a word selection algorithm inspired by keybr

Puerto Rico's Solar Microgrids Beat Blackout

Lateralized sleeping positions in domestic cats

Show HN: Magnitude – open-source AI browser automation framework

-2000 Lines of code (2004)

The Business of Betting on Catastrophe

US economy shrank 0.5% in the first quarter, worse than earlier estimates

Tiny orange beads found by Apollo astronauts reveal Moon's explosive past

FLUX.1 Kontext [Dev] – Open Weights for Image Editing

Access BMC UART on Supermicro X11SSH

Muvera: Making multi-vector retrieval as fast as single-vector search

Show HN: I built an AI dataset generator

Comments

A.I. Is Homogenizing Our Thoughts

Launch HN: Issen (YC F24) – Personal AI language tutor

Google DeepMind Releases AlphaGenome

Memory Safety Is Merely Table Stakes

Starcloud says 1 launch, $8M but ISS tech says 17 launches, $850M+

Robots that learn

Kea 3.0, our first LTS version

A Review of Aerospike Nozzles: Current Trends in Aerospace Applications

"Why is the Rust compiler so slow?"

Matrix v1.15

SigNoz (YC W21, Open Source Datadog) Is Hiring DevRel Engineers (Remote)(US)

Introducing Gemma 3n

The time is right for a DOM templating API

Apple announces sweeping App Store changes in the EU

Snow - Classic Macintosh emulator

Show HN: I built an AI dataset generator

Shifts in diatom and dinoflagellate biomass in the North Atlantic over 6 decades

Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine

A new pyramid-like shape always lands the same side up

Typr – TUI typing test with a word selection algorithm inspired by keybr

Puerto Rico's Solar Microgrids Beat Blackout

Lateralized sleeping positions in domestic cats

Show HN: Magnitude – open-source AI browser automation framework

-2000 Lines of code (2004)

The Business of Betting on Catastrophe

US economy shrank 0.5% in the first quarter, worse than earlier estimates

Tiny orange beads found by Apollo astronauts reveal Moon's explosive past

FLUX.1 Kontext [Dev] – Open Weights for Image Editing

Access BMC UART on Supermicro X11SSH

Muvera: Making multi-vector retrieval as fast as single-vector search