frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

Open in hackernews

Show HN: I built an AI dataset generator

https://github.com/metabase/dataset-generator
74•matthewhefferon•4h ago

Comments

matthewhefferon•4h ago
I was tired of digging through Kaggle and writing prompts over and over just to get fake data for dashboards and demos. So I built a little tool to help me out.

It uses GPT-4o to generate a detailed schema and business rules based on a few dropdowns (like business type, schema structure, and row count). Then Faker fills in the rows using those rules, which keeps it fast and cheap.

You can preview the data, export as CSV or SQL, or spin up Metabase with one click to explore the data. It’s open-source, still in early stages, but wanted to share, get feedback and see how you'd improve it.

thenaturalist•2h ago
Congrats, thanks for shipping and open sourcing this!

Cool to see Metabase is enabling contributions to the ecosystem this way! :)

margotli•3h ago
Feels like a useful tool for anyone learning analytics or just needing sample data to test with.
mritchie712•3h ago
I use this prompt to spin up demos for customers at https://www.definite.app/:

    @Web Do some research on https://somecompany.com and write up a detailed overview of what the company does. What might their database schema look like?

    I need you to build a mock database for them in duckdb for a demo

Then:

    Create a uv project and write a python script to add demo data. Use Faker.

    @Web research how many customers they have. Make the database to appropriate scale.

Only takes a few minutes in Cursor, should work just as well in Claude Code. It works really well for the companies core business, but I still need to create one to populate 3rd party sources (e.g. Stripe, Salesforce, Hubspot, etc.).
matthewhefferon•2h ago
Cool, I don’t do customer-specific demos, but I like this idea. I might add this use case as an option. Thanks for sharing!
b0a04gl•2h ago
seen this pattern a before too. faker holds shape without flow. real tables come from actions : retry, decline, manual review, all that. you just set col types, you might miss why the row even happened. gen needs to simulate behavior, not format
matthewhefferon•2h ago
That’s a solid callout, appreciate you pointing it out. I’ll definitely dig into that more.
ajd555•2h ago
Was looking for this exact comment. I completely agree with this method, especially if you're testing an entire flow, and not just a UI tool. You want to test the service that interfaces between the API and the dabatase.

I've been writing custom simulation agents (just simple go programs) that simulate different users of my system. I can scale appropriately and see test data flow in. If metabase could generate these simulation agents based on a schema and some instructions, now that would be quite neat! Good job on this first version of the tool, though!

tomrod•2h ago
The best synthetic data are those that capture ingestion and action, instead of just relationship.

Relationship is important, but your data structure might capture a virtually infinite number of unexpected behaviors that you would preferably call errors or bugs.

jasonthorsness•1h ago
AI is really good at this sort of thing; I've been using an LLM with Faker for some time to load data for demos into SingleStore: https://github.com/jasonthorsness/loadit
paxys•1h ago
Feature request - make the URL for the OpenAI API configurable. That way one can swap it out with Anthropic or any other LLM provider of their choice that provides an OpenAI-compatible API.
matthewhefferon•1h ago
I was actually thinking about this very feature in the shower this morning :)
wiradikusuma•1h ago
"Stack: OpenAI API (GPT-4o for data generation)" -- I wonder if someday we'll have a generic API like how it's done in Java (e.g., Servlet API implemented by Tomcat, JBoss etc), so everyone can use their favorite LLM instead of having to register each provider like streaming services e.g. Disney+, Netflix, etc.
MattSayar•1h ago
I used Anthropic's new Claude API integration with artifacts to make a probably-worse version that you can play with (after logging in of course).

https://claude.ai/public/artifacts/eb7d8256-6d21-4c85-af9b-c...

I used this GitHub repo as context and Claude Opus 4 to create this artifact

Show HN: Zizmor, static analysis for GitHub Actions

https://docs.zizmor.sh/
1•woodruffw•1m ago•0 comments

Agents will do your most time-consuming, deepest work for you – in minutes

https://www.cbinsights.com/team-of-agents/
1•jsaltzman20•1m ago•1 comments

Quiero que inteneten hackear mi sitio

https://30b2-2802-8010-9926-1a00-75f0-56f1-5c2b-79c4.ngrok-free.app/login.php
1•elhackloco•1m ago•1 comments

Illusion of Thinking Exploration Tool

https://github.com/NeurometricAI/illusion-of-thinking
1•robmay•1m ago•1 comments

Peter Thiel and the Antichrist

https://www.nytimes.com/2025/06/26/opinion/peter-thiel-antichrist-ross-douthat.html
1•sirodoht•2m ago•1 comments

Extreme heat can impact infrastructure

https://www.axios.com/2025/06/26/extreme-heat-infrastructure
1•toomuchtodo•2m ago•0 comments

Group of high-profile authors sue Microsoft over use of books in AI training

https://www.theguardian.com/technology/2025/jun/26/microsoft-ai-authors-lawsuit
1•chrisjj•3m ago•0 comments

Critical Hurricane Forecast Tool Abruptly Terminated

https://michaelrlowry.substack.com/p/critical-hurricane-forecast-tool
1•speckx•5m ago•0 comments

Show HN: An MCP server for every GitHub repo

https://github.com/JudiniLabs/mcp-code-graph
1•aracena•9m ago•0 comments

What Is OpenTelemetry?

https://signoz.io/blog/what-is-opentelemetry/
1•thunderbong•9m ago•0 comments

Rogue jumping genes can spur Alzheimer's, ALS

https://knowablemagazine.org/content/article/health-disease/2025/awakened-viral-jumping-genes-role-in-alzheimers-als
1•gmays•9m ago•0 comments

How to Properly Use Polystate?

https://github.com/sdzx-1/ray-game/blob/master/How-to-Use-Polystate.md
1•goless•10m ago•1 comments

Routle

https://sfstandard.com/games/routle/
1•michaefe•10m ago•0 comments

Voice App to Launch and Manage Your Paid Ad Campaigns

https://adsuiteai.com/staging/2796/
1•vinaygpandit•14m ago•0 comments

Corel Vector is being discontinued soon

https://www.coreldraw.com/en/product/vector/
1•marcodiego•16m ago•0 comments

Orange Me2eets: We made an E2E encrypted video calling app and it was easy

https://blog.cloudflare.com/orange-me2eets-we-made-an-end-to-end-encrypted-video-calling-app-and-it-was/
2•cantaloupe•17m ago•0 comments

Jane Street's Sneaky Retention Tactic

https://www.economist.com/finance-and-economics/2025/06/26/jane-streets-sneaky-retention-tactic
4•actinium226•21m ago•3 comments

New Zine: The Secret Rules of the Terminal

https://jvns.ca/blog/2025/06/24/new-zine--the-secret-rules-of-the-terminal/
1•Bogdanp•22m ago•0 comments

Lateralized sleeping positions in domestic cats

https://www.cell.com/current-biology/fulltext/S0960-9822(25)00507-X?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS096098222500507X%3Fshowall%3Dtrue
11•EvgeniyZh•22m ago•5 comments

Tangled titles are contributing to rural America's housing crisis

https://theconversation.com/family-homesteads-with-tangled-titles-are-contributing-to-rural-americas-housing-crisis-254679
2•PaulHoule•24m ago•0 comments

Fortnite cheater must pay $175K to Epic, banned forever from playing Fortnite

https://twitter.com/FNCompetitive/status/1937966876790751232
2•bundie•24m ago•0 comments

Ubuntu Maker Canonical Generated Nearly $300M in Revenue Last Year

https://www.phoronix.com/news/Canonical-2024-Annual-Report
1•mikece•24m ago•0 comments

Is it legal to use copyrighted works to train LLMs?

https://www.spinellis.gr/blog/20250626/
2•cratermoon•24m ago•0 comments

Math Enthusiasts Unite to Have Rover Calculate Pi on the Moon

https://www.scientificamerican.com/article/math-enthusiasts-unite-to-have-rover-calculate-pi-on-the-moon/
2•almost-exactly•27m ago•0 comments

The Year of Quantum: From concept to reality in 2025

https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-year-of-quantum-from-concept-to-reality-in-2025
2•donutloop•29m ago•0 comments

Show HN: Magnitude – open-source AI browser automation framework

https://github.com/magnitudedev/magnitude
7•anerli•30m ago•0 comments

A Message Drifts: Explorations in meanings from randomness in generative art

https://www.amygoodchild.com/blog/a-message-drifts
2•bertwagner•30m ago•0 comments

Will AI Slop Kill the Internet? [video]

https://www.youtube.com/watch?v=NuIMZBseAOM
3•saltysalt•30m ago•2 comments

U.S. Lawmakers Urge Action on Cybersecurity in Face of Quantum Threat

https://thequantuminsider.com/2025/06/26/u-s-lawmakers-urge-urgent-action-on-cybersecurity-in-face-of-quantum-threat/
1•donutloop•31m ago•0 comments

FLUX.1 Kontext [Dev] Inference and Training

https://blog.fal.ai/announcing-flux-1-kontext-dev-inference-training/
3•amrrs•32m ago•0 comments