Show HN: I built a tool to version control datasets (like Git, but for data)

2•aliefe04•3mo ago

Hey everyone,

As a founder, I've been frustrated for years with how my team manages datasets for ML. It always ends up as data_final_v3_fixed.csv in an S3 bucket or a massive Git LFS file that nobody understands.

So, I built Shodata. It’s an open platform (like GitHub) but built specifically for dataset workflows.

The core idea is simple: you upload a file. A new version (v2, v3, etc.) is automatically created when you upload a new file with the same name. You receive a discussion board on every dataset, a complete history, and clean previews and statistics for every version.

To show how it works, I seeded it with a dataset I'm tracking: a log of LLM hallucinations. When I find new ones, I just upload the new file and it versions the dataset.

The platform is an MVP. It has a generous free tier (includes 3 personal private datasets & 10GB storage) and a single Pro plan that unlocks team/organization features (like Org creation and shared private datasets).

I’m looking for feedback from fellow engineers and ML folks on the workflow. Is this useful? What’s missing?

You can check out the platform here: https://shodata.com

And the LLM log dataset: https://shodata.com/shodata/llm-hallucinations

Comments

vmykyt•3mo ago

That is good start

In (big-)data area the idea of data versioning is flying around for decades. As a current consensus for now is to treat information about your files, which is effectively a data, as a metadata.

Said this while trying to create your own solution is always good, maybe you could look at another solutions, like Apache Iceberg (free and open source).

In particular they have concept of Catalog

While from documentation it may look like to adopt Iceberg you need a lot of other moving part, in reality you can start from docker compose [2] and then manage your data using plain old sql syntax.

It may look lake overkill for your specific needs, still good source to steal some ideas.

P.S. there are plenty of such systems in various form-factor

[1] https://iceberg.apache.org/ [2] https://iceberg.apache.org/spark-quickstart/

aliefe04•3mo ago

Thanks for the feedback!

Shodata aims to solve a different problem: lightweight versioning for small-to-medium datasets with zero infrastructure setup. Think "GitHub for CSV files" rather than a full data lakehouse. Iceberg is excellent for production data lakes with Spark/Trino, but it requires running catalogs, configuring S3/Glue, and SQL knowledge. For many ML teams working with <100GB datasets, that's overkill. Our sweet spot is teams who need:

Drag-and-drop versioning (no CLI/SDK required) Instant previews and diff visualization Collaboration features (comments, access control) Public sharing (like the LLM hallucinations dataset)

I'll definitely look at Iceberg's catalog design for inspiration on metadata management. Appreciate the pointer!

EVs Are a Failed Experiment

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

CCC (Claude's C Compiler) on Compiler Explorer

Homeland Security Spying on Reddit Users

Actors with Tokio (2021)

Can graph neural networks for biology realistically run on edge devices?

Deeper into the shareing of one air conditioner for 2 rooms

Weatherman introduces fruit-based authentication system to combat deep fakes

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

A Curated List of ML System Design Case Studies

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

Open Problems in Mechanistic Interpretability

Bye Bye Humanity: The Potential AMOC Collapse

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

Digital Iris [video]

Essential CDN: The CDN that lets you do more than JavaScript

They Hijacked Our Tech [video]

Vouch

HRL Labs in Malibu laying off 1/3 of their workforce

Show HN: High-performance bidirectional list for React, React Native, and Vue

Show HN: I built a Mac screen recorder Recap.Studio

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

Vectors and HNSW for Dummies

Sanskrit AI beats CleanRL SOTA by 125%

'Washington Post' CEO resigns after going AWOL during job cuts

Claude Opus 4.6 Fast Mode: 2.5× faster, ~6× more expensive

TSMC to produce 3-nanometer chips in Japan

Quantization-Aware Distillation

List of Musical Genres