Show HN: I built a tool to version control datasets (like Git, but for data)

https://shodata.com
2•aliefe04•1d ago
Hey everyone,

As a founder, I've been frustrated for years with how my team manages datasets for ML. It always ends up as data_final_v3_fixed.csv in an S3 bucket or a massive Git LFS file that nobody understands.

So, I built Shodata. It’s an open platform (like GitHub) but built specifically for dataset workflows.

The core idea is simple: you upload a file. A new version (v2, v3, etc.) is automatically created when you upload a new file with the same name. You receive a discussion board on every dataset, a complete history, and clean previews and statistics for every version.
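A minimal sketch of the "same name means new version" rule described above. It is not Shodata's actual implementation: the `DatasetStore` class, its `upload` method, and the file name are hypothetical, and the content hash is an assumption added only to show what a version record might carry.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int   # 1 for the first upload, 2 for the next, and so on
    sha256: str   # content hash (an assumption, not a documented Shodata field)
    size: int     # size of the uploaded content in bytes


@dataclass
class DatasetStore:
    """Hypothetical in-memory store: one version history per file name."""
    history: dict[str, list[Version]] = field(default_factory=dict)

    def upload(self, name: str, content: bytes) -> Version:
        # Re-uploading an existing name appends to its history
        # instead of overwriting the previous file.
        versions = self.history.setdefault(name, [])
        version = Version(
            number=len(versions) + 1,
            sha256=hashlib.sha256(content).hexdigest(),
            size=len(content),
        )
        versions.append(version)
        return version


store = DatasetStore()
v1 = store.upload("hallucinations.csv", b"id,prompt\n1,foo\n")
v2 = store.upload("hallucinations.csv", b"id,prompt\n1,foo\n2,bar\n")
print(v1.number, v2.number)  # 1 2
```

Keying the history on the file name is what makes versioning automatic here: the uploader never has to pick a version number.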

To show how it works, I seeded it with a dataset I'm tracking: a log of LLM hallucinations. When I find new ones, I just upload the new file and it versions the dataset.

The platform is an MVP. It has a generous free tier (includes 3 personal private datasets & 10GB storage) and a single Pro plan that unlocks team/organization features (like Org creation and shared private datasets).

I’m looking for feedback from fellow engineers and ML folks on the workflow. Is this useful? What’s missing?

You can check out the platform here: https://shodata.com

And the LLM log dataset: https://shodata.com/shodata/llm-hallucinations

Comments

vmykyt•1d ago
That is a good start.

In the (big-)data world, the idea of data versioning has been floating around for decades. The current consensus is to treat information about your files, which is effectively data itself, as metadata.

That said, while building your own solution is always good, you could also look at other solutions, such as Apache Iceberg (free and open source) [1].

In particular, they have the concept of a Catalog.

While the documentation may make it look like adopting Iceberg requires a lot of other moving parts, in reality you can start from Docker Compose [2] and then manage your data using plain old SQL syntax.

It may look like overkill for your specific needs, but it is still a good source to steal some ideas from.

P.S. There are plenty of such systems in various form factors.

[1] https://iceberg.apache.org/
[2] https://iceberg.apache.org/spark-quickstart/

aliefe04•18h ago
Thanks for the feedback!

Shodata aims to solve a different problem: lightweight versioning for small-to-medium datasets with zero infrastructure setup. Think "GitHub for CSV files" rather than a full data lakehouse. Iceberg is excellent for production data lakes with Spark/Trino, but it requires running catalogs, configuring S3/Glue, and SQL knowledge. For many ML teams working with <100GB datasets, that's overkill. Our sweet spot is teams who need:

- Drag-and-drop versioning (no CLI/SDK required)
- Instant previews and diff visualization
- Collaboration features (comments, access control)
- Public sharing (like the LLM hallucinations dataset)
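To make the "diff" item above concrete, here is a rough sketch of a row-level diff between two CSV versions. It is an illustration under stated assumptions, not how Shodata computes diffs: the `diff_csv` helper, the `id` key column, and the sample rows are all hypothetical.

```python
import csv
import io


def diff_csv(old_text: str, new_text: str, key: str) -> dict[str, list[str]]:
    """Compare two CSV versions, matching rows on the `key` column."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_text))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_text))}
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in old if k in new and old[k] != new[k]),
    }


# Hypothetical sample data: row 2 is edited and row 3 is new in the second version.
v1 = "id,model,claim\n1,a,x\n2,b,y\n"
v2 = "id,model,claim\n1,a,x\n2,b,z\n3,c,w\n"
result = diff_csv(v1, v2, key="id")
print(result)  # {'added': ['3'], 'removed': [], 'changed': ['2']}
```

Matching on a key column rather than on line position keeps the diff stable when rows are reordered between uploads, at the cost of requiring a unique identifier in every row.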

I'll definitely look at Iceberg's catalog design for inspiration on metadata management. Appreciate the pointer!

Show HN: a Rust ray tracer that runs on any GPU – even in the browser

https://github.com/tchauffi/rust-rasterizer
6•tchauffi•48m ago•0 comments

Show HN: An agent for every website, for agentic visitors

https://web.ai/
2•kjok•45m ago•1 comments

Show HN: Centia.io – Open PostgreSQL/PostGIS back end for developers

https://centia.io/
16•mhoegh•1w ago•4 comments

Show HN: I built an AI that generates full-stack apps in 30 seconds

7•TulioKBR•2h ago•10 comments

Show HN: Serie – A rich Git commit graph in your terminal

https://github.com/lusingander/serie
4•lusingander•5h ago•0 comments

Show HN: Safebox: Open-source framework for managing self-hosted apps (Beta)

2•drebora•4h ago•0 comments

Show HN: Anki-LLM – Bulk process and generate Anki flashcards with LLMs

https://github.com/raine/anki-llm
52•rane•1d ago•21 comments

Show HN: Strange Attractors

https://blog.shashanktomar.com/posts/strange-attractors
779•shashanktomar•2d ago•75 comments

Show HN: Why write code if the LLM can just do the thing? (web app experiment)

https://github.com/samrolken/nokode
427•samrolken•1d ago•309 comments

Show HN: Give your coding agents the ability to message each other

https://github.com/Dicklesworthstone/mcp_agent_mail
11•eigenvalue•16h ago•1 comments

Show HN: Pipelex – Declarative language for repeatable AI workflows

https://github.com/Pipelex/pipelex
120•lchoquel•5d ago•26 comments

Show HN: DeepFake – Free AI Face Swap Online

https://deepfakefusion.com
3•epistemovault•10h ago•3 comments

Show HN: PyTogether, open-source lightweight real-time Python IDE for teachers

https://pytogether.org/
3•JawadR•11h ago•0 comments

Show HN: goilerplate – A SaaS boilerplate for Go and templ and Htmx

https://goilerplate.com/
3•axadrn•4h ago•0 comments

Show HN: In a single HTML file, an app to encourage my children to invest

https://roberdam.com/en/dinversiones.html
247•roberdam•4d ago•434 comments

Show HN: Duper – The Format That's Super

https://duper.dev.br/
30•epiceric•1d ago•14 comments

Show HN: I built a Raspberry Pi webcam to train my dog (using Claude)

https://github.com/harshibar/yogi-cam
5•hyerramreddy•14h ago•0 comments

Show HN: A simple drag and drop tool to document and label fuse boxes

https://github.com/alexadam/fuse-box-labels
25•eg312•3d ago•6 comments

Show HN: Quibbler – A critic for your coding agent that learns what you want

https://github.com/fulcrumresearch/quibbler
114•etherio•3d ago•27 comments

Show HN: GT: experimental multiplexed distributed tensor framework

https://github.com/bwasti/gt
3•brrrrrm•15h ago•0 comments

Show HN: Giving AI to your favorite whiteboard, Excalidraw

https://www.opencanvas.studio
5•winzamark12•16h ago•0 comments

Show HN: Learn German with Games

https://www.learngermanwithgames.com/
125•predictand•5d ago•106 comments

Show HN: KeyLeak Detector – Scan websites for exposed API keys and secrets

https://github.com/Amal-David/keyleak-detector
26•amaldavid•1d ago•7 comments

Show HN: Chatolia – create, train and deploy your own AI agents

https://www.chatolia.com
4•blurayfin•17h ago•1 comments

Show HN: Run a GitHub Actions step in a gVisor sandbox

https://github.com/geomys/sandboxed-step
85•FiloSottile•1w ago•3 comments

Show HN: UnisonDB – Log-native KV database that replicates like a message bus

https://unisondb.io
16•ankuranand•1d ago•0 comments

Show HN: I built a smart blocker after destroying my dopamine baseline

https://chromewebstore.google.com/detail/memento-mori/fhpkanfhfplcfpmklplbbenimajbahim
17•Rahul07oii•20h ago•6 comments

Show HN: Front End Fuzzy and Substring and Prefix Search

https://github.com/m31coding/fuzzy-search
56•kmschaal•5d ago•4 comments

Show HN: Auto-Adjust Keyboard and LCD Brightness via Ambient Light Sensor[Linux]

https://github.com/donjajo/als-led-backlight
3•donjajo•1d ago•0 comments

Show HN: I made a heatmap diff viewer for code reviews

https://0github.com
263•lawrencechen•4d ago•68 comments