Show HN: I built a tool to version control datasets (like Git, but for data)

https://shodata.com
2•aliefe04•7h ago
Hey everyone,

As a founder, I've been frustrated for years with how my team manages datasets for ML. It always ends up as data_final_v3_fixed.csv in an S3 bucket or a massive Git LFS file that nobody understands.

So, I built Shodata. It’s an open platform (like GitHub) but built specifically for dataset workflows.

The core idea is simple: you upload a file. When you upload a new file with the same name, a new version (v2, v3, etc.) is created automatically. Every dataset gets a discussion board and a complete history, and every version gets clean previews and statistics.
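
A minimal sketch of the name-based versioning described above, under stated assumptions: this is not Shodata's actual implementation, and the `DatasetStore` and `Version` names, the in-memory storage, and the content hash are all hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int    # 1 for the first upload, 2 for the next, and so on
    sha256: str    # content hash (an assumption, useful for spotting identical re-uploads)
    content: bytes


@dataclass
class DatasetStore:
    """Hypothetical in-memory store that versions files by name."""

    versions_by_name: dict = field(default_factory=dict)

    def upload(self, name: str, content: bytes) -> Version:
        # Uploading under an existing name appends a new version
        # instead of overwriting the previous one.
        versions = self.versions_by_name.setdefault(name, [])
        version = Version(
            number=len(versions) + 1,
            sha256=hashlib.sha256(content).hexdigest(),
            content=content,
        )
        versions.append(version)
        return version

    def history(self, name: str) -> list:
        # Version numbers in upload order; empty if the name is unknown.
        return [v.number for v in self.versions_by_name.get(name, [])]

    def latest(self, name: str) -> Version:
        return self.versions_by_name[name][-1]
```

Uploading `llm-hallucinations.csv` twice to such a store would yield versions 1 and 2 under the same name, while a file with a different name would start its own history at version 1.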

To show how it works, I seeded it with a dataset I'm tracking: a log of LLM hallucinations. When I find new ones, I just upload the new file and it versions the dataset.

The platform is an MVP. It has a generous free tier (includes 3 personal private datasets & 10GB storage) and a single Pro plan that unlocks team/organization features (like Org creation and shared private datasets).

I’m looking for feedback from fellow engineers and ML folks on the workflow. Is this useful? What’s missing?

You can check out the platform here: https://shodata.com

And the LLM log dataset: https://shodata.com/shodata/llm-hallucinations

Comments

vmykyt•6h ago
That is a good start.

In the (big) data world, the idea of data versioning has been floating around for decades. The current consensus is to treat information about your files, which is effectively data itself, as metadata.

That said, while building your own solution is always good, you might look at other solutions, such as Apache Iceberg [1] (free and open source).

In particular, they have the concept of a Catalog.

From the documentation it may look like adopting Iceberg requires a lot of other moving parts, but in reality you can start from Docker Compose [2] and then manage your data using plain old SQL syntax.

It may look like overkill for your specific needs, but it is still a good source to steal some ideas from.

P.S. There are plenty of such systems in various form factors.

[1] https://iceberg.apache.org/
[2] https://iceberg.apache.org/spark-quickstart/

aliefe04•39m ago
Thanks for the feedback!

Shodata aims to solve a different problem: lightweight versioning for small-to-medium datasets with zero infrastructure setup. Think "GitHub for CSV files" rather than a full data lakehouse. Iceberg is excellent for production data lakes with Spark/Trino, but it requires running catalogs, configuring S3/Glue, and knowing SQL. For many ML teams working with <100GB datasets, that's overkill. Our sweet spot is teams who need:

- Drag-and-drop versioning (no CLI/SDK required)
- Instant previews and diff visualization
- Collaboration features (comments, access control)
- Public sharing (like the LLM hallucinations dataset)
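
As a rough illustration of what a diff between two versions of a CSV dataset could compute, here is a minimal sketch. It is an assumption for illustration only, not how Shodata builds its diffs: the `diff_csv` helper is hypothetical, and it compares whole rows as sets, so duplicate rows and cell-level changes are not tracked.

```python
import csv
import io


def diff_csv(old_text: str, new_text: str) -> dict:
    """Return the rows added and removed between two CSV versions.

    The first row of each version is treated as a header and skipped.
    """

    def data_rows(text: str) -> list:
        reader = csv.reader(io.StringIO(text))
        next(reader, None)  # skip the header row, if any
        return [tuple(row) for row in reader]

    old_rows = data_rows(old_text)
    new_rows = data_rows(new_text)
    old_set, new_set = set(old_rows), set(new_rows)
    return {
        # Rows present only in the new version, in file order.
        "added": [row for row in new_rows if row not in old_set],
        # Rows present only in the old version, in file order.
        "removed": [row for row in old_rows if row not in new_set],
    }
```

A changed row shows up here as one removal plus one addition; a real diff view would likely key rows on an identifier column to show edits in place.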

I'll definitely look at Iceberg's catalog design for inspiration on metadata management. Appreciate the pointer!