Scraping was easy. Cleaning the data was the hard part

https://rangelead.com/
1•RangeLead•2h ago

Comments

RangeLead•2h ago
I’ve built and maintained a pipeline that collects public business listings at scale.

Scraping itself was straightforward compared to everything that followed.

Once the data volume grows, most of the work shifts to:

- handling inconsistent categories
- deduplication across sources
- outdated or closed businesses
- missing or misleading fields
- deciding what “usable” actually means
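As a rough illustration of two of these steps (category normalization and cross-source deduplication), here is a minimal Python sketch. It is not the pipeline described in this post. The field names (`name`, `phone`, `category`), the `CATEGORY_ALIASES` mapping, and the choice of name plus phone as the match key are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical alias table; a real one would be derived from the actual data.
CATEGORY_ALIASES = {
    "restaurants": "restaurant",
    "eatery": "restaurant",
    "cafe": "coffee shop",
    "coffee": "coffee shop",
}

def normalize_text(value):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    value = unicodedata.normalize("NFKD", value or "")
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    value = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    return " ".join(value.split())

def normalize_category(raw):
    """Map a raw category label onto a canonical one when an alias is known."""
    key = normalize_text(raw)
    return CATEGORY_ALIASES.get(key, key)

def dedup_key(record):
    """Assumed match key: normalized name plus digits-only phone number."""
    phone = re.sub(r"\D", "", record.get("phone") or "")
    return (normalize_text(record.get("name")), phone)

def deduplicate(records):
    """Keep the first record per key and fill its empty fields from later duplicates."""
    merged = {}
    unkeyed = []
    for rec in records:
        rec = dict(rec, category=normalize_category(rec.get("category")))
        key = dedup_key(rec)
        if not any(key):
            unkeyed.append(rec)  # nothing to match on, so keep it as-is
            continue
        if key not in merged:
            merged[key] = rec
            continue
        for field, value in rec.items():
            if not merged[key].get(field) and value:
                merged[key][field] = value
    return list(merged.values()) + unkeyed
```

An exact-match key like this only catches duplicates that differ in formatting. Listings that differ in spelling or address would need fuzzy matching, which this sketch does not attempt.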

Many datasets look large but fall apart when you try to use them for anything practical.
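One way to make that gap visible is to measure how much of a dataset survives a usability check. The sketch below assumes "usable" means that a fixed set of required fields is non-blank. `REQUIRED_FIELDS` is a hypothetical definition, and a real one would depend on what the data is used for.

```python
# Assumed definition of "usable": every one of these fields is non-blank.
REQUIRED_FIELDS = ("name", "address", "phone")

def is_usable(record, required=REQUIRED_FIELDS):
    """Return True only if every required field is present and non-blank."""
    return all(str(record.get(field) or "").strip() for field in required)

def usable_fraction(records, required=REQUIRED_FIELDS):
    """Share of records that pass the check (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(is_usable(r, required) for r in records) / len(records)
```

Tracking this fraction per source, rather than the raw row count, gives a more honest picture of how much of a scrape can actually be used.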

This post breaks down where most scraping projects fail once they move beyond small experiments, and what actually takes time when you want clean output.

Would be interested to hear how others here approached data validation and cleanup at scale.
