Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To

https://github.com/cnpj-chat/cnpj-data-pipeline
2•caiopizzol•3h ago
Last year, I needed to find all software companies in São Paulo for a project. The good news: Brazil publishes all company registrations as open data at dados.gov.br. The bad news: it's 85GB of ISO-8859-1 encoded CSVs with semicolon delimiters, decimal commas, and dates like "00000000" meaning NULL. My laptop crashed after 4 hours trying to import just one file.
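To make those quirks concrete, here is a minimal Python sketch of reading one of these files. It is not the pipeline's actual code. It assumes dates are laid out as YYYYMMDD and that numbers have no thousands separator; the helper names (`parse_date`, `parse_decimal`, `read_rows`) are made up for illustration.

```python
import csv
import io
from datetime import date
from decimal import Decimal
from typing import Iterator, List, Optional


def parse_date(raw: str) -> Optional[date]:
    """Turn 'YYYYMMDD' (assumed layout) into a date; '' and '00000000' mean NULL."""
    raw = raw.strip()
    if not raw or raw == "00000000":
        return None
    return date(int(raw[0:4]), int(raw[4:6]), int(raw[6:8]))


def parse_decimal(raw: str) -> Optional[Decimal]:
    """Turn a decimal-comma number such as '1000,50' into Decimal('1000.50').

    Assumes there is no thousands separator in the field.
    """
    raw = raw.strip()
    if not raw:
        return None
    return Decimal(raw.replace(",", "."))


def read_rows(data: bytes) -> Iterator[List[str]]:
    """Decode ISO-8859-1 bytes and split each line on semicolons."""
    text = io.StringIO(data.decode("iso-8859-1"), newline="")
    yield from csv.reader(text, delimiter=";", quotechar='"')
```

For a real 85GB import you would wrap an open file in `io.TextIOWrapper(f, encoding="iso-8859-1", newline="")` instead of decoding everything in memory, so rows stream through one at a time.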

So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline

THE PROBLEM NOBODY TALKS ABOUT

Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:

- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
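Two of these are easy to sketch in Python. This is an illustration under assumptions, not the pipeline's code: `normalize_cnpj` and `repair_mojibake` are hypothetical names, and the repair assumes "double-encoded" means UTF-8 bytes that were decoded as Latin-1 exactly once.

```python
import re


def normalize_cnpj(raw: str) -> str:
    """Strip punctuation so both CNPJ spellings compare equal."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 14:
        raise ValueError(f"expected 14 digits, got {len(digits)}: {raw!r}")
    return digits


def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been read as Latin-1.

    If the text does not round-trip cleanly it is returned unchanged,
    so correctly decoded strings pass through untouched.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Usage: `normalize_cnpj("12.345.678/0001-90")` and `normalize_cnpj("12345678000190")` return the same string, so either form can be used as a join key. The mojibake repair is deliberately conservative: anything that is not valid UTF-8 after re-encoding is left alone rather than guessed at.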

WHAT YOU CAN NOW DO IN SQL

Find all fintechs founded after 2020 in São Paulo:

SELECT COUNT(*)
FROM estabelecimentos e
JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
WHERE e.uf = 'SP'
  AND e.cnae_fiscal_principal LIKE '64%'
  AND e.data_inicio_atividade > '2020-01-01'
  AND emp.porte IN ('01', '03');

Result: 8,426 companies (as of Jun 2025)

SURPRISING THINGS I FOUND

1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.

2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.

3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they're all government entities from before the classification system existed.

4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.

5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.

TECHNICAL BITS

The pipeline:

- Auto-detects your RAM and adapts strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that official files omit
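The first two points can be sketched roughly as follows. This is a hedged illustration, not the repository's implementation: the function names are hypothetical, the "batched" middle tier for 8-32GB is an assumption (the post only names the two extremes), and the COPY call in the comment assumes psycopg2 as the driver.

```python
import csv
import io
from typing import Iterable, Sequence


def choose_strategy(ram_gb: float) -> str:
    """Pick a processing mode from available RAM, using the post's thresholds."""
    if ram_gb < 8:
        return "streaming"
    if ram_gb > 32:
        return "parallel"
    return "batched"  # assumption: the post does not name the middle tier


def to_copy_buffer(rows: Iterable[Sequence[object]]) -> io.StringIO:
    """Serialize rows as CSV in memory, ready to feed to COPY ... FROM STDIN.

    None is written as an empty unquoted field, which COPY in CSV format
    reads as NULL by default.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf


# With psycopg2 (assumed driver), the buffer could then be loaded in one call:
#   cur.copy_expert("COPY estabelecimentos FROM STDIN WITH (FORMAT csv)", buf)
```

The reason COPY tends to win is that the server parses one stream instead of planning and executing a statement per row; building the buffer in chunks keeps memory bounded on the small-RAM path.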

Processing 60M companies:

- VPS (4GB RAM): ~8 hours
- Desktop (16GB): ~2 hours
- Server (64GB): ~1 hour

THE CODE

It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline

One command setup:

docker-compose --profile postgres up --build

Or if you prefer Python:

python setup.py  # Interactive configuration
python main.py   # Start processing

WHY OPEN SOURCE THIS?

I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent 2 months building ETL that processes 10% of the data before crashing.

The Brazilian tech ecosystem loses tons of hours reinventing this wheel. That's time that could be spent building actual products.

COMMUNITY RESPONSE

I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."

QUESTIONS FOR HN

1. What other government datasets are this painful? I'm thinking of tackling more.

2. For those who've worked with government data - what's your worst encoding/format horror story?

3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.

The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.
