Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To

https://github.com/cnpj-chat/cnpj-data-pipeline
4•caiopizzol•7mo ago
Last year, I needed to find all software companies in São Paulo for a project. The good news: Brazil publishes all company registrations as open data at dados.gov.br. The bad news: it's 85GB of ISO-8859-1 encoded CSVs with semicolon delimiters, decimal commas, and dates like "00000000" meaning NULL. My laptop crashed after 4 hours trying to import just one file.

So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline
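For a sense of what reading these files involves, here is a minimal sketch using only the Python standard library. It is not code from the pipeline; the sample row, its column layout, and the helper names `read_rows` and `parse_decimal_comma` are made up for illustration, and it assumes values carry no thousands separators.

```python
import csv
import io
from decimal import Decimal


def read_rows(raw: bytes):
    """Yield rows from semicolon-delimited, ISO-8859-1 encoded CSV bytes."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="iso-8859-1", newline="")
    yield from csv.reader(text, delimiter=";", quotechar='"')


def parse_decimal_comma(value: str) -> Decimal:
    """Convert a decimal-comma string such as '1000,50' to a Decimal.

    Assumption: the value has no thousands separators.
    """
    return Decimal(value.replace(",", "."))


# Hypothetical sample row; real files have more columns.
SAMPLE = '"12345678";"PADARIA SÃO JOÃO LTDA";"1000,50"\n'.encode("iso-8859-1")

rows = list(read_rows(SAMPLE))
capital = parse_decimal_comma(rows[0][2])
```

Decoding explicitly as ISO-8859-1 matters here: opening the same bytes as UTF-8 would fail on the accented characters.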

THE PROBLEM NOBODY TALKS ABOUT

Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:
- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
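The cleanups in the list above can be sketched roughly like this. This is an illustration, not the pipeline's actual code: the function names are hypothetical, the date layout is assumed to be YYYYMMDD, and CNPJs are assumed to be 14 numeric digits.

```python
import re
from datetime import date, datetime
from typing import Optional


def normalize_cnpj(value: str) -> str:
    """Strip punctuation so both CNPJ formats become 14 plain digits."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 14:
        raise ValueError(f"expected 14 digits, got {len(digits)}: {value!r}")
    return digits


def parse_date(value: str) -> Optional[date]:
    """Treat '' and '00000000' as NULL; assume YYYYMMDD otherwise."""
    value = value.strip()
    if not value or value == "00000000":
        return None
    return datetime.strptime(value, "%Y%m%d").date()


def fix_double_encoding(value: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    If the round trip fails, the text was not double-encoded, so it is
    returned unchanged.
    """
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value
```

Returning `None` for the placeholder date lets the database store a real NULL instead of an invalid date. Future-dated rows still parse normally, so whether to keep them is a separate decision.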

WHAT YOU CAN NOW DO IN SQL

Count the fintechs in São Paulo state founded after January 1, 2020:

    SELECT COUNT(*)
    FROM estabelecimentos e
    JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
    WHERE e.uf = 'SP'
      AND e.cnae_fiscal_principal LIKE '64%'
      AND e.data_inicio_atividade > '2020-01-01'
      AND emp.porte IN ('01', '03');

Result: 8,426 companies (as of Jun 2025)

SURPRISING THINGS I FOUND

1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.

2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.

3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they're all government entities from before the classification system existed.

4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.

5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.

TECHNICAL BITS

The pipeline:
- Auto-detects your RAM and adapts strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that official files omit
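The first two points above could look something like the sketch below. It is not taken from the repository: `choose_strategy`, `rows_to_copy_buffer`, the "batched" middle tier, and the `razao_social` column are assumptions for illustration. Only the <8GB and >32GB thresholds and the `empresas` / `cnpj_basico` names come from the post.

```python
import csv
import io


def choose_strategy(ram_gb: float) -> str:
    """Pick a loading strategy from available RAM (in GB)."""
    if ram_gb < 8:
        return "streaming"
    if ram_gb > 32:
        return "parallel"
    # Assumption: the post does not say what happens between 8 and 32 GB.
    return "batched"


def rows_to_copy_buffer(rows) -> io.StringIO:
    """Serialize cleaned rows into an in-memory CSV buffer for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf


# Column list is hypothetical; adjust to the real table definition.
COPY_SQL = (
    "COPY empresas (cnpj_basico, razao_social) "
    "FROM STDIN WITH (FORMAT csv, DELIMITER ';')"
)

# With psycopg2 and an open cursor, the buffer could be sent in one call:
#     cursor.copy_expert(COPY_SQL, rows_to_copy_buffer(batch))
```

COPY tends to beat row-by-row INSERT because the server parses a single statement and receives the data as one stream, rather than handling a round trip per row.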

Processing 60M companies:
- VPS (4GB RAM): ~8 hours
- Desktop (16GB): ~2 hours
- Server (64GB): ~1 hour

THE CODE

It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline

One command setup:

    docker-compose --profile postgres up --build

Or if you prefer Python:

    python setup.py  # Interactive configuration
    python main.py   # Start processing

WHY OPEN SOURCE THIS?

I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent 2 months building ETL that processes 10% of the data before crashing.

The Brazilian tech ecosystem loses tons of hours reinventing this wheel. That's time that could be spent building actual products.

COMMUNITY RESPONSE

I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."

QUESTIONS FOR HN

1. What other government datasets are this painful? I'm thinking of tackling more.

2. For those who've worked with government data - what's your worst encoding/format horror story?

3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.

The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.

Comments

owebmaster•7mo ago
Hey brother, that's quite an interesting project, thanks for publishing it! Let's make this great source of data available to more people, not only SERASA/Experian.
