
Show HN: I Processed Brazil's 85GB Open Company Registry So You Don't Have To

https://github.com/cnpj-chat/cnpj-data-pipeline
4•caiopizzol•7mo ago
Last year, I needed to find all software companies in São Paulo for a project. The good news: Brazil publishes all company registrations as open data at dados.gov.br. The bad news: it's 85GB of ISO-8859-1 encoded CSVs with semicolon delimiters, decimal commas, and dates like "00000000" meaning NULL. My laptop crashed after 4 hours trying to import just one file.

So I built a pipeline that handles this mess: https://github.com/cnpj-chat/cnpj-data-pipeline
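To make the format quirks concrete, here is a minimal sketch of reading such rows with the Python standard library. It is not the pipeline's actual code: the four-column layout and the sample rows are hypothetical, and the YYYYMMDD date format is an assumption inferred from the "00000000" placeholder.

```python
import csv
import io
from datetime import date, datetime
from decimal import Decimal
from typing import Iterator, List, Optional


def parse_date(raw: str) -> Optional[date]:
    """Return a date, or None for empty values and the "00000000" placeholder."""
    raw = raw.strip()
    if not raw or raw == "00000000":
        return None
    # Assumption: real dates are written as YYYYMMDD.
    return datetime.strptime(raw, "%Y%m%d").date()


def parse_decimal(raw: str) -> Optional[Decimal]:
    """Convert a decimal-comma number such as "1000,50" to a Decimal."""
    raw = raw.strip()
    if not raw:
        return None
    return Decimal(raw.replace(",", "."))


def read_rows(raw_bytes: bytes) -> Iterator[List[str]]:
    """Decode ISO-8859-1 bytes and yield rows split on semicolons."""
    text = raw_bytes.decode("iso-8859-1")
    yield from csv.reader(io.StringIO(text), delimiter=";", quotechar='"')


# Hypothetical two-row sample, not real registry data.
sample = (
    '"12345678";"PADARIA SÃO JOÃO";"1000,50";"20200315"\n'
    '"87654321";"EXEMPLO LTDA";"0,00";"00000000"\n'
).encode("iso-8859-1")

rows = [
    (cnpj, name, parse_decimal(capital), parse_date(started))
    for cnpj, name, capital, started in read_rows(sample)
]
```

Here the second sample row ends up with `None` as its start date rather than an invalid calendar date.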

THE PROBLEM NOBODY TALKS ABOUT

Every Brazilian startup eventually needs this data - for market research, lead generation, or compliance. But everyone wastes weeks:

- Parsing "12.345.678/0001-90" vs "12345678000190" CNPJ formats
- Discovering that "00000000" isn't January 0th, year 0
- Finding out some companies are "founded" in 2027 (yes, the future)
- Dealing with double-encoded UTF-8 wrapped in Latin-1
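Two of these problems can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's implementation: `normalize_cnpj` and `repair_mojibake` are hypothetical helper names, and the CNPJ helper assumes purely numeric 14-digit identifiers like the ones quoted above.

```python
import re


def normalize_cnpj(raw: str) -> str:
    """Strip punctuation so both CNPJ spellings compare equal."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 14:
        raise ValueError(f"expected 14 digits, got {len(digits)}: {raw!r}")
    return digits


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    If the round trip fails, the text was probably fine already and is
    returned unchanged. This is a heuristic: a legitimate Latin-1 string
    that happens to form valid UTF-8 would be altered.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(normalize_cnpj("12.345.678/0001-90"))  # 12345678000190
print(repair_mojibake("SÃ£o Paulo"))         # São Paulo
```

Returning the input unchanged on failure means the repair can be applied to every text column without first checking which ones are damaged.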

WHAT YOU CAN NOW DO IN SQL

Find all fintechs in São Paulo founded after January 1, 2020:

SELECT COUNT(*)
FROM estabelecimentos e
JOIN empresas emp ON e.cnpj_basico = emp.cnpj_basico
WHERE e.uf = 'SP'
  AND e.cnae_fiscal_principal LIKE '64%'
  AND e.data_inicio_atividade > '2020-01-01'
  AND emp.porte IN ('01', '03');

Result: 8,426 companies (as of Jun 2025)

SURPRISING THINGS I FOUND

1. The 3am Company Club: 4,812 companies were "founded" at exactly 3:00:00 AM. Turns out this is a database migration artifact from the 1990s.

2. Ghost Companies: ~2% of "active" companies have no establishments (no address, no employees, nothing). They exist only on paper.

3. The CNAE 9999999 Mystery: 147 companies have an economic activity code that doesn't exist in any reference table. When I tracked them down, they all turned out to be government entities that predate the classification system.

4. Future Founders: 89 companies have founding dates in 2025-2027. Not errors - they're pre-registered for future government projects.

5. The MEI Boom: Micro-entrepreneurs (MEI) grew 400% during COVID. You can actually see the exact week in March 2020 when registrations spiked.
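Checks like the ghost-company and future-founder findings come down to short queries. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere. The table and column names come from the query earlier in the post, while the rows and the reference date are hypothetical. It also skips the "active" status filter, since the post does not name that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE empresas (cnpj_basico TEXT PRIMARY KEY);
    CREATE TABLE estabelecimentos (
        cnpj_basico TEXT,
        data_inicio_atividade TEXT
    );
    -- Hypothetical rows for illustration only.
    INSERT INTO empresas VALUES ('11111111'), ('22222222'), ('33333333');
    INSERT INTO estabelecimentos VALUES
        ('11111111', '2019-05-02'),
        ('33333333', '2027-01-15');
    """
)

# Anti-join: companies with no establishment row at all.
ghosts = conn.execute(
    """
    SELECT emp.cnpj_basico
    FROM empresas emp
    LEFT JOIN estabelecimentos e ON e.cnpj_basico = emp.cnpj_basico
    WHERE e.cnpj_basico IS NULL
    """
).fetchall()

# Establishments whose start date falls after a chosen reference date.
# ISO-formatted date strings compare correctly as plain text.
future = conn.execute(
    "SELECT cnpj_basico FROM estabelecimentos WHERE data_inicio_atividade > ?",
    ("2025-06-30",),
).fetchall()

print(ghosts)  # [('22222222',)]
print(future)  # [('33333333',)]
```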

TECHNICAL BITS

The pipeline:

- Auto-detects your RAM and adapts strategy (streaming for <8GB, parallel for >32GB)
- Uses PostgreSQL COPY instead of INSERT (10x faster)
- Handles incremental updates (monthly data refresh)
- Includes missing reference data from SERPRO that official files omit
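The RAM-based strategy selection could look roughly like the sketch below. Only the two thresholds come from the post. The function names, the "chunked" middle tier, and the use of POSIX `os.sysconf` to read physical memory are assumptions made for illustration, so the real pipeline may detect and branch differently.

```python
import os
from typing import Optional


def total_ram_gb() -> Optional[float]:
    """Best-effort physical RAM in GB via POSIX sysconf; None if unavailable."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and some names are not
        # defined on every platform.
        return None
    return page_size * page_count / 1024**3


def choose_strategy(ram_gb: float) -> str:
    """Map available RAM to a processing strategy."""
    if ram_gb < 8:
        return "streaming"  # threshold stated in the post
    if ram_gb > 32:
        return "parallel"   # threshold stated in the post
    return "chunked"        # hypothetical name for the unspecified middle tier


detected = total_ram_gb()
if detected is not None:
    print(f"{detected:.1f} GB -> {choose_strategy(detected)}")
```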

Processing 60M companies:

- VPS (4GB RAM): ~8 hours
- Desktop (16GB): ~2 hours
- Server (64GB): ~1 hour
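For the COPY-based loading mentioned in the pipeline list, a minimal sketch is to write cleaned rows into an in-memory CSV buffer and stream it to PostgreSQL in one statement. It assumes a psycopg2 cursor, whose `copy_expert(sql, file)` method accepts a file-like object. The column list is taken from the query earlier in the post, the helper names are hypothetical, and this is not necessarily how the pipeline does it.

```python
import csv
import io
from typing import Iterable, Sequence

# Assumed target columns; adjust to the real table definition.
COPY_SQL = "COPY empresas (cnpj_basico, porte) FROM STDIN WITH (FORMAT csv)"


def rows_to_csv_buffer(rows: Iterable[Sequence[str]]) -> io.StringIO:
    """Serialize rows to an in-memory CSV file positioned at the start."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)
    return buf


def copy_rows(cursor, rows: Iterable[Sequence[str]]) -> None:
    """Bulk-load rows with one COPY instead of many INSERT statements.

    `cursor` is assumed to be a psycopg2 cursor or anything else that
    exposes a compatible copy_expert(sql, file) method.
    """
    cursor.copy_expert(COPY_SQL, rows_to_csv_buffer(rows))
```

With a real connection this would be called as `copy_rows(conn.cursor(), batch)` followed by `conn.commit()`. For very large files, sending batches of bounded size keeps the buffer from growing with the input.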

THE CODE

It's MIT licensed: https://github.com/cnpj-chat/cnpj-data-pipeline

One command setup:

docker-compose --profile postgres up --build

Or if you prefer Python:

python setup.py  # Interactive configuration
python main.py   # Start processing

WHY OPEN SOURCE THIS?

I've watched too many devs waste weeks on this same problem. One founder told me they hired a consultancy for R$30k to deliver... a broken CSV parser. Another spent 2 months building ETL that processes 10% of the data before crashing.

The Brazilian tech ecosystem loses tons of hours reinventing this wheel. That's time that could be spent building actual products.

COMMUNITY RESPONSE

I've shared this with r/dataengineering and r/brdev, and the response has been incredible - over 50k developers have viewed it, and I've already incorporated dozens of improvements from their feedback. The most common reaction? "I wish I had this last month when I spent 2 weeks fighting these files."

QUESTIONS FOR HN

1. What other government datasets are this painful? I'm thinking of tackling more.

2. For those who've worked with government data - what's your worst encoding/format horror story?

3. Is there interest in a hosted API version? The infrastructure would be ~$100/month to serve queries.

The worst part? This data has been "open" since 2012. But open != accessible. Sometimes the best code is the code that deals with reality's mess so others don't have to.

Comments

owebmaster•7mo ago
Hey brother, that's quite the interesting project, thanks for publishing it! Let's make this great source of data available to more people, not only SERASA/Experian.