frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

There's no such thing as "tech" (Ten years later)

1•dtjb•26s ago•0 comments

What Really Killed Flash Player: A Six-Year Campaign of Deliberate Platform Work

https://medium.com/@aglaforge/what-really-killed-flash-player-a-six-year-campaign-of-deliberate-p...
1•jbegley•57s ago•0 comments

Ask HN: Anyone orchestrating multiple AI coding agents in parallel?

1•buildingwdavid•2m ago•0 comments

Show HN: Knowledge-Bank

https://github.com/gabrywu-public/knowledge-bank
1•gabrywu•7m ago•0 comments

Show HN: The Codeverse Hub Linux

https://github.com/TheCodeVerseHub/CodeVerseLinuxDistro
3•sinisterMage•8m ago•0 comments

Take a trip to Japan's Dododo Land, the most irritating place on Earth

https://soranews24.com/2026/02/07/take-a-trip-to-japans-dododo-land-the-most-irritating-place-on-...
2•zdw•9m ago•0 comments

British drivers over 70 to face eye tests every three years

https://www.bbc.com/news/articles/c205nxy0p31o
6•bookofjoe•9m ago•1 comments

BookTalk: A Reading Companion That Captures Your Voice

https://github.com/bramses/BookTalk
1•_bramses•10m ago•0 comments

Is AI "good" yet? – tracking HN's sentiment on AI coding

https://www.is-ai-good-yet.com/#home
1•ilyaizen•11m ago•1 comments

Show HN: Amdb – Tree-sitter based memory for AI agents (Rust)

https://github.com/BETAER-08/amdb
1•try_betaer•11m ago•0 comments

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
2•anhxuan•11m ago•0 comments

Show HN: Seedance 2.0 Release

https://seedancy2.com/
2•funnycoding•12m ago•0 comments

Leisure Suit Larry's Al Lowe on model trains, funny deaths and Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
1•thelok•12m ago•0 comments

Towards Self-Driving Codebases

https://cursor.com/blog/self-driving-codebases
1•edwinarbus•12m ago•0 comments

VCF West: Whirlwind Software Restoration – Guy Fedorkow [video]

https://www.youtube.com/watch?v=YLoXodz1N9A
1•stmw•13m ago•1 comments

Show HN: COGext – A minimalist, open-source system monitor for Chrome (<550KB)

https://github.com/tchoa91/cog-ext
1•tchoa91•14m ago•1 comments

FOSDEM 26 – My Hallway Track Takeaways

https://sluongng.substack.com/p/fosdem-26-my-hallway-track-takeaways
1•birdculture•15m ago•0 comments

Show HN: Env-shelf – Open-source desktop app to manage .env files

https://env-shelf.vercel.app/
1•ivanglpz•18m ago•0 comments

Show HN: Almostnode – Run Node.js, Next.js, and Express in the Browser

https://almostnode.dev/
1•PetrBrzyBrzek•19m ago•0 comments

Dell support (and hardware) is so bad, I almost sued them

https://blog.joshattic.us/posts/2026-02-07-dell-support-lawsuit
1•radeeyate•19m ago•0 comments

Project Pterodactyl: Incremental Architecture

https://www.jonmsterling.com/01K7/
1•matt_d•20m ago•0 comments

Styling: Search-Text and Other Highlight-Y Pseudo-Elements

https://css-tricks.com/how-to-style-the-new-search-text-and-other-highlight-pseudo-elements/
1•blenderob•21m ago•0 comments

Crypto firm accidentally sends $40B in Bitcoin to users

https://finance.yahoo.com/news/crypto-firm-accidentally-sends-40-055054321.html
1•CommonGuy•22m ago•0 comments

Magnetic fields can change carbon diffusion in steel

https://www.sciencedaily.com/releases/2026/01/260125083427.htm
1•fanf2•23m ago•0 comments

Fantasy football that celebrates great games

https://www.silvestar.codes/articles/ultigamemate/
1•blenderob•23m ago•0 comments

Show HN: Animalese

https://animalese.barcoloudly.com/
1•noreplica•23m ago•0 comments

StrongDM's AI team build serious software without even looking at the code

https://simonwillison.net/2026/Feb/7/software-factory/
3•simonw•24m ago•0 comments

John Haugeland on the failure of micro-worlds

https://blog.plover.com/tech/gpt/micro-worlds.html
1•blenderob•24m ago•0 comments

Show HN: Velocity - Free/Cheaper Linear Clone but with MCP for agents

https://velocity.quest
2•kevinelliott•25m ago•2 comments

Corning Invented a New Fiber-Optic Cable for AI and Landed a $6B Meta Deal [video]

https://www.youtube.com/watch?v=Y3KLbc5DlRs
1•ksec•26m ago•0 comments
Open in hackernews

Show HN: Cleaned 2.7M French Wikipedia JSON articles (full dataset)"

https://huggingface.co/datasets/YoloMG/wikipedia-fr-2.7m-clean-json
1•zeronex•2mo ago
I’ve been working on this project mostly out of curiosity and because I like cleaning messy data. I downloaded the full French Wikipedia dump (the raw XML + wikitext one) and built a script that extracts every article and turns it into a clean JSON file.

Nothing fancy — just a pipeline I wrote myself that:

reads the XML stream

pulls out each page

removes wikitext and leftover markup

rebuilds the sections

parses the infobox into a real JSON object

extracts categories, links, etc.

and then saves everything as one JSON file per article

The result is around 2.7 million JSON files, each representing a single Wikipedia article in a format that’s directly usable for NLP or LLM experiments.

This wasn’t meant to compete with existing datasets — I just wanted to understand how to process the dump properly and build something clean from scratch. Since it turned out well, I’m sharing it in case it helps anyone.

I’m also running the same process on the full English dump (around 6.2M pages). Still in progress.