Nothing fancy — just a pipeline I wrote myself that:
reads the XML stream
pulls out each page
removes wikitext and leftover markup
rebuilds the sections
parses the infobox into a real JSON object
extracts categories, links, etc.
and then saves everything as one JSON file per article (rough sketch below)
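
For anyone curious what that looks like in code, here's a rough sketch of that kind of pipeline. It's not my exact script: it assumes lxml and mwparserfromhell, a bz2-compressed dump at a placeholder path, and very naive filename handling.

```python
# Rough sketch only: placeholder paths, naive filename handling, no redirect
# filtering or error handling.
import bz2
import json
import pathlib

import mwparserfromhell
from lxml import etree

DUMP = "wiki-latest-pages-articles.xml.bz2"   # placeholder path
OUT = pathlib.Path("articles_json")
OUT.mkdir(exist_ok=True)


def parse_article(title: str, wikitext: str) -> dict:
    """Turn one page's raw wikitext into a flat JSON-friendly dict."""
    code = mwparserfromhell.parse(wikitext)

    # Infobox: take the first template whose name starts with "Infobox".
    infobox = {}
    for tpl in code.filter_templates():
        if str(tpl.name).strip().lower().startswith("infobox"):
            infobox = {str(p.name).strip(): str(p.value).strip() for p in tpl.params}
            break

    # Categories and internal links both come from wikilinks.
    links, categories = [], []
    for wl in code.filter_wikilinks():
        target = str(wl.title).strip()
        if target.lower().startswith("category:"):
            categories.append(target.split(":", 1)[1])
        else:
            links.append(target)

    # Sections: heading title plus markup-stripped text.
    sections = []
    for sec in code.get_sections(include_lead=True, include_headings=True):
        headings = sec.filter_headings()
        heading = str(headings[0].title).strip() if headings else "Lead"
        sections.append({"heading": heading, "text": sec.strip_code().strip()})

    return {
        "title": title,
        "text": code.strip_code().strip(),
        "sections": sections,
        "infobox": infobox,
        "categories": categories,
        "links": links,
    }


def iter_pages(path: str):
    """Stream (title, wikitext) pairs from the compressed XML dump."""
    with bz2.open(path, "rb") as f:
        for _, elem in etree.iterparse(f, events=("end",)):
            if not elem.tag.endswith("}page"):
                continue
            ns = elem.tag[: elem.tag.index("}") + 1]           # XML namespace
            title = elem.findtext(f"{ns}title")
            text = elem.findtext(f"{ns}revision/{ns}text") or ""
            if elem.findtext(f"{ns}ns") == "0":                # articles only
                yield title, text
            elem.clear()                                       # keep memory flat
            while elem.getprevious() is not None:
                del elem.getparent()[0]


if __name__ == "__main__":
    for title, wikitext in iter_pages(DUMP):
        article = parse_article(title, wikitext)
        safe_name = title.replace("/", "_")                    # naive, but shows the idea
        (OUT / f"{safe_name}.json").write_text(
            json.dumps(article, ensure_ascii=False), encoding="utf-8"
        )
```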
The result is around 2.7 million JSON files, each representing a single Wikipedia article in a format that’s directly usable for NLP or LLM experiments.
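
To give an idea of the shape, consuming an article file is just a json.load away. The field names here mirror the sketch above and are illustrative, not a fixed schema, and the filename is a placeholder.

```python
# Illustrative only: field names follow the sketch above, path is a placeholder.
import json
import pathlib

path = pathlib.Path("articles_json") / "Some_Article.json"
article = json.loads(path.read_text(encoding="utf-8"))

print(article["title"])
print(len(article["sections"]), "sections,", len(article["links"]), "links")
print(article["categories"][:5])
```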
This wasn’t meant to compete with existing datasets — I just wanted to understand how to process the dump properly and build something clean from scratch. Since it turned out well, I’m sharing it in case it helps anyone.
I’m also running the same process on the full English dump (around 6.2M pages). Still in progress.