Nothing fancy — just a pipeline I wrote myself that:
reads the XML stream
pulls out each page
removes wikitext and leftover markup
rebuilds the sections
parses the infobox into a real JSON object
extracts categories, links, etc.
and then saves everything as one JSON file per article (rough sketch below)
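
For anyone curious what that looks like in code, here's a rough sketch of that kind of pipeline. It's not my exact script: it assumes lxml and mwparserfromhell, a bz2-compressed dump at a placeholder path, and very naive filename handling.

```python
# Rough sketch only: placeholder paths, naive filename handling, no redirect
# filtering or error handling.
import bz2
import json
import pathlib

import mwparserfromhell
from lxml import etree

DUMP = "wiki-latest-pages-articles.xml.bz2"   # placeholder path
OUT = pathlib.Path("articles_json")
OUT.mkdir(exist_ok=True)


def parse_article(title: str, wikitext: str) -> dict:
    """Turn one page's raw wikitext into a flat JSON-friendly dict."""
    code = mwparserfromhell.parse(wikitext)

    # Infobox: take the first template whose name starts with "Infobox".
    infobox = {}
    for tpl in code.filter_templates():
        if str(tpl.name).strip().lower().startswith("infobox"):
            infobox = {str(p.name).strip(): str(p.value).strip() for p in tpl.params}
            break

    # Categories and internal links both come from wikilinks.
    links, categories = [], []
    for wl in code.filter_wikilinks():
        target = str(wl.title).strip()
        if target.lower().startswith("category:"):
            categories.append(target.split(":", 1)[1])
        else:
            links.append(target)

    # Sections: heading title plus markup-stripped text.
    sections = []
    for sec in code.get_sections(include_lead=True, include_headings=True):
        headings = sec.filter_headings()
        heading = str(headings[0].title).strip() if headings else "Lead"
        sections.append({"heading": heading, "text": sec.strip_code().strip()})

    return {
        "title": title,
        "text": code.strip_code().strip(),
        "sections": sections,
        "infobox": infobox,
        "categories": categories,
        "links": links,
    }


def iter_pages(path: str):
    """Stream (title, wikitext) pairs from the compressed XML dump."""
    with bz2.open(path, "rb") as f:
        for _, elem in etree.iterparse(f, events=("end",)):
            if not elem.tag.endswith("}page"):
                continue
            ns = elem.tag[: elem.tag.index("}") + 1]           # XML namespace
            title = elem.findtext(f"{ns}title")
            text = elem.findtext(f"{ns}revision/{ns}text") or ""
            if elem.findtext(f"{ns}ns") == "0":                # articles only
                yield title, text
            elem.clear()                                       # keep memory flat
            while elem.getprevious() is not None:
                del elem.getparent()[0]


if __name__ == "__main__":
    for title, wikitext in iter_pages(DUMP):
        article = parse_article(title, wikitext)
        safe_name = title.replace("/", "_")                    # naive, but shows the idea
        (OUT / f"{safe_name}.json").write_text(
            json.dumps(article, ensure_ascii=False), encoding="utf-8"
        )
```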
The result is around 2.7 million JSON files, each representing a single Wikipedia article in a format that’s directly usable for NLP or LLM experiments.
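
To give an idea of the shape, consuming an article file is just a json.load away. The field names here mirror the sketch above and are illustrative, not a fixed schema, and the filename is a placeholder.

```python
# Illustrative only: field names follow the sketch above, path is a placeholder.
import json
import pathlib

path = pathlib.Path("articles_json") / "Some_Article.json"
article = json.loads(path.read_text(encoding="utf-8"))

print(article["title"])
print(len(article["sections"]), "sections,", len(article["links"]), "links")
print(article["categories"][:5])
```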
This wasn’t meant to compete with existing datasets — I just wanted to understand how to process the dump properly and build something clean from scratch. Since it turned out well, I’m sharing it in case it helps anyone.
I’m also running the same process on the full English dump (around 6.2M pages). Still in progress.