frontpage.

Show HN: Cleaned 2.7M French Wikipedia JSON articles (full dataset)

https://huggingface.co/datasets/YoloMG/wikipedia-fr-2.7m-clean-json
1•zeronex•1h ago
I’ve been working on this project mostly out of curiosity and because I like cleaning messy data. I downloaded the full French Wikipedia dump (the raw XML + wikitext one) and built a script that extracts every article and turns it into a clean JSON file.

Nothing fancy — just a pipeline I wrote myself that:

reads the XML stream

pulls out each page

removes wikitext and leftover markup

rebuilds the sections

parses the infobox into a real JSON object

extracts categories, links, etc.

and then saves everything as one JSON file per article
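The steps above can be sketched roughly as follows. This is not the author's script, only a minimal illustration under stated assumptions: the sample dump, the regular expressions, the helper names (`local_name`, `iter_pages`, `to_article`), and the output field names are all hypothetical. Real wikitext has nested templates and many edge cases that simple regexes like these will not handle, so a dedicated wikitext parser would likely be needed in practice.

```python
import io
import json
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki XML dump. Real dumps declare an export
# namespace on <mediawiki>, so tags are matched by local name below.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Paris</title>
    <revision>
      <text>{{Infobox Ville|nom=Paris|pays=France}}
'''Paris''' est la [[capitale]] de la [[France]].
== Histoire ==
Texte sur l'histoire.
[[Catégorie:Capitale]]</text>
    </revision>
  </page>
</mediawiki>"""

# Illustrative patterns only (assumption): flat infobox, simple links.
INFOBOX_RE = re.compile(r"\{\{Infobox[^|}]*\|(.*?)\}\}", re.DOTALL)
CATEGORY_RE = re.compile(r"\[\[Catégorie:([^\]|]+)[^\]]*\]\]")
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
HEADING_RE = re.compile(r"==+\s*(.*?)\s*==+")


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs without loading the whole dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the subtree that was just processed


def to_article(title, wikitext):
    """Turn one page into a plain dict (hypothetical output schema)."""
    infobox = {}
    match = INFOBOX_RE.search(wikitext)
    if match:
        for field in match.group(1).split("|"):
            key, sep, value = field.partition("=")
            if sep:
                infobox[key.strip()] = value.strip()

    categories = CATEGORY_RE.findall(wikitext)
    body = CATEGORY_RE.sub("", INFOBOX_RE.sub("", wikitext))
    links = [target for target, _label in LINK_RE.findall(body)]
    # Replace [[target|label]] with the label, or the target if no label.
    body = LINK_RE.sub(lambda m: m.group(2) or m.group(1), body)
    body = body.replace("'''", "").replace("''", "")  # bold / italic marks

    # Rebuild sections: text before the first heading goes to "Introduction".
    sections = []
    current = {"heading": "Introduction", "lines": []}
    for line in (raw.strip() for raw in body.splitlines()):
        heading = HEADING_RE.fullmatch(line)
        if heading:
            sections.append(current)
            current = {"heading": heading.group(1), "lines": []}
        elif line:
            current["lines"].append(line)
    sections.append(current)

    return {
        "title": title,
        "infobox": infobox,
        "sections": [
            {"heading": s["heading"], "text": " ".join(s["lines"])}
            for s in sections
        ],
        "categories": categories,
        "links": links,
    }


stream = io.BytesIO(SAMPLE_DUMP.encode("utf-8"))
articles = [to_article(title, text) for title, text in iter_pages(stream)]
print(json.dumps(articles[0], ensure_ascii=False, indent=2))
```

For a real dump, the in-memory sample would be replaced by an open file handle, and each dict would be written to its own file with `json.dump` instead of being collected in a list.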

The result is around 2.7 million JSON files, each representing a single Wikipedia article in a format that’s directly usable for NLP or LLM experiments.

This wasn’t meant to compete with existing datasets — I just wanted to understand how to process the dump properly and build something clean from scratch. Since it turned out well, I’m sharing it in case it helps anyone.

I’m also running the same process on the full English dump (around 6.2M pages). Still in progress.

An Escape from India's Air Pollution for Those Who Can Afford It

https://www.nytimes.com/2025/11/14/business/india-pollution-clean-air.html
1•cainxinth•2m ago•0 comments

Samsung hikes memory chip prices by up to 60% as shortage worsens, sources say

https://www.reuters.com/world/china/samsung-hikes-memory-chip-prices-by-up-60-shortage-worsens-so...
1•Brajeshwar•5m ago•1 comment

Show HN: StoryMotion – animated diagrams, step-by-step explainer animation maker

https://storymotion.video
1•chunza2542•5m ago•1 comment

Vercel Streamdown – Markdown for AI Streaming

https://streamdown.ai/
2•rob•9m ago•0 comments

European Tech News in 6 Languages

https://europedigital.cloud/en/news
2•Merinov•9m ago•3 comments

Why building a loved product is the #1 startup success factor – Sam Altman

https://onnetpulse.com/why-building-a-loved-product-is-the-1-startup-success-factor-sam-altman/
2•Contributor_G•9m ago•0 comments

Mullvad VPN present And Then? (Chat Control is back on the menu)

https://mullvad.net/en/blog/mullvad-vpn-present-and-then
1•dotcoma•14m ago•0 comments

Show HN: I made a fireplace for your wrist (and widgets)

4•kingofspain•16m ago•0 comments

MCP Is Anthropic's Biggest Mistake

https://medium.com/@anwarzaid76/mcp-is-anthropics-biggest-mistake-and-we-re-all-paying-for-it-b5d...
2•MindBreaker2605•17m ago•0 comments

A Trip Around Our Surprisingly Psychedelic Planet

https://nautil.us/a-trip-around-our-surprisingly-psychedelic-planet-1247451/
1•the-mitr•17m ago•0 comments

What if the aliens come and we just can't communicate?

https://arstechnica.com/science/2025/11/what-if-the-aliens-come-and-we-just-cant-communicate/
3•pseudolus•18m ago•0 comments

Gnome 50 Ends the X11 Era After Decades

https://linuxiac.com/gnome-50-ends-the-x11-era-after-decades/
3•upofadown•18m ago•0 comments

Ask HN: Is it possible to implement this button on a browser?

1•bguberfain•19m ago•0 comments

Artificially intelligent agents in the social and behavioral sciences: A history

https://arxiv.org/abs/2510.05743
1•Anon84•21m ago•0 comments

AI at the speed of light just became a possibility

https://techxplore.com/news/2025-11-ai-possibility.html
1•pseudolus•21m ago•0 comments

Ask HN: Is Java or Kotlin the best future programming language?

2•roschdal•22m ago•1 comment

2026 Hyundai Ioniq 9: American car-buyer tastes meet Korean EV tech

https://arstechnica.com/cars/2025/10/a-week-with-the-hyundai-ioniq-9-suv-what-we-liked-what-we-di...
1•PaulHoule•24m ago•0 comments

Elementary Symmetric Polynomials and Optimization

https://www.johndcook.com/blog/2025/11/12/elementary-symmetric-polynomials/
1•ibobev•24m ago•0 comments

Four Generalizations of the Pythagorean Theorem

https://www.johndcook.com/blog/2025/11/13/pythagorean-generalizations/
1•ibobev•25m ago•0 comments

Anthropic Says Chinese Hackers Used Its A.I. In Online Attack

https://www.nytimes.com/2025/11/14/business/chinese-hackers-artificial-intelligence.html
1•furcyd•25m ago•0 comments

Google Files Lawsuit to Dismantle 'Lighthouse' Smishing Kit

https://techoreon.com/google-sues-lighthouse-phishing-kit/
1•ashishgupta2209•25m ago•0 comments

The price of dynamic memory: Memory Access (2020)

https://johnnysswlab.com/the-price-of-dynamic-memory-memory-access/
1•signa11•25m ago•0 comments

These are the 37 donors helping pay for Trump's $300M White House ballroom

https://apnews.com/article/donors-to-trump-white-house-ballroom-d4dd174eeb30ac244354a5a25551a86b
2•teleforce•27m ago•0 comments

Show HN: Encore – Type-safe back end framework that generates infra from code

https://github.com/encoredev/encore
6•andout_•28m ago•1 comment

Programming principles for self taught front-end developers

https://piccalil.li/blog/programming-principles-for-self-taught-front-end-developers/
1•kilian•28m ago•0 comments

From Collaborators to Consumers: Have We Killed the Soul of Open Source?

https://my-notes.dragas.net/2025/06/19/from-collaborators-to-consumers-have-we-killed-the-soul-of...
1•upofadown•29m ago•0 comments

Mechatronic System Design

https://ocw.tudelft.nl/courses/mechatronic-system-design/
1•pillars•31m ago•0 comments

Port the Lua REPL to the RP2350 (Chinese)

https://www.ruanx.net/rp2350-lua/
2•uneven9434•31m ago•0 comments

The Future of Search: Will we still Google it?

https://www.lrb.co.uk/the-paper/v47/n21/donald-mackenzie/the-future-of-search
1•gHeadphone•32m ago•0 comments

HippoMaps: Multiscale cartography of human hippocampal organization

https://www.nature.com/articles/s41592-025-02783-3
1•bryanrasmussen•34m ago•0 comments