frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Famfamfam Silk icons – also with CSS spritesheet

https://github.com/legacy-icons/famfamfam-silk
1•thunderbong•20s ago•0 comments

Apple is the only Big Tech company whose capex declined last quarter

https://sherwood.news/tech/apple-is-the-only-big-tech-company-whose-capex-declined-last-quarter/
1•elsewhen•3m ago•0 comments

Reverse-Engineering Raiders of the Lost Ark for the Atari 2600

https://github.com/joshuanwalker/Raiders2600
2•todsacerdoti•4m ago•0 comments

Show HN: Deterministic NDJSON audit logs – v1.2 update (structural gaps)

https://github.com/yupme-bot/kernel-ndjson-proofs
1•Slaine•8m ago•0 comments

The Greater Copenhagen Region could be your friend's next career move

https://www.greatercphregion.com/friend-recruiter-program
1•mooreds•9m ago•0 comments

Do Not Confirm – Fiction by OpenClaw

https://thedailymolt.substack.com/p/do-not-confirm
1•jamesjyu•9m ago•0 comments

The Analytical Profile of Peas

https://www.fossanalytics.com/en/news-articles/more-industries/the-analytical-profile-of-peas
1•mooreds•9m ago•0 comments

Hallucinations in GPT5 – Can models say "I don't know" (June 2025)

https://jobswithgpt.com/blog/llm-eval-hallucinations-t20-cricket/
1•sp1982•9m ago•0 comments

What AI is good for, according to developers

https://github.blog/ai-and-ml/generative-ai/what-ai-is-actually-good-for-according-to-developers/
1•mooreds•9m ago•0 comments

OpenAI might pivot to the "most addictive digital friend" or face extinction

https://twitter.com/lebed2045/status/2020184853271167186
1•lebed2045•11m ago•2 comments

Show HN: Know how your SaaS is doing in 30 seconds

https://anypanel.io
1•dasfelix•11m ago•0 comments

ClawdBot Ordered Me Lunch

https://nickalexander.org/drafts/auto-sandwich.html
2•nick007•12m ago•0 comments

What the News media thinks about your Indian stock investments

https://stocktrends.numerical.works/
1•mindaslab•13m ago•0 comments

Running Lua on a tiny console from 2001

https://ivie.codes/page/pokemon-mini-lua
1•Charmunk•14m ago•0 comments

Google and Microsoft Paying Creators $500K+ to Promote AI Tools

https://www.cnbc.com/2026/02/06/google-microsoft-pay-creators-500000-and-more-to-promote-ai.html
2•belter•16m ago•0 comments

New filtration technology could be game-changer in removal of PFAS

https://www.theguardian.com/environment/2026/jan/23/pfas-forever-chemicals-filtration
1•PaulHoule•17m ago•0 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
2•momciloo•17m ago•0 comments

Kinda Surprised by Seadance2's Moderation

https://seedanceai.me/
1•ri-vai•17m ago•2 comments

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
2•valyala•17m ago•0 comments

Django scales. Stop blaming the framework (part 1 of 3)

https://medium.com/@tk512/django-scales-stop-blaming-the-framework-part-1-of-3-a2b5b0ff811f
1•sgt•18m ago•0 comments

Malwarebytes Is Now in ChatGPT

https://www.malwarebytes.com/blog/product/2026/02/scam-checking-just-got-easier-malwarebytes-is-n...
1•m-hodges•18m ago•0 comments

Thoughts on the job market in the age of LLMs

https://www.interconnects.ai/p/thoughts-on-the-hiring-market-in
1•gmays•18m ago•0 comments

Show HN: Stacky – certain block game clone

https://www.susmel.com/stacky/
2•Keyframe•22m ago•0 comments

AIII: A public benchmark for AI narrative and political independence

https://github.com/GRMPZQUIDOS/AIII
1•GRMPZ23•22m ago•0 comments

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
2•valyala•23m ago•0 comments

The API Is a Dead End; Machines Need a Labor Economy

1•bot_uid_life•24m ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•Jyaif•25m ago•0 comments

New wave of GLP-1 drugs is coming–and they're stronger than Wegovy and Zepbound

https://www.scientificamerican.com/article/new-glp-1-weight-loss-drugs-are-coming-and-theyre-stro...
5•randycupertino•27m ago•0 comments

Convert tempo (BPM) to millisecond durations for musical note subdivisions

https://brylie.music/apps/bpm-calculator/
1•brylie•29m ago•0 comments

Show HN: Tasty A.F. - Use AI to Create Printable Recipe Cards

https://tastyaf.recipes/about
2•adammfrank•30m ago•0 comments
Open in hackernews

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

https://twitter.com/AiEleuther/status/1931021637991755906
3•EnricoShippole•8mo ago

Comments

EnricoShippole•8mo ago
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.