The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

https://twitter.com/AiEleuther/status/1931021637991755906
2•EnricoShippole•11h ago

Comments

EnricoShippole•11h ago
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7-billion-parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

AMC Says It Will Show More Ads Before Movies

https://www.nytimes.com/2025/06/06/business/movies-theaters-ads-amc.html
2•cebert•2m ago•1 comment

Getting C++ Hello World working on Windows (a comedy & tragedy)

https://sdegutis.github.io/blog/creating-cpp-hello-world.html
1•90s_dev•4m ago•0 comments

NASA delays next flight of Boeing's alternative to SpaceX Dragon

https://theedgemalaysia.com/node/758199
1•bookmtn•6m ago•0 comments

Can Schrodinger's Cat Factor Numbers?

https://mathpages.com/home/kmath013/kmath013.htm
1•gametorch•7m ago•0 comments

NASA Delays Next Flight of Boeing's Alternative to SpaceX Dragon

https://www.bloomberg.com/news/articles/2025-06-06/nasa-delays-next-flight-of-boeing-s-alternative-to-spacex-dragon
1•bookmtn•8m ago•0 comments

California AG vows to crack down on copper wire thefts in the state

https://abc7.com/post/california-ag-rob-bonta-vows-crack-down-copper-wire-thefts-state/16678391/
1•lxm•9m ago•0 comments

Show HN: A photo backup idea – to your own storage, not iCloud/Google

https://myphoto-vault.netlify.app/
1•Nainiket•14m ago•0 comments

Trump administration races to fix a big mistake: DOGE fired too many people

https://www.washingtonpost.com/business/2025/06/06/doge-staff-cuts-rehiring-federal-workers/
4•MilnerRoute•16m ago•0 comments

Getting Past Procrastination

https://spectrum.ieee.org/getting-past-procastination
1•WaitWaitWha•16m ago•0 comments

Reverse Engineering Cursor's LLM Client

https://www.tensorzero.com/blog/reverse-engineering-cursors-llm-client/
1•paulwarren•23m ago•0 comments

Show HN: Cpdown – Copy any webpage/YouTube subtitle as clean Markdown (LLM-ready)

https://github.com/ysm-dev/cpdown
1•ysm0622•26m ago•0 comments

Pentagon Disinformation Fueled America's UFO Mythology

https://www.wsj.com/politics/national-security/ufo-us-disinformation-45376f7e
1•doener•28m ago•0 comments

Open-source code repos open to supply chain attacks, researchers warn

https://www.scworld.com/news/open-source-code-repos-open-to-supply-chain-attacks-researchers-warn
2•ricecat•30m ago•0 comments

Ask HN: What non-AI projects are you working on?

2•kikki•37m ago•2 comments

Nintendo Switch 2 Teardown [video]

https://www.youtube.com/watch?v=RvD1OCHhhS0
1•Lwrless•38m ago•0 comments

TSA urges people to stop trying to use a Costco card as a sufficient Real ID

https://www.wsfa.com/2025/06/06/tsa-urges-people-stop-trying-use-costco-card-sufficient-real-id/
2•sharkweek•41m ago•0 comments

The reason Indians are lost

https://www.economist.com/asia/2025/06/05/the-real-reason-indians-are-lost
2•RestlessMind•48m ago•0 comments

Ask HN: Why are job descriptions and resumes so bad?

1•throwaway123198•49m ago•0 comments

Show HN: Pcrassist.com – AI powered report assistant for EMTs

https://pcrassist.com/
1•josdijkstra•58m ago•0 comments

Error Monads the Hard Way

https://articles.pragdave.me/p/error-monads-the-hard-way
1•thunderbong•1h ago•0 comments

Show HN: C++ SFML Game Engine for Nintendo Switch, Web (HTML5), PC and Mobile

https://github.com/Is-Daouda/is-Engine
1•Is_Daouda•1h ago•0 comments

Musk's XAI Is Trying to Borrow $5B While His Relationship with Trump Blows Up

https://www.wsj.com/finance/musks-xai-is-trying-to-borrow-5-billion-while-his-relationship-with-trump-blows-up-4b963361
2•TheAlchemist•1h ago•0 comments

We Should Immediately Nationalize SpaceX and Starlink

https://jacobin.com/2025/06/musk-trump-nationalize-spacex-starlink
4•Improvement•1h ago•8 comments

ACLU sues Sonoma County, alleges illegal drone surveillance program

https://www.ktvu.com/news/aclu-sues-sonoma-county-alleges-illegal-drone-surveillance-program
3•walterbell•1h ago•0 comments

Show HN: Email Scraper for Instagram

https://chromewebstore.google.com/detail/email-scraper-for-ins/nhgbjmidfpboihkaechkkmbiimecddda
1•qwikhost•1h ago•0 comments

A New System Aims to Save Injured Brains and Lives

https://www.nytimes.com/2025/05/20/health/traumatic-brain-injury-tbi-guidelines.html
1•bookofjoe•1h ago•1 comment

How to Turn an Acquaintance into a Friend

https://talk.bradwoods.io/blog/generous-with-disclosure/
3•bradwoodsio•1h ago•0 comments

Show HN: We built a free AI assistant that finds Amazon products instantly

https://www.sweepvalet.com/
2•felixthecat23•1h ago•0 comments

Ask HN: A Tetris variant with greater tactical and strategic depth?

2•amichail•1h ago•0 comments

Ask HN: Tacit knowledge video you've seen?

1•rahimnathwani•1h ago•0 comments