NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

53•sdpmas•2h ago

Comments

suddenlybananas•1h ago

Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.

sdpmas•1h ago

hey, it's Samip (behind the Slowrun repo). yeah that's a fair point, we will mention them in the blog. but there are a couple of major differences: 1. our emphasis is on using more compute to get better data efficiency. this is important because there are lots of hacky changes that will get lower loss, but when compared to general methods that leverage a lot of compute, they don't do so well. and you can already see how this emphasis on compute leads to different methods from BabyLM! 2. our reasoning behind the repo has nothing to do with how much data a child sees, and our dataset is not tailored towards that either. it's simple pretraining on a random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding those.

soraki_soladead•1h ago

also, BabyLM is more of a conference track / workshop than an open-repo competition which creates a different vibe

archermarks•51m ago

Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.

sdpmas•44m ago

yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts only tiny bits of information. but over time, we will switch the validation set to some other random subset of FineWeb or even entirely OOD datasets!

lzaborowski•22m ago

I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.

navvyeanand•16m ago

Amazing job!
