Reinforcement learning, explained with a minimum of math and jargon

https://www.understandingai.org/p/reinforcement-learning-explained

192•JnBrymn•7mo ago

Comments

mnkv•7mo ago

reasonable post with a decent analogy explaining on-policy learning, only major thing I take issue with is

> Reinforcement learning is a technical subject—there are whole textbooks written about it.

and then linking to the still-WIP RLHF book instead of the standard book on RL: Sutton & Barto.

dawnofdusk•7mo ago

Haha, that's crazy. I'm so used to reading RL papers that when the blog linked to a textbook about RL, I just filled in Sutton & Barto without clicking on the link or thinking any further about the matter.

I think the other criticism I have is that the historical importance of RLHF to ChatGPT is sort of sidelined, and the author at the beginning pinpoints something like the rise of agents as the beginning of the influence of RL in language modelling. In fact, the first LLM that attained widespread success was ChatGPT, and the secret sauce was RLHF... no need to start the story so late in 2023-2024.

Peteragain•7mo ago

Reinforcement Learning is basically sticks and carrots and the problem is credit assignment. Did I get hit with the stick because I said 5 plus 3 is 8? Or because I wrote my answers in green ink? Or... That used to be what RL was. S&B talk about "modern reinforcement learning" and introduce "Temporal Difference Learning", but imo the book is a bit of a rummage through GOFAI. Is the recent innovation with LLMs to perhaps use feedback to generate prompts? Talking about RL in this context does seem to be an attempt to freshen up interest. "Look! LLMs version 4.0! Now with added Science!"
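
For readers who haven't met the temporal difference learning mentioned above, here is a minimal sketch of tabular TD(0). It assumes a 5-state random walk, similar in spirit to the textbook's running example: the walker starts in the middle, steps left or right at random, and gets a reward of 1 only when it exits on the right. The function name, step size, and episode count are illustrative choices, not taken from the book or the article. The point is that each state's value is nudged toward "reward plus the value of the next state", which is one way of spreading credit backwards over time.

```python
import random

def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values for a 5-state random walk with tabular TD(0)."""
    rng = random.Random(seed)
    n_states = 5
    values = [0.5] * n_states  # initial guess for every non-terminal state
    for _ in range(episodes):
        state = n_states // 2  # each episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:            # fell off the left end: no reward
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n_states:  # fell off the right end: reward 1
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD error: how much better or worse things went than predicted
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error
            if done:
                break
            state = next_state
    return values

values = td0_random_walk()
print([round(v, 2) for v in values])
```

For this walk the true values are 1/6, 2/6, ..., 5/6, so the estimates should rise from left to right, with some noise left over from the constant step size.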

vonnik•7mo ago

Another RL explainer:

https://wiki.pathmind.com/deep-reinforcement-learning

lsorber•7mo ago

For those who want to dive deeper, here’s a 300 LOC implementation of GRPO in pure NumPy: https://github.com/superlinear-ai/microGRPO

The implementation learns to play Battleship in about 2000 steps, pretty neat!
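
As a rough sketch of the idea that gives GRPO its name (this is not code from the linked repository): several completions are sampled for the same prompt, each gets a scalar reward, and each completion's advantage is its reward normalized against the rest of its group. The helper name, the `eps` term, and the use of the population standard deviation are assumptions here; implementations differ on those details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions of the same prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat their group's average get a positive advantage and are made more likely; the rest are made less likely. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.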

jekwoooooe•7mo ago

I don’t think it’s useful to explain things that are fundamentally mathematical by leaving out the math and tech. It’s a good article, though.

chrisweekly•7mo ago

(caveat: I haven't yet read the article)

Huh? Your 2nd sentence seems to contradict your 1st. Or is the article somehow "good" without being "useful"?

jekwoooooe•7mo ago

It was a good read on the concept, but I’m left unsatisfied by the hand-waving over the details. Like how, physically, is the reinforcement actually saved? Is it a number in a file? What is the math behind the reward mechanism? What variables are changed and saved? What is the literal deliverable when you serve this to a client?
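
One toy way to make these questions concrete is a minimal sketch of the REINFORCE policy-gradient update on a hypothetical two-armed bandit. Everything here (the payout probabilities, learning rate, and function names) is an illustrative assumption, and a real LLM pipeline is vastly larger. The shape is the same, though: the reward scales a gradient step on the policy's parameters, and those updated parameters are what gets written to disk.

```python
import json
import math
import random

def softmax(logits):
    """Turn raw scores into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]    # the policy's only parameters
    win_prob = [0.2, 0.8]  # hypothetical environment: arm 1 pays out more often
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < win_prob[action] else 0.0
        # REINFORCE: d log pi(action) / d logit_k = 1[k == action] - probs[k]
        for k in range(2):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad
    return logits

logits = train_bandit()
saved = json.dumps({"logits": logits})  # the "deliverable": updated parameters
print(saved, softmax(logits))
```

In this sketch the reward itself is never stored; it only determines how far the two numbers in `logits` move at each step. After training, the policy should strongly prefer the arm that pays out more often.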

littlestymaar•7mo ago

> Huh? Your 2nd sentence seems to contradict your 1st. Or is the article somehow "good" without being "useful"?

The article isn't what the title says it is, so it's still good despite the title's claim being questionable.

jxjnskkzxxhx•7mo ago

I would encourage everyone to read Sutton and Barto directly. Best technical book I've read in the past year. Though if you're trying to minimize math, the first edition is significantly simpler.

ivanbelenky•7mo ago

https://github.com/ivanbelenky/RL One of the great pleasures in my life was implementing this book almost completely.

jxjnskkzxxhx•7mo ago

Pretty cool, thank you for sharing. How long did this take you?
