New AI tutor achieves 0.71-1.30 SD effect size in Dartmouth course [pdf]

https://intextbooks.science.uu.nl/workshop2026/files/itb26_s1s2.pdf

47•jonahbard•1h ago

Comments

boulos•1h ago

Do you have a larger study planned for the Fall? It definitely seems promising.

I'm curious how well you feel this worked because the subject was Statistics (objective grading) versus something more subjective like Civics or Literature.

PS - I'd say this qualifies for Show HN, too!

ilaksh•36m ago

They were using Sonnet 4.6 for some free-form responses, so that could be applied to something subjective.

albinahlback•55m ago

Very nicely typeset.

kubb•49m ago

Too bad the educational use case doesn't make any money. Good LLMs are a game changer for people motivated to learn.

TheLML•40m ago

I don't want to learn from something that hallucinates and changes its answers when I question what it taught me. I use it for conversations in a language I'm learning, but I quickly learned that asking it grammar questions, for example, is not a wise decision.

afro88•30m ago

Curious whether you were just bare asking it questions, or whether you provided it with lessons one by one with instruction that the lesson is the baseline truth etc

treis•24m ago

Are we talking about human teachers or LLMs here?

Robotbeat•39m ago

Wikipedia doesn’t make much money but is still helpful. LLMs don’t need to make a whole bunch of money to be helpful.

kubb•31m ago

People aren't paying trillions to train them to be helpful. They want to make quadrillions.

Rperry2174•45m ago

Honestly whether or not this was effective seems less important to me than the adoption numbers.

Textbook reading in this course was 10-15% at baseline ... but this AI thing got 90% voluntary, ungraded usage.

Even if it's worse per-hour than a textbook, you're now teaching 6x as many students _something_ instead of teaching a small minority everything.

So really it just becomes an optimization problem at that point because most students are at least in the funnel/in the running to learn something.

The paper kind of proves this itself ... they tweaked the quiz formats mid-semester and were able to iterate, which you can't do on a textbook that nobody opens in the first place.
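To put rough numbers on the adoption argument above: here is a minimal sketch, assuming total learning is simply usage rate times per-student learning (a simplification that is not taken from the paper). The function names are hypothetical; the 90% and 10-15% figures are the ones quoted in this comment.

```python
def relative_reach(tutor_usage_pct, textbook_usage_pct):
    """How many times more students the tutor reaches than the textbook."""
    return tutor_usage_pct / textbook_usage_pct


def break_even_efficacy(tutor_usage_pct, textbook_usage_pct):
    """Per-student effectiveness (relative to the textbook) at which the
    tutor yields the same total learning, under the assumed
    usage-times-efficacy model."""
    return textbook_usage_pct / tutor_usage_pct


if __name__ == "__main__":
    for baseline in (10, 15):
        reach = relative_reach(90, baseline)
        floor = break_even_efficacy(90, baseline)
        print(f"baseline {baseline}%: reach x{reach:.1f}, break-even {floor:.2f}")
```

Under that assumption, the tutor would only need to be roughly a sixth to a ninth as effective per student as the textbook to match it in aggregate.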

baq•33m ago

I'd argue the results are even better: just reading a textbook doesn't really teach you much. You have to do exercises, but they're expensive to create and grade. LLMs with a proper harness (see paper) tackle both.

rusbus•42m ago

This is exciting because the effect size is so large. But as the authors acknowledged, selection bias is nearly impossible to control for in this non-randomized study:

> and lacks randomized controls. Self-selection is the central threat: students who complete more quizzes may be more motivated or higher-performing generally

But this is still a strong result. I'm excited to see more in this space.
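For anyone unsure what an effect size "in SD" means: it is usually a standardized mean difference such as Cohen's d, i.e. the gap between group means divided by a pooled standard deviation. A minimal sketch follows, assuming that definition; the paper may use a different estimator, and the scores below are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treated, control):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Illustrative, invented exam scores (not data from the paper).
    heavy_users = [82, 88, 91, 79, 85]
    light_users = [74, 80, 77, 70, 83]
    print(round(cohens_d(heavy_users, light_users), 2))
```

Note that the formula itself does nothing about self-selection: if more motivated students end up in the treated group, d is inflated regardless of how it is computed.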

rahimnathwani•17m ago

They tried to control for this. It's described in the first paragraph of section 4.

constantius•40m ago

Interesting, congrats.

Are you planning on opening access to Phosphor?

baq•35m ago

I'm on record saying that a system like this with some extra hardware (i.e. a way for the LLM to have live understanding of the student's paper notebook or handout which are being written in with a plain old pencil) combines the best of both worlds - individual tutoring with approximately zero screen time which scales linearly with the number of students. The role of the teacher or professor then becomes a manager of the student - agentic tutor pairs, a referee when the student and model disagree, etc. and most importantly still being the human teacher you can just talk to in the human education process.

I'm convinced this is the future of education - models are there, we need the classroom tech to catch up. The alternative is obvious and quantified in the paper - students just use models to do their work for them and learn nothing.

terribleperson•10m ago

A 'smart pen' that records the student's writing in some way, maybe? My first thought was a tablet that boots straight into a writing software but students should not be subjected to any amount of latency in their writing.

Practically, I think if you want the AI system to have a live view of what the student's doing you're going to have to replace one of either the tablet or the writing instrument. A wearable camera could work as well but there are issues with that.

Buttons840•9m ago

I would add that somewhere in there should be a spaced repetition algorithm.

Spaced repetition is very effective, but it's really, really clunky to use. My unpopular opinion is that we all have Stockholm syndrome when it comes to creating "cards". People talk about how valuable creating cards is, but I think it sucks; it takes a lot of time.

If AI is already teaching me math (let's say), it would be nice to tell the AI/app "quiz me on this periodically", and then the AI makes up a fresh polynomial to factor (or whatever) and presents that to you according to a spaced repetition algorithm.

Behind the scenes, the AI should have access to what has happened the last several times a specific topic has been quizzed, so the AI can watch to see that certain mistakes are resolved, and the AI might also know better how to correct the user if it has context about previous quizzes of that topic.
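A rough sketch of what that could look like, assuming a deliberately simple rule (double the interval on a correct answer, reset to one day on a miss) rather than any particular published algorithm. All names here are hypothetical, and generating the fresh question itself is left to the tutor.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class TopicState:
    """Review state for one topic (hypothetical structure)."""
    interval_days: int = 1
    due: date = field(default_factory=date.today)
    history: list = field(default_factory=list)  # (date, correct, note)


def record_attempt(state, today, correct, note=""):
    """Log one quiz attempt and reschedule the topic.

    Assumed rule: a correct answer doubles the interval, a miss resets it.
    """
    state.history.append((today, correct, note))
    state.interval_days = state.interval_days * 2 if correct else 1
    state.due = today + timedelta(days=state.interval_days)
    return state


def due_topics(states, today):
    """Names of topics whose review date has arrived."""
    return [name for name, s in states.items() if s.due <= today]


def recent_mistakes(state, n=3):
    """Notes from wrong answers among the last n attempts, for tutor context."""
    return [note for _, correct, note in state.history[-n:] if not correct]


if __name__ == "__main__":
    start = date(2025, 1, 1)
    states = {"factoring polynomials": TopicState(due=start)}
    s = states["factoring polynomials"]
    record_attempt(s, start, correct=False, note="dropped a negative sign")
    print(due_topics(states, start + timedelta(days=1)))
    print(recent_mistakes(s))
```

The `history` list is the part that matters for the idea above: the tutor can be handed `recent_mistakes(...)` before it writes the next question, so it can check whether an earlier error has actually been resolved.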

ilaksh•30m ago

Shocking that a well executed AI tutor improves outcomes.

Hasn't computer assisted interactive learning already been proven for years? Why does there seem to be so much skepticism about enhancing it with AI?

Is this just something like, astoundingly slow adoption or poor execution? Being held back by paper textbook makers? Teachers unions dragging their feet?

How can interactive AI driven individually paced learning _not_ be obviously dramatically more effective?

dominotw•29m ago

It's like anything else: it benefits students who are already motivated to learn.

Very few are actually motivated to learn; most are just there to get a job, or it's just the next thing they have to do in life.

mmarian•27m ago

Conflicted about this study. On one hand, LLMs have been incredible for my personal learnings of new concepts.

On the other, I'm sceptical that it'll have "strong benefits" at scale; I'd be more in favor if the wording was "some"/"moderate". I reckon self-selection plays a huge part, as mentioned in the "Limitations" section of the paper.

I'd also caution against attaching the tool to grading. That means students have to put more effort into the course, which increases the chances that they will use LLMs to save time rather than make the investment.

MoneyBurning•16m ago

Curious how this holds up across different learning styles. SD effect sizes look impressive, but I'd want to see retention data at 30/90 days before drawing conclusions.

isomorphic_duck•13m ago

Why did you make a new account to spam AI comments?
