Show HN: I fixed my AI goose tutor to stop punishing understanding

https://professorgoose.com/?version=2.0

3•zapseo•48m ago

a few weeks ago, I built professor goose, a socratic ai tutor built around the feynman idea that if you can’t explain it, you don’t understand it. the way it works: you pick a topic, rubber duck at a 3d goose, and instead of answering, it asks follow-up questions until it understands.

that raises the question: how does the goose know it has understood?

that’s when I thought of an understanding bar - always available to the user to help visualize how much the goose understands you, 0 -> 100%.

the original logic powering the understanding bar went something like this: every turn, I’d send the convo to an llm and ask it to return a number 0-100, with a rubric of brackets to make the output less volatile. 0-10 meant no real understanding. 11-20 meant named, but empty. 21-35 meant a partial understanding, and so on, up to 93-100 for the goose understanding your topic exceptionally well. this approach worked. mostly. until I started looking at what came back once real users tested the goose.

two testers were explaining the basic way a cpu works. the first used a textbook style definition (fetch, decode, execute, etc.) and got a final understanding of 87% after a couple of turns. the second used a real world example of a chef, linking it to the concepts of a cpu. same level of understanding, expressed differently. the second tester got a score of 36. I’d built the opposite of what I wanted: a tutor rewarding parroting.

digging into the data to find the source of the variance, I noticed that if I put the same paragraph in verbatim 5 times, I got 5 varying scores out: 51, 66, 51, 70, 51. the brackets kind of stabilized the results, but the score was unexplainable. why 66 and not 70? nothing in the system could tell me, the llm just picked.
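to put a number on that drift, here is a tiny sketch that summarizes the five scores above. the `spread` helper is just an illustration, not part of the goose's code.

```javascript
// Scores reported in the post for the same paragraph, scored 5 times.
const scores = [51, 66, 51, 70, 51];

// Summarize how far repeated runs drift from each other.
function spread(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return { min, max, range: max - min, mean };
}

console.log(spread(scores));
```

that is a 19 point swing on identical input, with three of the five runs landing on 51.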

the fix was to stop asking the model to be the math, and make a new system. now every session gets a ‘flight plan’ when the session has a meaningful topic. a separate llm call generates 3-4 essential subconcepts (waypoints) a real explanation must cover. eg for photosynthesis: what it uses, what it produces, why plants need it. each turn the goose’s evaluator returns discrete depth updates per waypoint (0-3: not addressed, named, stated, explained in own words), plus any misconceptions it spotted. javascript handles the rest: a ratchet so depth only moves up, weighted coverage, the gate to finish (wrap) a session, and the flow to repair a misconception.
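a rough sketch of the ratchet, weighted coverage, and wrap gate in javascript. this is not the actual goose code: the waypoint ids, the weights, the coverage formula, and the threshold of 80 are all assumptions made up for illustration.

```javascript
// Depth scale from the post: 0 not addressed, 1 named, 2 stated, 3 explained in own words.
const MAX_DEPTH = 3;

// Hypothetical flight plan for photosynthesis; the ids and weights are invented.
function makePlan() {
  return [
    { id: "inputs", label: "what it uses", weight: 1, depth: 0 },
    { id: "outputs", label: "what it produces", weight: 1, depth: 0 },
    { id: "purpose", label: "why plants need it", weight: 2, depth: 0 },
  ];
}

// Ratchet: an evaluator update can raise a waypoint's depth but never lower it.
function applyUpdates(plan, updates) {
  return plan.map((wp) => {
    const proposed = updates[wp.id];
    if (!Number.isInteger(proposed)) return wp; // no valid update this turn
    const clamped = Math.min(MAX_DEPTH, Math.max(0, proposed));
    return { ...wp, depth: Math.max(wp.depth, clamped) };
  });
}

// Weighted coverage as a 0-100 number computed by code, not by the model.
function coverage(plan) {
  const total = plan.reduce((sum, wp) => sum + wp.weight, 0);
  const earned = plan.reduce((sum, wp) => sum + wp.weight * (wp.depth / MAX_DEPTH), 0);
  return Math.round((earned / total) * 100);
}

// Wrap gate (assumed rule): enough coverage and no unresolved misconceptions.
function canWrap(plan, openMisconceptions, threshold = 80) {
  return openMisconceptions.length === 0 && coverage(plan) >= threshold;
}

let plan = makePlan();
plan = applyUpdates(plan, { inputs: 2, purpose: 3 });
plan = applyUpdates(plan, { inputs: 1, outputs: 3 }); // inputs stays at 2
console.log(coverage(plan), canWrap(plan, [])); // 92 true
```

the point of the design is that every number on the bar traces back to a discrete per-waypoint judgment, so "why 92?" has an answer you can read off the plan.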

what if the user introduces a subtopic the plan didn’t anticipate?

in that case, the system decides whether to amend the plan mid session, with a backfill evaluation to credit prior turns. I also added 5 levels of intelligence to the goose (breezy to razor sharp). at every level the model only judges objective depth, then code decides what’s enough. the same chef analogy now scores 87, because the evaluation prompt explicitly tells the llm that the waypoint’s ideal answer is just one valid framing, not the only one.
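a minimal sketch of the amend plus backfill step. `amendPlan` and `evaluateTurn` are hypothetical names, and `evaluateTurn` stands in for the llm evaluator (a real call would be async). the toy evaluator here is only a keyword match so the example runs on its own.

```javascript
// Add a waypoint mid session, then replay earlier student turns so prior
// explanations still earn credit. `evaluateTurn` is a hypothetical stand-in
// for the LLM evaluator: it takes (turnText, waypoint) and returns a depth 0-3.
function amendPlan(plan, newWaypoint, priorTurns, evaluateTurn) {
  let depth = 0;
  for (const text of priorTurns) {
    // Same ratchet as live turns: keep the best depth seen so far.
    depth = Math.max(depth, evaluateTurn(text, newWaypoint));
  }
  return [...plan, { ...newWaypoint, depth }];
}

// Toy evaluator for the demo only: a keyword match counts as "named" (depth 1).
const toyEvaluator = (text, wp) => (text.toLowerCase().includes(wp.keyword) ? 1 : 0);

const priorTurns = [
  "plants take in light and water",
  "chlorophyll is what absorbs the light",
];
const amended = amendPlan(
  [{ id: "inputs", weight: 1, depth: 2 }],
  { id: "chlorophyll", keyword: "chlorophyll", weight: 1 },
  priorTurns,
  toyEvaluator
);
console.log(amended);
```

the new waypoint starts at whatever depth the earlier turns already earned, so the student is not penalized for explaining something before the plan knew to look for it.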

to validate these changes, I sat down and acted as 15 different types of users (typing differently, explaining differently, etc.), then made changes based on the responses and iterated. a little bug I found was the llm evaluator giving credit to the wrong actor: the goose would teach via analogy and the student would get credit for it. fixed that too.
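one way to guard against that kind of misattribution, sketched under assumptions: the post doesn't say how the fix was done, and the transcript shape and the `creditable` flag below are hypothetical. the idea is to label every turn with its speaker before it reaches the evaluator, so only student turns can earn depth.

```javascript
// Hypothetical transcript shape: each turn records who said it.
const transcript = [
  { speaker: "student", text: "a cpu runs instructions one at a time" },
  { speaker: "goose", text: "so is it like a chef following a recipe step by step?" },
  { speaker: "student", text: "yeah" },
];

// Only student turns count as evidence. Goose turns are kept as context so
// the evaluator can follow the conversation, but they are marked non-creditable.
function buildEvaluatorInput(turns) {
  return turns.map((t, i) => ({
    index: i,
    speaker: t.speaker,
    text: t.text,
    creditable: t.speaker === "student",
  }));
}

const evaluatorInput = buildEvaluatorInput(transcript);
console.log(evaluatorInput.filter((t) => t.creditable).length); // 2
```

here the chef analogy came from the goose, so it stays in the context but can't raise the student's score on its own.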

lesson worth keeping: if you build anything where an llm needs to rate or rank by number, don’t trust the number. give it something discrete to judge, not something subjective, otherwise it will fake and hallucinate.

professor goose is live if you want to try it!

Comments

anitroves•34m ago

this professor goose of yours is lagging too much and there is no keyboard writing area, you have to do voice chat which is not comfortable for many people

zapseo•28m ago

good feedback, thanks for trying it. I’m working on optimization but with a 3d rendered goose it’s tough going. there is a voice/text toggle just above the mic, just tap “text” and you’ll switch. also available in settings. just made a UI tweak to make the toggle more obvious because clearly it wasn’t.
