that begs the question, how does the goose know it has understood?
that’s when I thought of an understanding bar - always available to the user to help visualize how much the goose understands you, 0 -> 100%.
the original logic powering the understanding bar went something like this: every turn, I’d send the convo to an llm and ask it to return a number from 0-100, with a rubric of brackets to make the output less volatile. 0-10 meant no real understanding. 11-20 meant named, but empty. 21-35 meant a partial understanding, and so on, up to 93-100 for the goose understanding your topic exceptionally. this approach worked. mostly. until I started looking at what came back once real users tested the goose.
two testers were explaining the basic way a cpu works. the first used a textbook-style definition (fetch, decode, execute, etc.) and got a final understanding score of 87 after a couple of turns. the second used a real-world example of a chef, linking it to the concepts of a cpu. same level of understanding, expressed differently. the second tester got a score of 36. I’d built the opposite of what I wanted: a tutor rewarding parroting.
digging into the data to find the source of the variance, I noticed that if I put the same paragraph in verbatim 5 times, I got 5 varying scores out: 51, 66, 51, 70, 51. the brackets kind of stabilized the results, but the score was unexplainable. why 66 and not 70? nothing in the system could tell me, the llm just picked.
the fix was to stop asking the model to do the math, and make a new system. now every session with a meaningful topic gets a ‘flight plan’. a separate llm call generates 3-4 essential subconcepts (waypoints) a real explanation must cover. eg for photosynthesis: what it uses, what it produces, why plants need it. each turn the goose’s evaluator returns discrete depth updates per waypoint (0-3: not addressed, named, stated, explained in own words), plus any misconceptions it spotted. javascript handles everything else: the ratchet that makes sure depth only moves up, weighted coverage, the gate to finish (wrap) a session, and the flow to repair a misconception.
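to make the split concrete, here is a minimal sketch of the code side: the ratchet, weighted coverage, and the wrap gate. this is not the goose’s actual source. the waypoint ids, the weights, the `minDepth` default, and the function names (`makePlan`, `applyUpdates`, `coverage`, `canWrap`) are all assumptions for illustration; only the 0-3 depth scale and the photosynthesis waypoints come from the description above.

```javascript
// Depth scale from the post: 0 = not addressed, 1 = named,
// 2 = stated, 3 = explained in own words.
const MAX_DEPTH = 3;

// Hypothetical flight plan for photosynthesis. Ids and weights are assumed.
function makePlan() {
  return [
    { id: "inputs", label: "what it uses", weight: 1, depth: 0 },
    { id: "outputs", label: "what it produces", weight: 1, depth: 0 },
    { id: "purpose", label: "why plants need it", weight: 2, depth: 0 },
  ];
}

// Ratchet: an evaluator update can raise a waypoint's depth, never lower it.
// `updates` maps waypoint id -> depth reported by the evaluator this turn.
function applyUpdates(plan, updates) {
  return plan.map((wp) => {
    const reported = updates[wp.id];
    if (!Number.isInteger(reported)) return wp; // missing or malformed: no change
    const clamped = Math.min(MAX_DEPTH, Math.max(0, reported));
    return { ...wp, depth: Math.max(wp.depth, clamped) };
  });
}

// Weighted coverage as a 0-100 value for the understanding bar.
function coverage(plan) {
  const total = plan.reduce((sum, wp) => sum + wp.weight * MAX_DEPTH, 0);
  const earned = plan.reduce((sum, wp) => sum + wp.weight * wp.depth, 0);
  return total === 0 ? 0 : Math.round((earned / total) * 100);
}

// Wrap gate (assumed rule): every waypoint at or above minDepth,
// and no misconceptions still open.
function canWrap(plan, openMisconceptions, minDepth = 2) {
  return (
    openMisconceptions.length === 0 &&
    plan.every((wp) => wp.depth >= minDepth)
  );
}
```

because the score is now just arithmetic over discrete values, the same evaluator output always produces the same bar, and any number can be traced back to which waypoint moved and by how much.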
what if the user introduces a subtopic the plan didn’t anticipate?
in that case, the system decides whether to amend the plan mid-session, with a backfill evaluation to credit prior turns. I also added 5 levels of intelligence to the goose (breezy to razor sharp). at every level the model still judges objective depth, then code decides what’s enough for that level. the same chef analogy now scores 87, because the evaluation prompt explicitly tells the llm that the waypoint’s ideal answer is just one valid framing, not the only one.
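a rough sketch of how those two pieces could look in code, under stated assumptions. the thresholds in `LEVEL_RULES` are invented (the post only names level 1 as breezy and level 5 as razor sharp), and `evaluateTurn` is a hypothetical stand-in for the llm evaluator call, kept synchronous here so the example runs on its own.

```javascript
// Assumed per-level rules: 1 = "breezy", 5 = "razor sharp".
// The numbers are illustrative, not the goose's real thresholds.
const LEVEL_RULES = {
  1: { minDepth: 1, minCoverage: 50 },
  2: { minDepth: 1, minCoverage: 60 },
  3: { minDepth: 2, minCoverage: 70 },
  4: { minDepth: 2, minCoverage: 85 },
  5: { minDepth: 3, minCoverage: 100 },
};

// The model reports depths; code alone decides whether that is enough.
function isEnough(level, depths, coveragePct) {
  const rule = LEVEL_RULES[level];
  if (!rule) throw new RangeError(`unknown level: ${level}`);
  return (
    coveragePct >= rule.minCoverage &&
    depths.every((d) => d >= rule.minDepth)
  );
}

// Amend + backfill: add a waypoint mid-session, replay earlier turns through
// the evaluator, and credit the highest depth any prior turn reached.
// `evaluateTurn(turn, waypoint)` is a hypothetical hook returning a depth 0-3.
function amendPlan(plan, newWaypoint, priorTurns, evaluateTurn) {
  let depth = 0;
  for (const turn of priorTurns) {
    const reported = evaluateTurn(turn, newWaypoint);
    if (Number.isInteger(reported)) {
      depth = Math.max(depth, Math.min(3, Math.max(0, reported)));
    }
  }
  return [...plan, { ...newWaypoint, depth }];
}
```

the point of the design is that difficulty lives in a lookup table rather than in the prompt, so changing how strict the goose is never changes what the model is asked to judge.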
to validate these changes, I sat down and acted as 15 different types of users, typing differently, explaining differently, etc., then made changes based on the responses and iterated. one little bug I found was the llm evaluator giving credit to the wrong actor: the goose would teach via analogy and the student would get credit for it. fixed that too.
lesson worth keeping: if you build anything where an llm needs to rate or rank by number, don’t trust the number. give it something discrete to judge, not something subjective, otherwise it will fake and hallucinate.
professor goose is live if you want to try it!