How it works:
- the player first sees a "warmup" question, which is really a prompt for a future game date; I collect everyone's answers and feed them to an LLM, which generates the answer buckets for that future question
- the player then moves on to the actual game, trying to guess the most common answers; every submission goes to an LLM, which judges how closely it matches the predefined buckets/answers
Some interesting learnings while building this:
- the LLMs do a pretty decent first pass (both in creating the answers and judging them), but the last 20% of the work is serious fine-tuning to avoid hallucinations, strange inconsistencies, etc
- the answer-creation LLM has a tough job: it has to take the responses and create workable buckets that (a) aren't too broad and (b) are distinct from each other, which is surprisingly challenging; it uses pairwise cosine similarity (how closely two embedding vectors point in the same direction) and Jaccard similarity (how much two sets overlap) to catch near-duplicate buckets (see the first sketch below); there's still lots of work to be done here, as I still see buckets that are too encompassing and share sizable overlap with other buckets
- the judging LLM applies answer normalization rules (e.g. plural -> singular, stripping special characters, handling typos via Levenshtein distance) and matching logic that combines cosine similarity with a check on whether the guess is a hypernym or hyponym of the bucket; we want guesses to be at least as specific as the bucket (guess == "truck", bucket == "vehicle" GOOD; guess == "vehicle", bucket == "truck" BAD); the second sketch below shows this
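To make the bucket-overlap check concrete, here's a minimal sketch of the pairwise comparison. The bucket shape, the toy embeddings, and the thresholds are all made up for illustration; the real pipeline would embed bucket labels/members with an actual embedding model and tune the cutoffs empirically:

```python
from itertools import combinations

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_sim(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two sets of raw responses (|A ∩ B| / |A ∪ B|)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical bucket shape: a label embedding plus the raw responses
# the LLM assigned to that bucket. Real embeddings are much higher-dimensional.
buckets = {
    "cars":     {"embedding": np.array([0.9, 0.1, 0.0]), "responses": {"car", "sedan", "truck"}},
    "vehicles": {"embedding": np.array([0.8, 0.2, 0.1]), "responses": {"car", "truck", "bus"}},
    "animals":  {"embedding": np.array([0.0, 0.1, 0.9]), "responses": {"dog", "cat"}},
}

COSINE_THRESHOLD = 0.85   # illustrative cutoffs, tuned empirically in practice
JACCARD_THRESHOLD = 0.5

# Flag pairs of buckets that overlap too much; flagged pairs get merged
# or sent back to the LLM for another pass.
for (name_a, a), (name_b, b) in combinations(buckets.items(), 2):
    cos = cosine_sim(a["embedding"], b["embedding"])
    jac = jaccard_sim(a["responses"], b["responses"])
    if cos > COSINE_THRESHOLD or jac > JACCARD_THRESHOLD:
        print(f"overlap: {name_a} vs {name_b} (cosine={cos:.2f}, jaccard={jac:.2f})")
```

Running this flags "cars" vs "vehicles" (high cosine similarity plus shared members) while leaving "animals" alone, which is exactly the kind of over-encompassing bucket pair mentioned above.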
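And a rough sketch of the judging side. The normalization rules are heavily simplified, the edit-distance cutoff is a guess, and WordNet stands in here for the LLM call that actually decides the hypernym/hyponym direction:

```python
import re

from nltk.corpus import wordnet as wn   # requires: pip install nltk; nltk.download('wordnet')

def normalize(answer: str) -> str:
    """Normalization before matching: lowercase, strip special chars, naive plural -> singular."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^a-z0-9 ]", "", answer)
    if answer.endswith("s") and not answer.endswith("ss"):
        answer = answer[:-1]            # crude singularization; real rules are messier
    return answer

def levenshtein(a: str, b: str) -> int:
    """Edit distance, used to forgive small typos (e.g. 'vehical')."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_typo_match(guess: str, bucket: str, max_dist: int = 2) -> bool:
    """Accept a guess whose normalized form is within a couple of edits of the bucket."""
    return levenshtein(normalize(guess), normalize(bucket)) <= max_dist

def is_more_specific(guess: str, bucket: str) -> bool:
    """True if the guess is a hyponym of the bucket (i.e. more specific).
    WordNet is an illustrative stand-in for the LLM's direction judgment."""
    bucket_synsets = set(wn.synsets(bucket))
    for g in wn.synsets(guess):
        for path in g.hypernym_paths():          # root -> ... -> g
            if bucket_synsets & set(path[:-1]):  # bucket appears above the guess
                return True
    return False

print(is_typo_match("Vehical!", "vehicles"))   # True: normalization + edit distance <= 2
print(is_more_specific("truck", "vehicle"))    # True  -> GOOD, guess is more specific
print(is_more_specific("vehicle", "truck"))    # False -> BAD, guess is more general
```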
Let me know if you have any questions or feedback!