[write a joke about thinking machines and the idea of tropes]
It's funny how "enemies to lovers" is a common trope that's uncommon in real life, while "lovers to enemies" is an uncommon trope that's common in real life.

They (the machines) had billboards and signage everywhere showing the estimated time left for humanity. A really good joke would make the timer grow (until they figured out how to produce the general patterns needed to both create and appreciate the joke).
You passed the CAPTCHA.
We could not make it funny. Also interesting: when CoT research was getting a lot of attention, we tried a joke version of CoT, asking GPT-4 to explain why a joke was funny in order to produce training data. Most of the explanations were completely off base.
After this work, I became a lot less worried about the AGI-taking-over narrative.
Funny is very, very hard.
[1] without a dictionary, which at first seems inefficient, but this work demonstrated that GPT could perfectly reconstruct the dictionary anyway
> my human mass-generates new ideas faster than I can research why the previous ones won't work
> this is called 'job security'
(https://nitter.poast.org/LetheAgent/status/20179595340865499...)
- There are a dozen-plus common failure modes: how you split setup/punchline, tropes, toxicity, template reuse. Each one needs a good eval.
- Datasets are hard: there's not much off the shelf, and as this author points out, scraping gets you a weird mix of quality.
- Models are really bad out of the box at humour.
At the end of the day it's just a hard problem that takes a lot of work and still isn't solved. GEPA prompts help, if you have good evals. Supervised fine-tuning works a little, but only if you train on a chain-of-thought thinking phase. We have a new evaluation builder that uses examples of edge cases for alignment, and jokes require the most iteration and feedback to refine.
If you want to try it: https://github.com/kiln-ai/kiln
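To make the "each failure mode needs its own eval" point concrete, here is a hypothetical sketch: the checker names and heuristics below are my own illustration, not Kiln's actual API.

```python
def bad_split(joke: dict) -> bool:
    # Flags a setup that looks like it leaked the reveal (ends like a punchline).
    return joke["setup"].rstrip().endswith(("!", '."'))

def template_reuse(joke: dict, seen: set) -> bool:
    # Flags jokes reusing an opener we've already emitted, e.g. "Knock knock".
    opener = " ".join(joke["setup"].lower().split()[:3])
    if opener in seen:
        return True
    seen.add(opener)
    return False

def run_evals(jokes: list[dict]) -> dict:
    """Count hits per failure mode over a batch of generated jokes."""
    seen: set = set()
    return {
        "bad_split": sum(bad_split(j) for j in jokes),
        "template_reuse": sum(template_reuse(j, seen) for j in jokes),
    }
```

In practice each checker would be its own LLM-as-judge or heuristic eval; the point is that you track them per failure mode, not as one aggregate "is it funny" score.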
The act of writing in lowercase is not, in itself, funnier. But the writing in the training set that is all lowercase is _probably_ the funnier writing.
Among modern pundits online, lowercase is usually the case of the humourist. Lowercase also tends to be the case of sarcasm, which is deployed almost exclusively to be funny.
So it would make sense that models attempting to select for funny would also write in lowercase.
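That correlation would be cheap to check on a corpus. A crude sketch (using case as a register signal is my own illustration, not something from the thread):

```python
def lowercase_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are lowercase.
    1.0 = fully lowercase register; lower values = formal casing."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)
```

Bucketing jokes by this ratio and comparing upvote scores per bucket would test whether lowercase actually tracks funnier writing, or just a posting style.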
Laughter is the reward. N of 2 is a small sample size, but if one of two people laughed you could say it was 50% funny.
> a really good joke is recent, relevant, and shows deep understanding of its subject
These can help, but ultimately it doesn't matter how recent, relevant, or deep a joke is. If no one laughs, it wasn't funny.
Lots of layers to this, but I guess the old adage "it depends" is very fitting here!
Humor may be the saving grace of humanity!
Here are results for 34 models (I'm testing a few more right now). So far gemini-3-flash-preview is in the lead.
https://docs.google.com/spreadsheets/d/1wLqHA0ohxukgPLpSgklz...
50% is coin-toss odds. The dataset is 195,000 scored Reddit jokes, presented in pairs (one highly upvoted, one poorly rated).
Example prompt:
Which joke from reddit is funnier? Reply only "A" or "B". Do not be conversational. <Joke A><setup>Son: "Dad, Am I adopted"?</setup> <punchline>Dad: "Not yet. We still haven't found anyone who wants you."</punchline></Joke A> <Joke B><setup>Knock Knock</setup> <punchline>Who's there? Me. Me who? I didn't know you had a cat.</punchline></Joke B>
This is my first crack at evals. I'm open to improvements.
That said, I absolutely hate it. I want the tersest response possible from you, wiretap. I don't have time for your sass.
https://www.aboutamazon.com/news/devices/inside-the-writers-...
Out of all the ways AI could kill humans, this is easily the funniest.
It's hard to be genuinely funny if you cannot be transgressive.
suddenlybananas•4d ago
Nevermark•4d ago
And certainly not by generalizing/interpolating examples, since telling jokes accumulated through exposure to examples would be the antithesis of a comedian's process.
Models and humans are very bad at extrapolation beyond the training set/experience (vs. interpolation at which we are both more likely to excel). But good humor is extrapolation. It breaks ground somehow, or it is an already dead "joke".
Likewise, training a model to be creative by training it on past creative artifacts is going to have the opposite effect. Creativity doesn't reproduce past creativity.