Built a quick experiment to see how AI models differ in their judgment of writing quality. I fed the same Medium article titles to GPT-4o-mini and GPT-4o, had each model rank them, and compared the two rankings.
The interesting bit isn't just the rankings themselves but how the models diverge in their evaluation criteria: the "mini" model seems to have subtly different preferences despite being from the same family. The code is included (Python scraper + API calls), along with full logs showing each model's ranking rationale.
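One way to put a number on that divergence, as a minimal sketch and not necessarily what the included code does: treat each model's output as an ordered list of the same titles and compute the Spearman rank correlation between them (1.0 means identical order, -1.0 means fully reversed). The function names and the placeholder titles below are hypothetical, and the sketch assumes both models ranked exactly the same set of titles with no ties.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items.

    Assumes no ties and that both lists contain exactly the same titles.
    """
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same unique items")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    # Position of each title in the second ranking.
    pos_b = {title: i for i, title in enumerate(ranking_b)}
    # Sum of squared position differences across all titles.
    d2 = sum((i - pos_b[title]) ** 2 for i, title in enumerate(ranking_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def biggest_disagreements(ranking_a, ranking_b, top=3):
    """Titles whose position shifts most between the two rankings.

    Returns (title, shift) pairs; a positive shift means the title sits
    lower in ranking_b than in ranking_a.
    """
    pos_b = {title: i for i, title in enumerate(ranking_b)}
    shifts = [(title, pos_b[title] - i) for i, title in enumerate(ranking_a)]
    return sorted(shifts, key=lambda s: abs(s[1]), reverse=True)[:top]


if __name__ == "__main__":
    # Placeholder titles, not real data from the experiment.
    full_model = ["Title A", "Title B", "Title C", "Title D"]
    mini_model = ["Title B", "Title A", "Title C", "Title D"]
    print(spearman_rho(full_model, mini_model))
    print(biggest_disagreements(full_model, mini_model, top=2))
```

A single correlation number says how much the two models disagree overall, while the per-title shifts point at which titles to look up in the rationale logs.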
Started this as a voice-dictated script on a cold walk home. Sometimes the best experiments come from random "what if" thoughts.
hadiai•2h ago