Just because the model mentions gender doesn't mean the decision was made because of gender rather than taarof. This is the classic mistake of personifying LLMs: you can't trust what the LLM says it's thinking as a description of what is actually happening. It's not actually an entity talking.
I'm surprised the human benchmark is that low. The canonical example of taarof, one I've seen elsewhere, is a taxi driver insisting that a ride is free while expecting to get paid. Taarof in this case is load-bearing for the transaction. I presume humans only get the edge cases wrong.
As an aside, there are elements of this sort of thing in Bay Area tech culture too. Something that drives me nuts is someone writing on a code review "you may want to consider using the X data structure here" when they mean "I will not merge this code until you use X". I can only imagine taarof irks more literal-minded Persian speakers for the same reason.