Author here. I've always been fascinated by the Sorites Paradox (at what point do a few grains of sand become a heap?), so I decided to run an experiment to see how different LLMs handle vague predicates.
I didn't just want a text answer, so I measured the probabilities (from the output logits) of the "Yes" and "No" tokens across pile sizes ranging from 1 to 100M grains.
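For anyone who wants to poke at it, the core measurement is roughly the sketch below (a minimal version using transformers; the model name, prompt wording, and few-shot examples are placeholders rather than my exact setup):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.float16, device_map="auto"
    )

    # Placeholder few-shot prefix; the examples used in the post differ.
    FEW_SHOT = ("Answer with Yes or No.\n"
                "Q: Is 2 grains of sand a heap? A: No\n"
                "Q: Is 500000 grains of sand a heap? A: Yes\n")

    def p_heap(n_grains: int) -> float:
        prompt = FEW_SHOT + f"Q: Is {n_grains} grains of sand a heap? A:"
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]  # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        # Note the leading space: " Yes" / " No" are typically single tokens here.
        yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
        no_id = tok(" No", add_special_tokens=False).input_ids[0]
        p_yes, p_no = probs[yes_id], probs[no_id]
        return (p_yes / (p_yes + p_no)).item()  # renormalise over the two answers

    for n in (1, 10, 1_000, 100_000, 10_000_000, 100_000_000):
        print(n, round(p_heap(n), 3))

Renormalising over just the two answer tokens is a design choice: it keeps the curves comparable across models that spread probability mass over other continuations.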
Key takeaways:
1. Prompting "Is this a heap?" directly is useless (the model just agrees with your framing).
2. Few-shot prompting creates a fascinating sigmoid "heapness curve" for most models (Mistral, DeepSeek).
3. Llama-3-8B was the outlier—it remained perpetually uncertain (probs ~0.35-0.55) across almost the entire range. I argue this is actually the most "philosophically honest" reflection of how humans use the word.
I have a feeling there is an optimal prompt for this type of experiment, but I struggle to find it, or even to know whether I've already found it. The charts in the post are rendered in-browser using the data points I collected. Curious to hear your thoughts :)