Phare focuses on factual reliability, prompt sensitivity, multilingual support, and how models handle false premises like issues that actually matter when you're building serious applications.
Some insights:
- Preference scores ≠ factual correctness.
- Framing effects can cause models to miss obvious falsehoods.
- Safety metrics like sycophancy and stereotype reproduction show surprising results across popular models.
Would love feedback from the community.