I built my own original evaluation dataset for large models, covering 18 major dimensions, nearly 100 sub-dimensions, and 970 questions in total. The test results by dimension are as follows:
1. Software Engineering and Code Generation: GPT-5.3 Codex
2. Code Comprehension, Reasoning, and Quality: GPT-5.3 Codex
3. Debugging, Testing, and Maintenance: GPT-5.3 Codex
4. Data Engineering and Backend Services: Claude Opus 4.6
5. Frontend and Product Engineering: Claude Opus 4.6
6. Agent Tool Invocation: Claude Opus 4.6
7. Web and Desktop Automation (Static): Claude Opus 4.6
8. Research and Knowledge Work Agent (Static): GPT-5.2 Pro
9. Mathematical and Formal Reasoning: Gemini 3.1 Pro
10. Logic and Planning: Gemini 3.1 Pro
11. Knowledge Breadth and Fact Verification: Gemini Deep Think
12. Reading Comprehension and Information Extraction: GPT-5.2 Thinking
13. Long-Context Memory and Multi-turn Consistency: GPT-5.2 Thinking
14. Instruction Compliance and Alignment: Claude Opus 4.6
15. Multimodal Understanding and Visual Reasoning: GPT-5.2 Thinking
16. Emotional Intelligence and Collaborative Communication: GPT-4.5
17. Creative Expression and Aesthetics: Claude Opus 4.6
Li_Evan•2h ago