LLM Benchmarks: A Comprehensive Guide to AI Model Evaluation
| Model | General (MMLU) | Code (HumanEval) | Math (MATH) | Reasoning (GPQA) | Multilingual (MGSM) | Tool Use (BFCL) | Grade School Math (GSM8K) |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 88.3% | 92.0% | 71.1% | 59.4% | 91.6% | 90.2% | 96.4% |
| GPT-4o | 88.7% | 90.2% | 76.6% | 53.6% | 90.5% | 83.6% | 96. |