LLM Comparison Matrix
Ratings are directional, not absolute. Use the 1-5 scores for a quick side-by-side
comparison, then sort by the dimension that matters most for your product (a small
sorting sketch follows the table).
| Model | Overall | Reasoning | Coding | Cost Efficiency | Latency | Context Quality | Deployment Control |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 4.0/5 ★★★★☆ | 5/5 ★★★★★ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 2/5 ★★☆☆☆ |
| o3-mini | 3.8/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Claude 3.7 Sonnet | 3.8/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ | 2/5 ★★☆☆☆ |
| Claude 3.5 Haiku | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Gemini 2.0 Pro | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 2/5 ★★☆☆☆ |
| Gemini 2.0 Flash | 3.3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Llama 3.1 70B Instruct | 4.0/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ |
| Mixtral 8x22B | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| Mistral Large | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ |
| Qwen2.5 72B Instruct | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| DeepSeek V3 | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| Phi-3 Medium | 3.5/5 ★★★★☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ | 5/5 ★★★★★ | 2/5 ★★☆☆☆ | 4/5 ★★★★☆ |
Score key: 5 = excellent, 4 = strong, 3 = medium, 2 = low.
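
To make the "sort by what matters" step concrete, here is a minimal sketch that treats the matrix above as plain Python data and orders models by a single column. The scores are copied from the table; the column keys and the `cost_efficiency` example are assumptions about how you might encode it, not part of the table itself.

```python
# The matrix above as plain Python data, sorted by one column.

MATRIX = [
    # model, overall, reasoning, coding, cost, latency, context, deployment control
    ("GPT-4.1",                 4.0, 5, 5, 3, 3, 4, 2),
    ("o3-mini",                 3.8, 5, 4, 4, 4, 3, 2),
    ("Claude 3.7 Sonnet",       3.8, 5, 4, 3, 3, 5, 2),
    ("Claude 3.5 Haiku",        3.5, 4, 3, 4, 5, 3, 2),
    ("Gemini 2.0 Pro",          3.5, 4, 4, 3, 4, 4, 2),
    ("Gemini 2.0 Flash",        3.3, 3, 3, 4, 5, 3, 2),
    ("Llama 3.1 70B Instruct",  4.0, 4, 4, 4, 4, 3, 5),
    ("Mixtral 8x22B",           3.8, 4, 4, 4, 4, 3, 4),
    ("Mistral Large",           3.5, 4, 4, 3, 4, 4, 3),
    ("Qwen2.5 72B Instruct",    3.8, 4, 4, 5, 4, 3, 4),
    ("DeepSeek V3",             3.8, 4, 4, 5, 4, 3, 4),
    ("Phi-3 Medium",            3.5, 3, 3, 5, 5, 2, 4),
]

# Assumed column keys, in the same order as the table.
COLUMNS = ["overall", "reasoning", "coding", "cost_efficiency",
           "latency", "context_quality", "deployment_control"]

def sort_by(column: str):
    """Return rows sorted best-first by a single column."""
    idx = COLUMNS.index(column) + 1  # +1 skips the model name
    return sorted(MATRIX, key=lambda row: row[idx], reverse=True)

if __name__ == "__main__":
    idx = COLUMNS.index("cost_efficiency") + 1
    for row in sort_by("cost_efficiency"):
        print(f"{row[0]:<24} {row[idx]}/5")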
Interpreting the Matrix
For customer-facing quality, prioritize reasoning + context quality.
For internal automation at scale, prioritize cost efficiency +
latency.
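
One way to encode those two priorities is a weighted score over the matrix rows. The sketch below is illustrative only: the subset of models and their scores are copied from the table, but the weight profiles are assumptions, not recommendations.

```python
# Profile-weighted ranking over a subset of the matrix above.
# Weight values are illustrative assumptions, not recommendations.

SCORES = {
    "GPT-4.1":                {"reasoning": 5, "coding": 5, "cost_efficiency": 3,
                               "latency": 3, "context_quality": 4},
    "Claude 3.7 Sonnet":      {"reasoning": 5, "coding": 4, "cost_efficiency": 3,
                               "latency": 3, "context_quality": 5},
    "Gemini 2.0 Flash":       {"reasoning": 3, "coding": 3, "cost_efficiency": 4,
                               "latency": 5, "context_quality": 3},
    "Llama 3.1 70B Instruct": {"reasoning": 4, "coding": 4, "cost_efficiency": 4,
                               "latency": 4, "context_quality": 3},
    "Phi-3 Medium":           {"reasoning": 3, "coding": 3, "cost_efficiency": 5,
                               "latency": 5, "context_quality": 2},
}

# Assumed weight profiles matching the two priorities named above.
PROFILES = {
    "customer_facing":     {"reasoning": 0.4, "context_quality": 0.4, "latency": 0.2},
    "internal_automation": {"cost_efficiency": 0.4, "latency": 0.4, "coding": 0.2},
}

def rank(profile: str):
    """Rank models best-first by the weighted sum of the profile's dimensions."""
    weights = PROFILES[profile]
    scored = {
        model: sum(scores[dim] * w for dim, w in weights.items())
        for model, scores in SCORES.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for name, value in rank("customer_facing"):
        print(f"{name:<24} {value:.1f}")
```

Swapping in your own weights (or adding dimensions such as deployment control) changes the ordering, which is the point: the matrix is a starting grid, not a verdict.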