LLM Comparison Matrix

Ratings are directional, not absolute. Use the 1-5 scores for quick comparison, then sort by the dimensions that matter most for your product.

| Model | Overall | Reasoning | Coding | Cost Efficiency | Latency | Context Quality | Deployment Control |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 4.0/5 ★★★★☆ | 5/5 ★★★★★ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 2/5 ★★☆☆☆ |
| o3-mini | 3.8/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Claude 3.7 Sonnet | 3.8/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ | 2/5 ★★☆☆☆ |
| Claude 3.5 Haiku | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Gemini 2.0 Pro | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 2/5 ★★☆☆☆ |
| Gemini 2.0 Flash | 3.3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Llama 3.1 70B Instruct | 4.0/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ |
| Mixtral 8x22B | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| Mistral Large | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ |
| Qwen2.5 72B Instruct | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| DeepSeek V3 | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| Phi-3 Medium | 3.5/5 ★★★★☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ | 5/5 ★★★★★ | 2/5 ★★☆☆☆ | 4/5 ★★★★☆ |

Score key: 5 = excellent, 4 = strong, 3 = moderate, 2 = low, 1 = poor.

Interpreting the Matrix

For customer-facing quality, prioritize reasoning and context quality. For internal automation at scale, prioritize cost efficiency and latency.
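
The prioritization above can be sketched as a weighted re-ranking of the matrix. This is a minimal illustration, not part of any model's documentation: the per-model scores are copied from the table, but the weight vectors are hypothetical examples of how you might encode "prioritize reasoning and context quality" versus "prioritize cost efficiency and latency" — tune them to your own product.

```python
# Rank models by a custom weighting of the matrix sub-scores.
# Scores copied from the table above; column order is:
# (reasoning, coding, cost efficiency, latency, context quality, deployment control)
SCORES = {
    "GPT-4.1":                (5, 5, 3, 3, 4, 2),
    "o3-mini":                (5, 4, 4, 4, 3, 2),
    "Claude 3.7 Sonnet":      (5, 4, 3, 3, 5, 2),
    "Claude 3.5 Haiku":       (4, 3, 4, 5, 3, 2),
    "Gemini 2.0 Pro":         (4, 4, 3, 4, 4, 2),
    "Gemini 2.0 Flash":       (3, 3, 4, 5, 3, 2),
    "Llama 3.1 70B Instruct": (4, 4, 4, 4, 3, 5),
    "Mixtral 8x22B":          (4, 4, 4, 4, 3, 4),
    "Mistral Large":          (4, 4, 3, 4, 4, 3),
    "Qwen2.5 72B Instruct":   (4, 4, 5, 4, 3, 4),
    "DeepSeek V3":            (4, 4, 5, 4, 3, 4),
    "Phi-3 Medium":           (3, 3, 5, 5, 2, 4),
}

def rank(weights):
    """Return model names sorted by weighted score, best first."""
    return sorted(
        SCORES,
        key=lambda m: sum(w * s for w, s in zip(weights, SCORES[m])),
        reverse=True,
    )

# Illustrative weight vectors (same column order as SCORES):
# customer-facing quality -> emphasize reasoning and context quality
customer_facing = rank((3, 1, 1, 1, 3, 1))
# internal automation at scale -> emphasize cost efficiency and latency
automation = rank((1, 1, 3, 3, 1, 1))
```

Under these example weights, the customer-facing ranking puts Claude 3.7 Sonnet on top, while the automation ranking surfaces the cost-efficient open models; shifting the weights shifts the leaders, which is the point of keeping the sub-scores separate.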