LLM Comparison Matrix
Ratings are directional, not absolute. Use the 1-5 scores for a quick side-by-side
comparison, then sort by the dimension that matters most for your product (a small
sorting sketch follows the table).
| Model | Overall | Reasoning | Coding | Cost Efficiency | Latency | Context Quality | Deployment Control |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 4.0/5 ★★★★☆ | 5/5 ★★★★★ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 2/5 ★★☆☆☆ |
| o3-mini | 3.8/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Claude 3.7 Sonnet | 3.8/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ | 2/5 ★★☆☆☆ |
| Claude 3.5 Haiku | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Gemini 2.0 Pro | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 2/5 ★★☆☆☆ |
| Gemini 2.0 Flash | 3.3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 3/5 ★★★☆☆ | 2/5 ★★☆☆☆ |
| Llama 3.1 70B Instruct | 4.0/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ |
| Mixtral 8x22B | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| Mistral Large | 3.5/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ |
| Qwen2.5 72B Instruct | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| DeepSeek V3 | 3.8/5 ★★★★☆ | 4/5 ★★★★☆ | 4/5 ★★★★☆ | 5/5 ★★★★★ | 4/5 ★★★★☆ | 3/5 ★★★☆☆ | 4/5 ★★★★☆ |
| Phi-3 Medium | 3.5/5 ★★★★☆ | 3/5 ★★★☆☆ | 3/5 ★★★☆☆ | 5/5 ★★★★★ | 5/5 ★★★★★ | 2/5 ★★☆☆☆ | 4/5 ★★★★☆ |
Score key: 5 = excellent, 4 = strong, 3 = medium, 2 = low.
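
To make the "sort by what matters" step concrete, here is a minimal sketch that treats the matrix above as plain Python data and orders models by a single column. The scores are copied from the table; the column keys and the `cost_efficiency` example are assumptions about how you might encode it, not part of the table itself.

```python
# The matrix above as plain Python data, sorted by one column.

MATRIX = [
    # model, overall, reasoning, coding, cost, latency, context, deployment control
    ("GPT-4.1",                 4.0, 5, 5, 3, 3, 4, 2),
    ("o3-mini",                 3.8, 5, 4, 4, 4, 3, 2),
    ("Claude 3.7 Sonnet",       3.8, 5, 4, 3, 3, 5, 2),
    ("Claude 3.5 Haiku",        3.5, 4, 3, 4, 5, 3, 2),
    ("Gemini 2.0 Pro",          3.5, 4, 4, 3, 4, 4, 2),
    ("Gemini 2.0 Flash",        3.3, 3, 3, 4, 5, 3, 2),
    ("Llama 3.1 70B Instruct",  4.0, 4, 4, 4, 4, 3, 5),
    ("Mixtral 8x22B",           3.8, 4, 4, 4, 4, 3, 4),
    ("Mistral Large",           3.5, 4, 4, 3, 4, 4, 3),
    ("Qwen2.5 72B Instruct",    3.8, 4, 4, 5, 4, 3, 4),
    ("DeepSeek V3",             3.8, 4, 4, 5, 4, 3, 4),
    ("Phi-3 Medium",            3.5, 3, 3, 5, 5, 2, 4),
]

# Assumed column keys, in the same order as the table.
COLUMNS = ["overall", "reasoning", "coding", "cost_efficiency",
           "latency", "context_quality", "deployment_control"]

def sort_by(column: str):
    """Return rows sorted best-first by a single column."""
    idx = COLUMNS.index(column) + 1  # +1 skips the model name
    return sorted(MATRIX, key=lambda row: row[idx], reverse=True)

if __name__ == "__main__":
    idx = COLUMNS.index("cost_efficiency") + 1
    for row in sort_by("cost_efficiency"):
        print(f"{row[0]:<24} {row[idx]}/5")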
Interpreting the Matrix
For customer-facing quality, prioritize reasoning + context quality.
For internal automation at scale, prioritize cost efficiency +
latency.
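
One way to encode those two priorities is a weighted score over the matrix rows. The sketch below is illustrative only: the subset of models and their scores are copied from the table, but the weight profiles are assumptions, not recommendations.

```python
# Profile-weighted ranking over a subset of the matrix above.
# Weight values are illustrative assumptions, not recommendations.

SCORES = {
    "GPT-4.1":                {"reasoning": 5, "coding": 5, "cost_efficiency": 3,
                               "latency": 3, "context_quality": 4},
    "Claude 3.7 Sonnet":      {"reasoning": 5, "coding": 4, "cost_efficiency": 3,
                               "latency": 3, "context_quality": 5},
    "Gemini 2.0 Flash":       {"reasoning": 3, "coding": 3, "cost_efficiency": 4,
                               "latency": 5, "context_quality": 3},
    "Llama 3.1 70B Instruct": {"reasoning": 4, "coding": 4, "cost_efficiency": 4,
                               "latency": 4, "context_quality": 3},
    "Phi-3 Medium":           {"reasoning": 3, "coding": 3, "cost_efficiency": 5,
                               "latency": 5, "context_quality": 2},
}

# Assumed weight profiles matching the two priorities named above.
PROFILES = {
    "customer_facing":     {"reasoning": 0.4, "context_quality": 0.4, "latency": 0.2},
    "internal_automation": {"cost_efficiency": 0.4, "latency": 0.4, "coding": 0.2},
}

def rank(profile: str):
    """Rank models best-first by the weighted sum of the profile's dimensions."""
    weights = PROFILES[profile]
    scored = {
        model: sum(scores[dim] * w for dim, w in weights.items())
        for model, scores in SCORES.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for name, value in rank("customer_facing"):
        print(f"{name:<24} {value:.1f}")
```

Swapping in your own weights (or adding dimensions such as deployment control) changes the ordering, which is the point: the matrix is a starting grid, not a verdict.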