Frequently Asked Questions

What is the best LLM for most products?

Start with a top-tier closed model as your quality baseline, then add lower-cost models for bulk throughput.

Should we self-host open models?

Yes, when you need strong data control, predictable cost at scale, or low-latency regional deployment.

How many models should we run?

Start with one primary and one fallback model. Add more only when your evals prove measurable gains.
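The primary/fallback pattern above can be sketched in a few lines. This is a hypothetical illustration: `call_model` is a stand-in for whatever client your provider offers, and the model names are placeholders, not real endpoints.

```python
# Hypothetical primary/fallback call pattern.
# call_model is a placeholder for a real LLM provider client.

def call_model(model: str, prompt: str) -> str:
    # Simulate an outage on the primary so the fallback path runs.
    if model == "primary-model":
        raise TimeoutError("simulated outage")
    return f"[{model}] response to: {prompt}"

def generate(prompt: str,
             primary: str = "primary-model",
             fallback: str = "fallback-model") -> str:
    """Try the primary model; on a transient failure, retry once on the fallback."""
    try:
        return call_model(primary, prompt)
    except (TimeoutError, ConnectionError):
        return call_model(fallback, prompt)

print(generate("Summarize this ticket."))
```

In production you would also log the failover and cap retries, so a primary outage degrades quality rather than availability.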

Can one model fit every workload?

No. Most mature systems specialize models by task, latency target, and quality requirement.
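Specialization by task often reduces to a small routing table. A minimal sketch, with illustrative task labels, model names, and latency budgets that are assumptions, not real values:

```python
# Hypothetical routing table: task -> (model, latency budget in ms).
# All names and numbers here are illustrative.

ROUTES = {
    "chat":          ("large-general-model", 2000),
    "autocomplete":  ("small-fast-model",     200),
    "summarization": ("mid-tier-model",      1000),
}

def pick_model(task: str) -> str:
    """Return the model for a task, defaulting to the general chat model."""
    model, _latency_budget = ROUTES.get(task, ROUTES["chat"])
    return model

print(pick_model("autocomplete"))  # small-fast-model
```

The point of the table is that routing decisions become data, so adding or retiring a model is a config change your evals can gate, not a code change.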