AI Product · 13 min read
LLM inference costs that feel trivial at prototype scale ($50/month at 10K requests) become existential at product scale ($50,000/month at 10M requests) unless you have a cost architecture from day one. The four levers that control LLM costs are: model selection (smaller models for simpler tasks), caching (never pay for the same answer twice), prompt optimisation (fewer tokens = lower cost), and batching (async use cases don't need real-time latency pricing). Teams that implement all four typically achieve 5-10x cost reduction versus a naive always-use-the-best-model approach. This isn't premature optimisation — it's what separates AI features with healthy unit economics from those that quietly kill margins.
At prototype stage, LLM costs feel invisible. You're calling GPT-4o or Claude Sonnet, it costs a few dollars, it works well, everyone's happy. The problem is that prototype usage patterns are nothing like production usage patterns. In a prototype, an engineer makes 100 API calls while testing. In production, 10,000 users each making 10 API calls per session, at one session per user per day, is 100,000 calls per day — 3 million per month. At $0.005 per API call (roughly GPT-4o pricing for a 1K-token request), that's $15,000/month before you've added any smart cost management. Scale to 100,000 users and it's $150,000/month — a meaningful line item that changes the unit economics of your product entirely.
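The arithmetic above is worth keeping as a reusable helper. A minimal sketch — the per-call cost is an assumption, so substitute your own blended rate:

```python
def monthly_llm_cost(
    users: int,
    calls_per_user_per_day: float,
    cost_per_call: float,
    days: int = 30,
) -> float:
    """Back-of-envelope monthly spend: users x daily calls x cost per call x days."""
    return users * calls_per_user_per_day * cost_per_call * days

# 10,000 users, 10 calls per user per day, ~$0.005 per call
print(monthly_llm_cost(10_000, 10, 0.005))  # 15000.0
```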
The teams that don't feel this until it's a crisis are those that never built cost visibility into their AI feature infrastructure. Track cost per feature interaction from day one. Set alerts when cost per interaction exceeds target. Understand which use cases are most expensive and which users are heavy consumers. Without this visibility, you're flying blind into a cost problem that will eventually demand emergency refactoring of your entire AI architecture.
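As a sketch of what day-one cost visibility can look like — the class, method names, and alert threshold here are illustrative, not a specific observability tool:

```python
from collections import defaultdict


class CostTracker:
    """Illustrative per-feature spend tracker with a cost-per-interaction alert."""

    def __init__(self, alert_threshold_usd: float):
        self.alert_threshold = alert_threshold_usd
        self.spend = defaultdict(float)        # feature -> total USD spent
        self.interactions = defaultdict(int)   # feature -> interaction count

    def record(self, feature: str, cost_usd: float) -> None:
        """Call once per LLM-backed interaction with its measured API cost."""
        self.spend[feature] += cost_usd
        self.interactions[feature] += 1

    def cost_per_interaction(self, feature: str) -> float:
        return self.spend[feature] / max(self.interactions[feature], 1)

    def over_budget(self) -> list[str]:
        """Features whose average cost per interaction exceeds the target."""
        return [f for f in self.spend
                if self.cost_per_interaction(f) > self.alert_threshold]
```

In production you would persist these counters and wire `over_budget` into your alerting, but the shape of the data — spend and interaction counts keyed by feature — is the important part.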
The single largest cost reduction opportunity is almost always model selection. GPT-4o, Claude Sonnet, and Gemini 1.5 Pro are powerful general-purpose models that cost 5-20x more than their smaller siblings (GPT-4o Mini, Claude Haiku, Gemini Flash). For the majority of production AI tasks — summarisation, classification, extraction, templated generation — the small models perform at 85-95% of the quality of the large models at a fraction of the cost. Only genuinely complex reasoning tasks — multi-step analysis, nuanced creative writing, complex code generation — justify large model pricing.
Build a task classifier that routes requests to the appropriate model tier. Simple tasks (intent detection, entity extraction, short content generation): route to the small model tier. Medium tasks (document summarisation, structured data generation, single-step reasoning): medium models. Complex tasks (multi-document analysis, agentic workflows, high-stakes output): large models. This routing architecture, implemented well, typically reduces blended model cost by 50-70% with minimal quality degradation perceptible to end users.
| Model Tier | Examples | Relative Cost | Best For |
|---|---|---|---|
| Small | GPT-4o Mini, Claude Haiku, Gemini Flash | 1x | Classification, extraction, short generation |
| Medium | GPT-4o, Claude Sonnet, Gemini 1.5 Pro | 5-15x | Summarisation, structured output, RAG responses |
| Large | OpenAI o1, Claude Opus, Gemini Ultra | 20-50x | Complex reasoning, agentic tasks, code review |
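A routing layer of this kind can be as simple as a lookup table from classified task type to model tier. A minimal sketch — the model names and task labels are illustrative; swap in whatever your task classifier emits:

```python
# Illustrative tier -> model mapping (use your own provider's identifiers)
MODEL_TIERS = {
    "small": "gpt-4o-mini",
    "medium": "gpt-4o",
    "large": "o1",
}

# Illustrative task -> tier routing table, following the tiers in the article
TASK_ROUTES = {
    "intent_detection": "small",
    "entity_extraction": "small",
    "short_generation": "small",
    "summarisation": "medium",
    "structured_output": "medium",
    "rag_response": "medium",
    "multi_document_analysis": "large",
    "agentic_workflow": "large",
    "code_review": "large",
}


def route_model(task_type: str) -> str:
    """Map a classified task type to a model; default unknown tasks to medium."""
    tier = TASK_ROUTES.get(task_type, "medium")
    return MODEL_TIERS[tier]
```

Defaulting unknown tasks to the medium tier is a deliberate choice: it avoids silently degrading quality on tasks the classifier hasn't seen, while still keeping them off the most expensive tier.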
Two types of caching dramatically reduce LLM costs. Exact caching stores the response to a specific prompt hash — if the exact same prompt is sent again, return the cached response without hitting the API. This works well for templated prompts where the same structure is sent with identical parameters (e.g., "Summarise this product description: [PRODUCT X]" where PRODUCT X repeats). Semantic caching stores responses with embeddings and returns cached responses for semantically similar (not necessarily identical) queries. A user asking "What's the return policy?" and another asking "How do I return a product?" receive the same cached answer. Tools like GPTCache, LangChain's caching layer, and Redis with vector search can implement semantic caching. Hit rates of 40-70% are achievable for products with repeated query patterns (support chatbots, FAQ systems, content generation with templates).
Anthropic's prompt caching feature (available on the API) is specifically valuable for long system prompts or large context windows that are reused across requests — the API charges a reduced rate for cache-hit input tokens. For products with a fixed large system prompt (e.g., a customer support bot with a 5,000-token knowledge base in the system prompt), enabling prompt caching can reduce input token costs by 80-90%.
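A sketch of what such a request can look like with Anthropic's Messages API — the `cache_control` block marks the large, reused system prompt for caching. Field names and the model identifier are from memory; verify them against Anthropic's current documentation before relying on them:

```python
# Stand-in for the ~5,000-token support knowledge base that is identical on every request
KNOWLEDGE_BASE = "(large support knowledge base, identical on every request)"


def build_request(user_message: str) -> dict:
    """Build a Messages API payload with the reused system prompt marked cacheable."""
    return {
        "model": "claude-3-5-haiku-latest",
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                # Marks this block for server-side caching; later requests with the
                # identical prefix are billed at the reduced cache-hit input rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because caching matches on the prompt prefix, keep the cacheable system prompt byte-for-byte stable — even a timestamp interpolated into it will defeat the cache.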
LLM APIs charge by token — every word in your prompt costs money. Prompt inflation is common in products that grow organically: prompts accumulate instructions, examples, and context over time without anyone auditing for redundancy. A systematic prompt audit typically finds 20-40% token reduction opportunity with no quality impact. Techniques: remove redundant instructions (if you've said "be concise" 3 times in different ways, say it once), compress verbose examples (few-shot examples can often be shortened without losing their instructive value), use shorter system identifiers ("You are a support agent" instead of "You are a highly experienced customer support agent with expertise in..."), and move infrequently-needed instructions to conditional prompting (only add regulatory disclaimer instructions when the query triggers a compliance classifier).
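Conditional prompting, the last technique above, amounts to assembling the system prompt from parts. A sketch — `is_compliance_query` stands in for whatever compliance classifier you run, and the instruction strings are illustrative:

```python
BASE_INSTRUCTIONS = "You are a support agent. Be concise."
COMPLIANCE_INSTRUCTIONS = (
    "Append the regulatory disclaimer required for financial topics."
)


def build_system_prompt(query: str, is_compliance_query) -> str:
    """Only pay for compliance instruction tokens when the classifier flags the query."""
    parts = [BASE_INSTRUCTIONS]
    if is_compliance_query(query):
        parts.append(COMPLIANCE_INSTRUCTIONS)
    return "\n".join(parts)
```

The classifier call itself can run on the cheapest model tier (or a non-LLM model), so the gate costs far less than unconditionally shipping the extra instructions on every request.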
Output token optimisation: by default, LLMs often generate more tokens than necessary for the response. Explicit length constraints in the prompt ("respond in 2-3 sentences", "output valid JSON only, no explanation") reduce output tokens by 30-50% for many use cases. Setting `max_tokens` to a reasonable ceiling prevents runaway generation in edge cases.
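Both output-side controls can be combined in one request. A sketch using an OpenAI-style chat payload — the model name and token ceiling are illustrative:

```python
def constrained_request(prompt: str) -> dict:
    """Cap output spend: instruct brevity in the prompt AND set a hard max_tokens ceiling."""
    return {
        "model": "gpt-4o-mini",
        "max_tokens": 200,  # hard ceiling prevents runaway generation in edge cases
        "messages": [
            {
                "role": "system",
                "content": "Respond in 2-3 sentences. Output only the answer, no preamble.",
            },
            {"role": "user", "content": prompt},
        ],
    }
```

The two controls are complementary: the instruction shapes typical responses, while `max_tokens` is the safety net that bounds worst-case cost.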
Not every LLM call needs sub-second latency. For use cases where the user isn't waiting in real-time for the response — background content generation, periodic report creation, email draft preparation, data enrichment pipelines — batch processing reduces cost by 30-50%. Anthropic's Batch API and OpenAI's Batch API both offer discounts for async batched requests that are processed within a 24-hour window. If you have overnight report generation, content enrichment jobs, or any AI processing that doesn't need to be synchronous with user interaction, batch processing is low-hanging fruit.
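OpenAI's Batch API takes a JSONL file where each line is one self-describing request. A sketch of building those lines — verify the exact shape against OpenAI's current Batch API docs before use:

```python
import json


def batch_line(custom_id: str, prompt: str) -> str:
    """One JSONL line in the shape OpenAI's Batch API expects for chat completions."""
    return json.dumps({
        "custom_id": custom_id,      # your own ID, echoed back in the results file
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    })


def build_batch_file(prompts: list[str]) -> str:
    """Concatenate one line per request into the JSONL payload to upload."""
    return "\n".join(batch_line(f"req-{i}", p) for i, p in enumerate(prompts))
```

From there the flow (per OpenAI's docs) is to upload the file with `purpose="batch"`, create a batch against the `/v1/chat/completions` endpoint with a `24h` completion window, and poll for the results file — all off the user's critical path.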
Fine-tuning a foundation model on your own data can reduce costs (a smaller fine-tuned model can match the quality of a larger general model on your specific task) but requires a significant up-front investment in data preparation, training runs, and ongoing maintenance. The break-even calculation: if your monthly API spend exceeds $5,000-10,000 on a specific high-volume task, fine-tuning becomes worth evaluating. For most teams, the sequence is: (1) start with foundation model API calls, (2) optimise with the four levers above, (3) only consider fine-tuning when volume is high and the task is narrow enough that a smaller specialised model would clearly outperform a general one on your use case.
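The break-even logic can be made explicit. A sketch — the savings rate, up-front cost, and maintenance figures below are illustrative assumptions, not vendor pricing:

```python
def fine_tuning_payback_months(
    monthly_api_spend: float,
    expected_savings_rate: float,   # e.g. 0.5 if the fine-tuned model halves inference cost
    upfront_cost: float,            # data prep + training runs
    monthly_maintenance: float = 0.0,
) -> float:
    """Months to recoup the up-front fine-tuning investment; inf if it never pays back."""
    monthly_savings = monthly_api_spend * expected_savings_rate - monthly_maintenance
    if monthly_savings <= 0:
        return float("inf")
    return upfront_cost / monthly_savings

# $10K/month task, assumed 50% inference savings, $15K up-front -> ~3 month payback
print(fine_tuning_payback_months(10_000, 0.5, 15_000))  # 3.0
```

Running the same numbers at $1K/month of spend shows why the article's threshold exists: the payback period stretches past any reasonable planning horizon, or never arrives once maintenance is counted.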
Three metrics to track weekly: cost per session (total LLM spend ÷ number of active sessions — should be stable or declining as you optimise), cost as a percentage of revenue (if you're charging for an AI-powered product, LLM cost should be a manageable percentage of what you charge — typically targeting under 20%), and cost per user tier (power users may consume 10-50x the LLM resources of light users — understand this distribution before pricing your tiers). If you can answer these three questions confidently, your cost architecture is healthy. If you can't, you need observability infrastructure before you scale further.
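The three metrics are straightforward to compute once spend is instrumented. A sketch, with illustrative field names, that turns raw weekly numbers into the health check described above:

```python
def weekly_cost_metrics(
    total_llm_spend: float,
    active_sessions: int,
    revenue: float,
    spend_by_user: dict[str, float],
) -> dict:
    """The three weekly health metrics: per-session cost, % of revenue, user spread."""
    spends = sorted(spend_by_user.values())
    return {
        "cost_per_session": total_llm_spend / max(active_sessions, 1),
        "cost_pct_of_revenue": 100 * total_llm_spend / revenue if revenue else None,
        # Ratio of the heaviest consumer to the lightest — the distribution to
        # understand before pricing tiers.
        "heavy_vs_light_ratio": spends[-1] / spends[0] if spends and spends[0] > 0 else None,
    }
```

If any of these three numbers is hard to produce from your current logging, that gap — not further prompt tuning — is the next thing to fix.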
We help product teams design cost-efficient AI architectures — from model selection to caching to prompt strategy. Book a free session.
Book Free Strategy Call