The Complete Guide to AI Token Pricing and Cost Optimization (2026)

Published April 20, 2026 · 14 min read · By FastTool Editorial

The billing surprise always looks the same. A team ships a feature that summarizes support tickets. It worked in staging. The first monthly invoice shows a four-figure line that nobody forecasted. The prompt was small, the output was small, and the product usage was linear, but somewhere along the way the tokens multiplied and nobody was counting.

This guide is for anyone who has been in that room, or is trying to avoid it. We cover what a token actually is, what GPT-5, Claude Opus 4.5, Gemini 2.5 Pro, and Llama 3.3 charge for one as of April 2026, how prompt caching changes the math, and how to run the numbers on a real workload before you ship it. We cite the primary sources — OpenAI pricing docs, Anthropic docs, Google AI Studio, Hugging Face — and use browser-based counters that never upload your prompts.

What a token is (and is not)

A token is the smallest chunk of text a language model sees. It is not a word. It is not a character. It is a substring that the tokenizer learned to recognize during training because it appeared often enough in the corpus to earn a dedicated ID in the vocabulary.

For English prose the ratio settles near four characters per token, or roughly four tokens for every three words. That ratio is a shortcut, not a rule. Run a paragraph of Python through the same tokenizer and you will see 1.3 to 1.7 tokens per word because snake_case names split on underscores and operators. Run a paragraph of Japanese and you can hit two to three tokens per character because the byte-level BPE is representing multi-byte UTF-8 sequences. Run a URL with query parameters and watch each ampersand and equals sign become its own token.
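For a first-pass budget, the four-characters-per-token shortcut can be written down directly. This is a minimal sketch for English prose only; `estimate_tokens` is an illustrative name, and the result is a rough estimate that a real tokenizer will not match exactly.

```python
import math

CHARS_PER_TOKEN = 4.0  # rule of thumb for English prose; wrong for code, URLs, CJK


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count. Not a tokenizer."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


prompt = "Summarize the following support ticket in two sentences."
print(estimate_tokens(prompt))
```

Treat the number as a sanity check, then confirm with the provider's own tokenizer before projecting a bill.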

The reason providers bill per token rather than per character is that tokens are the model's atomic unit of compute. Every forward pass through the transformer processes a batch of tokens, and every generated reply costs one forward pass per output token in the autoregressive phase. Tokens map directly onto GPU cycles, which is why the price sheet is denominated in them.

If you want to confirm token counts before sending a prompt, run the exact text through ChatGPT Token Counter for OpenAI models, Claude Tokenizer for Anthropic, or Gemini Token Counter for Google. The counts will disagree slightly — that is the point. Each family trained its own tokenizer.

Byte-pair encoding in five minutes

Byte-pair encoding, usually written BPE, is the algorithm behind almost every production tokenizer shipping in 2026. The idea comes from Philip Gage's 1994 compression paper. The NLP adaptation traces to Sennrich et al.'s 2016 work on subword units for machine translation, and the byte-level variant used by LLMs traces to OpenAI's GPT-2, with later tooling and teaching material from Karpathy and the Hugging Face team. Karpathy's 2024 talk on building the GPT tokenizer walks through the full algorithm if you want a two-hour deep dive.

The algorithm starts from bytes. Take the string "lower". Encoded as UTF-8 it is five bytes: l o w e r. Each byte is a token, drawn from a base vocabulary of the 256 possible byte values. Now take a large training corpus and count the most common adjacent byte pair. Maybe it is e r because English has a lot of words ending in that suffix. Merge it into a new token with its own ID: er. Now the string becomes l o w er. Count pairs again. Maybe lo is next. Merge. Now the string is lo w er. Repeat 30,000 to 100,000 times and you have a vocabulary that efficiently represents the patterns in your training data.
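The merge loop described above can be sketched in a few lines. This is a toy trainer for illustration, not how any production tokenizer is implemented: it works on one in-memory byte string, skips the pre-tokenization step real tokenizers use, and the name `train_bpe` is made up for this example.

```python
from collections import Counter


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to num_merges merges; return (merges, final token sequence)."""
    tokens = [bytes([b]) for b in corpus]  # start with one token per byte
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append((left, right))
        merged, out, i = left + right, [], 0
        while i < len(tokens):
            # Replace each occurrence of the chosen pair, scanning left to right.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens


merges, tokens = train_bpe(b"lower lower lower", num_merges=4)
print(merges)
print(tokens)
```

Each learned merge becomes a new vocabulary entry. Concatenating the final tokens always reproduces the original bytes, which is what makes the encoding lossless.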

The consequence is that common substrings become single tokens and rare substrings stay fragmented. "hello" is one token on most tokenizers. "antidisestablishmentarianism" might be six. A curly brace followed by a newline might be its own token because it appeared so often in code that it earned a slot. The tokenizer is, in effect, a compressed summary of its training corpus.

Two operational implications fall out of this. First, unusual text — proper nouns, obscure technical terminology, long hex strings — uses more tokens than prose of the same length. Second, tokenizers trained on different corpora produce different splits, which is why the three big providers give different token counts for the same input. For code-heavy prompts, GPT-4o's tokenizer tends to be slightly more efficient than Claude's; for multilingual content, Gemini's tokenizer tends to be more efficient than both.

The April 2026 pricing table

These are the headline per-million-token rates for the flagship and workhorse tiers as of April 20, 2026. Prices move — verify with each provider's pricing page before committing to a spend plan. All figures are USD per million tokens.

| Model | Input | Cached input | Output | Context |
|---|---|---|---|---|
| GPT-5 (OpenAI) | $1.25 | ~$0.625 | $10.00 | 400K |
| GPT-5.4 (OpenAI) | $2.50 | ~$1.25 | $10.00 | 400K |
| GPT-5 Mini | $0.15 | ~$0.075 | $0.60 | 400K |
| Claude Opus 4.5 (Anthropic) | $5.00 | $0.50 | $25.00 | 1M |
| Claude Sonnet 4.5 | $3.00 | $0.30 | $15.00 | 1M |
| Claude Haiku 4.5 | $0.80 | $0.08 | $4.00 | 200K |
| Gemini 2.5 Pro (Google) | $1.25 | $0.31 | $10.00 | 2M |
| Gemini 2.5 Flash | $0.30 | $0.075 | $2.50 | 1M |
| Llama 3.3 70B (Together) | $0.88 | n/a | $0.88 | 128K |
| Llama 3.3 70B (Groq) | $0.59 | n/a | $0.79 | 128K |

A few observations. First, output is between 4x and 8x the price of input on flagship tiers. That matters because every token you generate is more expensive than every token you send. Second, Anthropic's cache hit price is the most aggressive in the industry at 10 percent of normal input. Third, Llama hosted through Together has symmetric input/output pricing and Groq comes close to it, which changes the arithmetic for workloads that are output-heavy.

For quick cross-model math, AI Cost Comparison takes a token count and returns a side-by-side table for each provider. LLM Price Calculator accepts input and output counts separately and handles the asymmetric pricing cleanly. LLM Token Cost Comparator is useful when you want to spot which provider wins for a specific input/output ratio.

Three realistic cost examples

Pricing tables are abstract. Real budgets live in workloads. Three examples:

Example 1: Customer support chatbot with 1,000 MAU

Assume 1,000 monthly active users. Each user averages 15 messages per month. Each round trip is 400 input tokens (system prompt + user message + retrieved context) and 120 output tokens. Total tokens per user per month: 15 × 520 = 7,800. Total across the user base: 7.8M.

Split by direction, that 7.8M is 6M input (15,000 round trips × 400) and 1.8M output (15,000 × 120). Now apply the pricing:

  • GPT-5: (6 × $1.25) + (1.8 × $10) = $7.50 + $18.00 = $25.50/month
  • Claude Opus 4.5: (6 × $5) + (1.8 × $25) = $30 + $45 = $75/month
  • Gemini 2.5 Flash: (6 × $0.30) + (1.8 × $2.50) = $1.80 + $4.50 = $6.30/month
  • Claude Haiku 4.5: (6 × $0.80) + (1.8 × $4) = $4.80 + $7.20 = $12/month
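The same arithmetic can be kept in a small script so it is easy to rerun when prices or traffic change. This is a sketch: the rates are copied from the table in this guide, and `monthly_cost` is an illustrative helper, not part of any provider SDK.

```python
# USD per million tokens (input, output), copied from the pricing table above.
RATES = {
    "GPT-5": (1.25, 10.00),
    "Claude Opus 4.5": (5.00, 25.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Claude Haiku 4.5": (0.80, 4.00),
}


def monthly_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Monthly spend in USD for the given token totals, rounded to cents."""
    input_rate, output_rate = RATES[model]
    cost = input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate
    return round(cost, 2)


users, messages_per_user = 1_000, 15
input_tokens = users * messages_per_user * 400   # 6M
output_tokens = users * messages_per_user * 120  # 1.8M

for model in RATES:
    print(f"{model}: ${monthly_cost(input_tokens, output_tokens, model):.2f}/month")
```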

Caching changes that math. Anthropic bills cached input at 10 percent of the normal rate after the first call, so if a stable system prompt accounts for about three quarters of the input tokens, Opus input falls from $30 to roughly $10 and the bill lands around $55/month on the same traffic. That is the lever that makes premium models viable at scale.

Example 2: RAG over 10,000 documents

Documents are retrieved from a vector database and injected into the prompt. Assume the average retrieved chunk is 4,000 tokens and three chunks are used per query. That is 12,000 tokens of retrieval context per query, plus maybe 1,500 tokens of system prompt and user message, for 13,500 input tokens. Output is 400 tokens.

At 100,000 queries per month (a mid-size internal tool), you are sending 1.35 billion input tokens and 40 million output tokens. On GPT-5 that comes to (1,350 × $1.25) + (40 × $10) = $1,687.50 + $400 = $2,087.50/month. On Gemini 2.5 Flash the same traffic is (1,350 × $0.30) + (40 × $2.50) = $405 + $100 = $505/month. On Claude Opus 4.5 it is (1,350 × $5) + (40 × $25) = $6,750 + $1,000 = $7,750/month, unless you cache the retrieved context for repeated documents, which can drop that significantly if your queries concentrate on a small set of popular documents.

For embeddings in the retrieval step, OpenAI's text-embedding-3-large is $0.13 per million tokens. If you embed 10,000 documents averaging 8,000 tokens once, that is 80M tokens at $10.40. Indexing costs are almost always trivial compared to query costs.

You can model the retrieval and generation cost together with LLM Embedding Cost Calculator and Prompt Token Budget Calculator.

Example 3: Code completion copilot

Code suggestions are the opposite shape from chat. Input is long (whole file plus cursor context) and output is short (a few dozen tokens). A developer using copilot actively sends roughly 20 requests per hour with 2,500 input tokens and 50 output tokens each. Across a 40-hour week, one developer generates 800 requests, 2M input tokens, and 40K output tokens.

Across 50 developers for a full month, you land at 50 × 4 weeks × 800 = 160,000 requests, 400M input tokens, and 8M output tokens. On GPT-5: (400 × $1.25) + (8 × $10) = $500 + $80 = $580/month for 50 devs, or about $11.60 per seat. That beats most copilot subscriptions on list price.

The key optimization for this shape is prompt caching on the file prefix and aggressive output truncation via max_tokens. Code completions rarely need more than 200 tokens.

Context windows and the hidden tax

Context windows are advertised as capacity. They are billed as consumption. When Gemini 2.5 Pro advertises a 2-million-token context, that number is the ceiling, not the cost. You pay for every token you actually send on every request.

This is where naive implementations hemorrhage money. In a chat app that passes the full conversation on every turn, the input per request grows linearly with turn count, so the cumulative input billed over a conversation grows quadratically. By turn 20 of a moderately long conversation, you might be sending 40,000 tokens per request to answer a one-sentence follow-up. Multiply by a user base and the input cost climbs fast.

The defense is conversation summarization. Once the window exceeds some threshold (8,000 tokens is common), compress the older turns into a summary and keep only the last N verbatim. Most frameworks — LangChain, LlamaIndex, Semantic Kernel — ship memory classes that implement this pattern. The principle is provider-agnostic.
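A rough sketch of both halves of this section: how full-history resends add up, and a threshold-based compaction step. The names `compact_history`, `count_tokens`, and `summarize` are hypothetical. In practice `count_tokens` would wrap the provider's tokenizer and `summarize` would be a call to a cheap model. The defaults (8,000 tokens, last 6 messages) are assumptions to tune.

```python
from typing import Callable

Message = dict  # e.g. {"role": "user", "content": "..."}


def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every request resends the full history."""
    # Request k carries k turns, so the total is tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


def compact_history(
    messages: list[Message],
    count_tokens: Callable[[list[Message]], int],
    summarize: Callable[[list[Message]], str],
    threshold: int = 8_000,
    keep_last: int = 6,
) -> list[Message]:
    """Once history exceeds `threshold` tokens, fold older turns into one summary."""
    if count_tokens(messages) <= threshold or len(messages) <= keep_last:
        return list(messages)
    cut = len(messages) - keep_last
    summary = {"role": "system", "content": summarize(messages[:cut])}
    return [summary] + messages[cut:]


print(cumulative_input_tokens(turns=20, tokens_per_turn=2_000))
```

With 2,000 tokens per turn, 20 turns of full-history resends total 420,000 input tokens, versus 40,000 if each turn were billed only once.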

For visualizing where your conversation sits in the window, LLM Context Window Visualizer renders a proportional bar so you can see at a glance how much of the budget you have spent.

Prompt caching: the 90 percent lever

Prompt caching is the single highest-leverage cost optimization available in 2026. It works because most LLM prompts share a long, stable prefix — a system prompt, a tool description, a set of few-shot examples — followed by a short dynamic tail. The prefix is identical across thousands of requests. Caching the prefix's internal representation on the provider side means you do not pay full price to reprocess it.

The two approaches differ:

Anthropic: manual caching with 10 percent read cost

You mark the stable prefix with a cache_control block. Anthropic stores the prefill state for 5 minutes by default, or for 1 hour if you accept a higher write price (2x the base input rate instead of 1.25x). Cache hits within that TTL bill at 10 percent of normal input. A 2,000-token system prompt that you send 100 times per hour costs about $1.00/hour uncached on Opus 4.5 (200K tokens at $5 per million). With the cache kept warm on the 5-minute TTL it costs about $0.11/hour, a saving of roughly $0.89/hour. That seems small until you project it annually and add more users.

// Anthropic prompt caching: Messages API request body (2026 API)
// max_tokens is a required field on this endpoint; 1024 is a placeholder value
{
  "model": "claude-opus-4-5",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "<long stable system prompt here>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Today's question..." }
  ]
}

OpenAI: automatic caching with 50 percent read cost

OpenAI detects repeated prefixes automatically on prompts over roughly 1,024 tokens. There is no API change required. Cache hits bill at about half the normal input rate. It is less aggressive than Anthropic but requires zero code changes, which is a legitimate tradeoff. OpenAI reports the number of cached prompt tokens in each response's usage metadata, so you can see what is actually being served from cache.
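One way to track this is to compute a cached fraction from the usage object of each response. The sketch below assumes the Chat Completions usage shape, where cached tokens appear under `prompt_tokens_details.cached_tokens`. The sample numbers are made up, and `cached_fraction` is an illustrative helper rather than an SDK function.

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from cache, given a usage dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Illustrative values only; in practice this comes from the API response.
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1536},
}

print(f"{cached_fraction(sample_usage):.0%} of the prompt was served from cache")
```

Logging this per endpoint shows quickly whether a prefix is stable enough to benefit, or whether something dynamic (a timestamp, a user ID) near the top of the prompt is defeating the cache.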

Gemini: context caching with per-hour storage fee

Google separates caching into an explicit API with a storage-hour fee plus discounted reads. For long-lived prefixes that get heavy traffic (think knowledge bases re-queried all day), Gemini's caching can beat both.

The quick sanity check: on Anthropic, caching pays off on the first read with the 5-minute TTL (1.25x write, 0.1x read) and by the second read with the 1-hour TTL (2x write). If your prefix is long and gets reused, turn it on. If your prefix is short or rarely repeats, it is not worth the complexity. Batch API Cost Calculator and AI Usage ROI Calculator can model the break-even.
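The break-even can be checked with a few lines of arithmetic. This sketch works in units of "one uncached pass over the prefix" and assumes Anthropic-style multipliers: 0.1x for a cache read, 1.25x for a 5-minute write, and 2x for a 1-hour write (the last figure is taken from Anthropic's published pricing rather than from this guide). The function names are illustrative.

```python
import math


def cached_cost(reads: int, write_multiplier: float, read_multiplier: float = 0.10) -> float:
    """One cache write plus `reads` cache hits, in units of one uncached prefix."""
    return write_multiplier + reads * read_multiplier


def uncached_cost(reads: int) -> float:
    """The same traffic without caching: the first request plus `reads` repeats."""
    return 1.0 + reads


def break_even_reads(write_multiplier: float, read_multiplier: float = 0.10) -> int:
    """Smallest number of reads at which caching costs no more than not caching."""
    # write + r * read <= 1 + r   =>   r >= (write - 1) / (1 - read)
    return math.ceil((write_multiplier - 1.0) / (1.0 - read_multiplier))


for label, write in (("5-minute TTL", 1.25), ("1-hour TTL", 2.0)):
    print(label, "breaks even after", break_even_reads(write), "read(s)")
```

The same functions work for other providers by swapping in their multipliers. Storage-hour fees, as on Gemini, would need an extra term.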

Self-hosting vs API: where the line is

Self-hosting feels cheap until you price the operational cost. Let us run the math.

Llama 3.3 70B on a single H100 GPU (in practice a quantized build, since 70B parameters at 16-bit precision need about 140 GB and an H100 has 80 GB) runs at roughly 80-140 tokens per second for a single stream with vLLM, or 2,000-4,000 tokens per second in batched inference. H100 on-demand pricing in April 2026 runs $2-3 per hour across the major GPU clouds; reserved instances drop to around $1.50.

Take the middle ground: 3,000 tokens/second effective throughput, $2/hour GPU cost. That is 10.8M tokens/hour at $2, or $0.185 per million tokens. Compare that to Llama 3.3 through Together at $0.88/million or Groq at $0.59/million (input). Self-hosting is cheaper per token — if you can keep the GPU saturated.

You cannot. Real workloads have diurnal traffic, spiky usage, and idle periods. If your GPU is running at 30 percent utilization, your effective per-token cost more than triples, to about $0.62 per million, which is already in the range of hosted pricing. Plus you pay for the GPU when the app is idle. Plus you pay an engineer to run the stack.
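The utilization effect is easy to model. The sketch below uses the figures from this section ($2/hour, 3,000 tokens/second at saturation) and treats utilization as the fraction of peak throughput actually served. It ignores engineering time, storage, and networking, so it is a lower bound on real cost. The function name is illustrative.

```python
def self_host_cost_per_million(
    gpu_usd_per_hour: float,
    peak_tokens_per_second: float,
    utilization: float,
) -> float:
    """Effective USD per million tokens at a given average utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1e6


for utilization in (1.0, 0.6, 0.3, 0.1):
    cost = self_host_cost_per_million(2.0, 3_000, utilization)
    print(f"{utilization:>4.0%} utilization: ${cost:.3f} per million tokens")
```

Comparing the output against the hosted Llama rates in the pricing table shows roughly where self-hosting stops being cheaper on GPU cost alone.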

The rough rule of thumb as of April 2026: API beats self-host below 5 million tokens per day of steady-state traffic. Above 50 million tokens per day, self-hosting on committed GPU capacity is likely cheaper. In between is a judgment call that comes down to how much engineering time you are willing to spend optimizing utilization. For fine-tuning cost modeling, LLM Fine Tuning Cost Estimator helps size the training run against expected inference volume.

Six patterns that silently burn tokens

1. Pretty-printed JSON in prompts

JSON with two-space indentation and newlines uses 2-3x the tokens of the same JSON minified. If you are sending structured data to the model, JSON.stringify(data) without the indent argument is effectively free savings.
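The overhead is easy to measure on your own payloads. The sketch below compares character counts, which is only a proxy: the token ratio depends on the tokenizer and on how deeply the data is nested, so run the two strings through a real token counter for billing numbers. The record is made-up sample data. Note that Python's `json.dumps` pads separators by default, so minifying needs an explicit `separators` argument.

```python
import json

# Made-up sample payload for illustration.
record = {
    "ticket_id": 48213,
    "customer": {"name": "Ada", "plan": "pro", "tags": ["billing", "urgent"]},
    "messages": [
        {"role": "user", "text": "I was charged twice."},
        {"role": "agent", "text": "Checking now."},
    ],
}

pretty = json.dumps(record, indent=2)
minified = json.dumps(record, separators=(",", ":"))  # drop the default ", " and ": " padding

print(len(pretty), len(minified))
print(f"pretty-printed is {len(pretty) / len(minified):.2f}x the characters of minified")
```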

2. Verbose system prompts

A 3,000-token system prompt on every request adds up fast. Audit it. Most production system prompts can be cut to 1,000-1,500 tokens with no quality loss. If you cannot cut it, cache it.

3. Uncapped max_tokens

If you do not set max_tokens, the model will sometimes decide to write an essay. Set it. For classification, 20 is plenty. For summarization, 500 is usually enough. Output is the expensive side of the invoice.

4. Reprocessing documents on every query

If you have a knowledge base, embed it once and retrieve relevant chunks. Do not paste the whole document on every call. RAG exists exactly because context-stuffing is expensive and usually hurts quality.

5. Full chat history every turn

Covered above. Summarize at threshold.

6. Leaky tool definitions

If you are using function calling with 20 tool schemas, those schemas are sent every single request. Trim the toolset dynamically based on the query type. Most queries need three tools, not twenty. Function Call Token Calculator shows exactly what each tool schema costs per call.
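A minimal sketch of dynamic tool trimming, assuming you already classify each query into an intent upstream. Every name here (`select_tools`, the intents, the tool names) is hypothetical. The point is only that the request carries a subset of schemas rather than the full registry.

```python
def select_tools(all_tools: dict, intent: str, tools_by_intent: dict, always=()) -> list:
    """Return only the tool schemas relevant to this intent, plus always-on tools."""
    if intent not in tools_by_intent:
        # Unknown intent: fall back to the full toolset rather than risk a missing tool.
        return list(all_tools.values())
    # dict.fromkeys removes duplicates while preserving order.
    wanted = dict.fromkeys([*always, *tools_by_intent[intent]])
    return [all_tools[name] for name in wanted if name in all_tools]


# Hypothetical registry: tool name -> schema sent to the model.
ALL_TOOLS = {
    name: {"name": name, "description": "...", "parameters": {}}
    for name in ("search_docs", "lookup_invoice", "issue_refund", "create_ticket", "get_weather")
}
TOOLS_BY_INTENT = {
    "billing": ["lookup_invoice", "issue_refund"],
    "support": ["create_ticket"],
}

tools = select_tools(ALL_TOOLS, "billing", TOOLS_BY_INTENT, always=("search_docs",))
print([tool["name"] for tool in tools])
```

Falling back to the full set on an unknown intent trades some tokens for safety. Whether that is the right default depends on how reliable the upstream classifier is.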

The calculators worth using

Rather than guess, measure. For most workflows three tools are enough: a tokenizer for the specific model, a cost comparator for picking the provider, and a budget calculator for projecting monthly spend.

All of these run entirely in the browser. They never upload your prompts, which matters when you are tokenizing internal strategy documents, customer support transcripts, or code from a private repository. Check DevTools → Network on any of them and you will see zero outbound requests beyond static assets.

Related guide

For a deeper dive into how the flagship models compare on capability (not just price), see Claude 4.7 vs GPT-5.4 vs Gemini 3.1: The April 2026 Benchmark. For prompt caching mechanics in depth, see Prompt Caching and Batch API Cost Reduction in 2026.

FAQ

What does a GPT-5 token actually cost in 2026?

GPT-5 standard is $1.25 per million input tokens and $10 per million output tokens. GPT-5.4 (the extended thinking variant) runs $2.50 input and $10 output. Cached input on OpenAI bills at roughly half the normal rate after automatic cache detection kicks in.

How do I compare Claude, GPT, and Gemini for the same prompt?

Tokenize the prompt with each provider's tokenizer (they disagree), then multiply the count by the current per-million rate. Use AI Cost Comparison to do it in one pass. Do not rely on word-count shortcuts; tokenizer boundaries differ and a 500-word prompt can be 620 tokens on one model and 740 on another.

Does prompt caching really cut costs by 90 percent?

On Anthropic, yes: cached input tokens bill at 10 percent of the normal rate. You pay a small cache-write premium (1.25x for 5-minute TTL), which means caching breaks even on the first read. OpenAI's automatic caching cuts closer to 50 percent with zero code changes. Gemini has its own API with a separate storage fee.

Is self-hosting Llama cheaper than the GPT-5 API?

Only at volume. Below about 5 million tokens per day of steady traffic, API pricing on hosted Llama through Groq or Together wins because you are not paying for idle GPU time. Above 50 million tokens per day, self-hosting on committed GPUs is usually cheaper but requires engineering investment. In between, it comes down to how well you can keep the GPUs utilized.

How many tokens is one English word?

About 1.3 on average for English prose; put the other way, one token is roughly 0.75 words. Code runs 1.3-1.7 tokens per word, and some non-Latin scripts such as Japanese can hit 2-3 tokens per character. Text with lots of punctuation and URLs also runs high. Always measure the actual text with the actual tokenizer before projecting a bill.

Why are output tokens more expensive than input?

Output tokens are produced one at a time through the full transformer stack (autoregressive generation), while input tokens are processed in parallel during prefill. The hardware cost of generation is higher, and providers price it accordingly — output is typically 4-8x the input price on flagship tiers.

What is the cheapest way to run a chatbot with 1,000 MAU?

At typical usage patterns (15 messages per user per month, about 520 tokens round-trip), a 1,000 MAU chatbot lands near 7.8M tokens per month. On Gemini 2.5 Flash that is about $6/month. On GPT-5 it is around $25. On Claude Opus 4.5 around $75. Most production chatbots route trivial queries to a cheap model and escalate hard ones to a flagship, which typically cuts the bill 60-80 percent with minimal quality loss.

Closing thought

Token economics is dull until the bill arrives. The teams that ship LLM features without billing drama are the ones that count tokens as a first-class resource — measured with the real tokenizer, budgeted per feature, watched with alerts, and optimized with caching when the prefix is stable. Spend an afternoon running your real prompts through a browser-based counter and graphing the distribution by endpoint. Once you can see the distribution, the optimizations pick themselves.