AI Developer Cluster

AI Token Counting: A Complete Guide to LLM Costs

Published April 11, 2026 · 11 min read

The first time a billing surprise hits, it is almost always the same shape: a team shipped a feature that summarizes customer tickets, the model worked beautifully in testing, and then the invoice arrived with three extra digits. Nobody did anything wrong. They just did not realize that "500 words of ticket" was not 500 tokens — it was closer to 700 — and that they were also paying for the system prompt, the few-shot examples, and the model's own response on every single call. Token counting is not an academic detail. It is the unit your bill is denominated in.

This guide is for developers and builders who want to stop guessing. We will cover what a token actually is, how byte-pair encoding decides where to cut your text, how the big three providers differ, how context windows interact with pricing, and how to estimate a workload before you run it so there are no surprise invoices.

What a token actually is

A token is the smallest chunk of text the model actually sees. It is not a letter and it is not a word. For English prose, the rough heuristic is that one token corresponds to about four characters or roughly three-quarters of a word. That ratio breaks the moment your text contains code, JSON, non-Latin scripts, long URLs, or unusual punctuation. A paragraph of Python can easily hit 1.5 tokens per word. A paragraph of Japanese can hit two to three tokens per character. The heuristic is a shortcut, not a measurement.

The reason the ratio varies is that the tokenizer is not splitting on spaces. It is splitting on a vocabulary it learned during training, and that vocabulary is optimized for common patterns in the training data. "Hello" is probably one token. "antidisestablishmentarianism" might be six. A curly brace followed by a newline might be its own token if it appeared often enough during training. The only way to know is to run the text through the same tokenizer the model uses.

Byte-pair encoding in 90 seconds

Almost every modern LLM uses a variant of byte-pair encoding. BPE starts with individual bytes and iteratively merges the most frequent adjacent pairs into new tokens. Run this process long enough and you end up with a vocabulary that contains common words as single tokens, common prefixes and suffixes as their own tokens, and rare strings decomposed into shorter pieces. GPT uses a BPE variant called tiktoken; Claude and Gemini use their own variants with their own vocabularies. The algorithm is similar; the vocabularies are not.

This matters because the same input string produces different token counts across providers. A 1,000-word English document might tokenize to 1,300 tokens on GPT-4, 1,250 on Claude, and 1,350 on Gemini. None of these numbers is wrong. They reflect different training vocabularies. The official tiktoken library is the reference for OpenAI models, and Anthropic publishes a token counting endpoint for Claude. Google documents Gemini tokenization in the Gemini API docs.

To inspect real prompts without hitting an API, use ChatGPT Token Counter. It runs a JavaScript port of tiktoken in your browser, so pastes stay on the page during standard processing. For Claude and Gemini, the dedicated Claude Tokenizer and Gemini Token Counter use provider-specific vocabularies and give you an accurate count for each model family.

The big three and why their counts differ

OpenAI (GPT-4.1, GPT-4o, o-series)

OpenAI bills per 1,000 input tokens and per 1,000 output tokens separately, with output tokens usually more expensive than input. The cl100k_base and o200k_base vocabularies power the current models. Prices are published on the OpenAI API pricing page, and they change often enough that you should hardcode them in exactly one place in your app. For quick cross-model comparisons, LLM Price Calculator lets you plug in token counts and see the cost in every current model side by side.

Anthropic (Claude 3.5 Sonnet, Opus, Haiku)

Claude models use a distinct tokenizer. Input and output are priced separately as with OpenAI. Anthropic also charges for cached prompt reads at a fraction of the full input rate — prompt caching is one of the most effective cost optimizations if your system prompt is long and repeated. The Claude prompt caching docs explain the mechanics and the exact discount factors.

Google (Gemini 1.5 Pro, 2.0 Flash)

Gemini pricing has a twist: it tiers based on prompt size. Prompts under a threshold are priced at one rate; prompts above the threshold are priced higher. If your workload oscillates around the boundary, small cuts to the prompt can move you into the lower tier and halve the bill. This is exactly the kind of thing you do not notice until you look at the invoice. See the Gemini API pricing page for the current thresholds.

Context windows and the hidden tax

A 200k-token context window does not mean you should use 200k tokens. Long contexts have three hidden taxes. First, you pay for every token in the prompt on every call — whether the model reads them attentively or not. Second, attention quality degrades toward the middle of very long contexts, a phenomenon sometimes called "lost in the middle" and documented in Liu et al.'s 2023 paper of the same name. Third, latency scales with context size, so a chatty UI built on a maximum-context prompt feels sluggish.

The practical advice: put only what the model needs into the prompt. Use retrieval to pull in the right chunks instead of stuffing the whole knowledge base. Visualize how much of the window you are using before you ship with LLM Context Window Visualizer, which lays out your system prompt, examples, and user message against the window capacity so you can see what is taking up space. For day-to-day budgeting, Prompt Token Budget Calculator runs through system-prompt, history, retrieval, and output allocations so you can size each section before wiring it in.

Estimating a workload before you run it

The workflow we recommend for anything above a prototype:

Write the system prompt and a realistic example user message.
Count input tokens for the combined prompt using the correct tokenizer for your model.
Estimate expected output tokens — this is the hard one; set a reasonable max and enforce it.
Multiply by requests per day, then by 30, to get monthly cost.
Multiply by 3 as a safety margin. Real traffic is messier than your estimate.

For step 4, plug the counts into LLM Price Calculator. For tasks that will eventually need fine-tuning, LLM Fine-Tuning Cost Estimator projects training and inference costs against the base pricing so you can compare the "just prompt harder" path against the "train a specialist model" path honestly. Tokens also show up in embedding workloads — the LLM Embedding Cost Calculator handles that side of the bill separately, since embedding models are priced much lower than generation models.

If you are still drafting payloads by hand, a quick pass through JSON Formatter often reveals pretty-printing that you should strip before sending to the API — whitespace costs tokens too. A minifier pass on JSON responses you embed in prompts can save 20 percent effortlessly.

Patterns that burn tokens silently

The repeated system prompt

A 2,000-token system prompt on every single call is 2,000 tokens you pay for on every call. If your model supports prompt caching, turn it on. If it does not, consider whether every instruction in there actually changes model behavior or is there defensively "just in case."

The pretty-printed JSON blob

Inlining a 50-key JSON object with two-space indentation is dramatically more tokens than the same object minified. Strip whitespace before you paste it into a prompt. Your results will not change; your bill will.

Unbounded chat history

Pass the whole conversation every turn and you are linearly growing tokens with turn count. Summarize old turns into a compact state once the conversation gets long, or use a sliding window that keeps the last N turns plus the summary. A Python-first team might do this with LangChain's memory classes; the principle is the same in any framework.

Uncapped output

If you do not set max_tokens, the model will sometimes decide to write an essay when you wanted a sentence. Set it. Tight output bounds are one of the most effective cost controls because output is the expensive half of most pricing sheets.

Related pillar guide

This cluster post is part of the developer tools track. For the broader foundation on choosing and using free developer tools — including privacy, local vs cloud trade-offs, and the full toolkit — see The Complete Guide to Free Online Tools.

FAQ

Why does the same text produce different token counts on different models?

Each model family trained its tokenizer on its own corpus and ended up with a different vocabulary. The BPE merges are different, so the splits are different. A 1,000-word document will land within a couple hundred tokens across providers, but it will not be identical. Always count with the tokenizer that matches the model you plan to call.

Is there a way to count tokens without installing anything?

Yes. Browser-based counters like ChatGPT Token Counter use WebAssembly or JS ports of the official tokenizers and never upload your text. That is important when your prompts contain customer data or internal strategy documents. Check DevTools → Network to confirm nothing goes out.

Does whitespace really cost tokens?

Yes. Every space, newline, and tab is part of the input. Pretty-printed JSON uses significantly more tokens than minified JSON carrying the same data. If you are embedding large data payloads, minify before sending.

What is prompt caching and when should I use it?

Prompt caching lets the provider store the KV state of a long, fixed prefix of your prompt and reuse it on subsequent calls at a fraction of the cost. Use it when your system prompt or few-shot examples are long and repeated across many requests. Anthropic and Google both support it today, with different APIs and discount factors; the Anthropic docs cover the mechanics for Claude.

Can I predict output tokens before calling the model?

Not exactly, but you can bound them with max_tokens and estimate from the task type. Classification tasks produce a handful of tokens. Summaries produce roughly half the input length. Extraction produces roughly ten percent. Always cap with max_tokens and log actual usage so you can refine your estimate over time.

Closing thought

Tokens are boring until they are not. The teams that ship LLM features without billing drama are the ones that treat tokens as a first-class resource — counted, budgeted, and watched. Spend an afternoon running your real prompts through a browser tokenizer and plotting the distribution. The answers will surprise you, and they will pay for themselves the first month.