May 2026LLM economics · inference · vLLM

OpenAI-compatible is not cost-compatible

A matching API surface tells you nothing about what a token actually costs to serve. Two endpoints can speak the same protocol and live in different economic universes.

Two inference endpoints can speak the exact same API and still live in completely different economic universes. "OpenAI-compatible" is a statement about a request schema. It says nothing about what a token costs to produce.

The illusion of a price

When a provider quotes a per-token price, they are quoting an average over a load profile you cannot see. Drop a /v1/chat/completions call into your code and it feels like a settled number — USD x per million tokens, done. But that number is a function of how busy their fleet was when they set it, not how busy your workload keeps a GPU.

Self-host the same open-weight model and the abstraction falls away. Now you own the fleet. The price is whatever your utilization makes it.

What actually moves the cost

The effective cost of a served token is shaped by a handful of variables that the API surface deliberately hides:

Request rate — bursty traffic leaves GPUs idle between spikes, and idle GPUs still cost money.
Concurrency — batching is where the economics live. Low concurrency means you pay H100 prices for A10 throughput.
Latency SLOs — a tight time-to-first-token forces smaller batches, which quietly raises cost per token.
Hardware, model, quantization — the obvious knobs, and the only ones most calculators expose.

You can write all of that as one honest expression:

C_eff = f(H, M, Q, λ, L)

The protocol you call doesn't appear anywhere in that function.

Why it matters

If you size a fleet or quote a customer off a calculator that assumes full utilization, you are pricing a best case you will almost never hit. On identical H100 hardware, the same model can swing from roughly USD 0.21 to USD 15.25 per million output tokens depending only on how it is loaded — a span of more than an order of magnitude, with not one line of the API changed.

OpenAI-compatible is a convenience. It is not a cost guarantee. The only way to know what you actually pay is to measure it under your own traffic — which is exactly what the vllm-cost-meter is for.