Research · independent

LLM inference economics

Measuring what serving open-weight models actually costs under production-like load — not what a calculator assumes.

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

Public LLM cost calculators often reduce serving economics to a static token price or assumed utilization. This work studies how request rate, concurrency, latency SLOs, hardware, model architecture, and quantization interact to change the real effective cost of self-hosted LLM inference.

arXiv: 2606.11690
Pages: 26
Figures: 9
Tool: vllm-cost-meter

Underutilization penalty: 36.3× (effective cost near idle vs. saturated, same hardware)

USD / 1M output tokens: 0.21 → 15.25 (effective cost range on identical H100 hardware)

Benchmark validation runs: 42 (across load and concurrency regimes)

Hardware validated: H100 / A100 (two GPU classes, measured not assumed)

Figures as reported in the paper. See repository for benchmark methodology and reproduction.

Token price is not serving cost.

A per-token price is an average over a load profile you cannot see. It is a billing convenience, not a measurement of what a token costs to produce on your own hardware.

Load and concurrency shape the economics.

Real traffic is bursty. Idle GPUs still cost money, tight latency SLOs shrink batches, and both push effective cost far away from the fully-utilized best case.
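To make the idle-GPU effect concrete, here is a minimal sketch of the arithmetic. It assumes effective cost is simply what the hardware cost over a time window divided by the output tokens produced in that window. The hourly price, peak throughput, and daily load shape are hypothetical values chosen for illustration; they are not figures from the paper.

```python
def effective_cost_per_million(gpu_hourly_usd: float,
                               output_tokens: float,
                               hours: float) -> float:
    """USD per 1M output tokens: spend over the window / tokens produced."""
    if output_tokens <= 0:
        raise ValueError("no output tokens produced in the window")
    return gpu_hourly_usd * hours / output_tokens * 1_000_000


# Hypothetical inputs, for illustration only (not from the paper).
GPU_HOURLY_USD = 3.00                 # assumed rental price for one GPU
SATURATED_TOKENS_PER_SEC = 2_000.0    # assumed peak output throughput

# Best case: one hour at full throughput.
saturated_cost = effective_cost_per_million(
    GPU_HOURLY_USD, SATURATED_TOKENS_PER_SEC * 3600, hours=1.0
)

# Bursty day (assumed): 2 busy hours at peak, 22 quiet hours at 2% of peak.
busy_tokens = SATURATED_TOKENS_PER_SEC * 3600 * 2
quiet_tokens = SATURATED_TOKENS_PER_SEC * 0.02 * 3600 * 22
daily_cost = effective_cost_per_million(
    GPU_HOURLY_USD, busy_tokens + quiet_tokens, hours=24.0
)

print(f"saturated: ${saturated_cost:.2f} / 1M output tokens")
print(f"bursty day: ${daily_cost:.2f} / 1M output tokens")
print(f"penalty: {daily_cost / saturated_cost:.1f}x")
```

The GPU bill is the same in both cases; only the denominator changes, which is why the same hardware can span a wide cost range.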

Method

A cost function, stated honestly

C_eff = f(H, M, Q, λ, L)

C_eff: effective cost per million output tokens
H: hardware (GPU class, count)
M: model architecture
Q: quantization
λ: request arrival rate
L: latency SLO (TTFT / TPOT)
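One way to read the role of L in this function is as a constraint on which measured operating points are allowed. The sketch below is an illustration, not the paper's method: it assumes a table of benchmark points for a fixed H, M, and Q, each measured under enough load to sustain its concurrency level, and picks the highest-throughput point whose TPOT meets the SLO. The `OperatingPoint` type, the benchmark numbers, and the hourly price are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """One measured benchmark point for a fixed hardware/model/quantization."""
    concurrency: int
    output_tokens_per_sec: float
    tpot_ms: float  # time per output token


def best_point_within_slo(points: Sequence[OperatingPoint],
                          tpot_slo_ms: float) -> Optional[OperatingPoint]:
    """Highest-throughput point that satisfies the TPOT SLO, or None."""
    feasible = [p for p in points if p.tpot_ms <= tpot_slo_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.output_tokens_per_sec)


def cost_per_million(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M output tokens at a sustained throughput."""
    return gpu_hourly_usd / (tokens_per_sec * 3600) * 1_000_000


# Hypothetical measurements, for illustration only (not from the paper).
POINTS = [
    OperatingPoint(concurrency=1, output_tokens_per_sec=60.0, tpot_ms=16.0),
    OperatingPoint(concurrency=8, output_tokens_per_sec=420.0, tpot_ms=19.0),
    OperatingPoint(concurrency=32, output_tokens_per_sec=1300.0, tpot_ms=25.0),
    OperatingPoint(concurrency=128, output_tokens_per_sec=2600.0, tpot_ms=48.0),
]
GPU_HOURLY_USD = 3.00  # assumed price

for slo_ms in (20.0, 50.0):
    point = best_point_within_slo(POINTS, slo_ms)
    cost = cost_per_million(GPU_HOURLY_USD, point.output_tokens_per_sec)
    print(f"TPOT SLO {slo_ms:.0f} ms -> concurrency {point.concurrency}, "
          f"${cost:.2f} / 1M output tokens")
```

With these made-up numbers, tightening the SLO forces a smaller batch and a higher cost per token on the same GPU, which is the interaction the cost function is meant to capture.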

Tool

vllm-cost-meter

The companion artifact — a read-only meter that turns live vLLM telemetry into effective cost.

Flagship · open source

Objective live telemetry and an effective cost-per-million-token meter for vLLM servers.

A read-only observer for running vLLM servers that ingests Prometheus metrics and surfaces live effective LLM serving cost against the operator's actual traffic.

Reads vLLM Prometheus metrics
Tracks throughput, request rate, TTFT, TPOT, E2E latency, prompt / generation lengths, batch state, and KV cache
Computes live effective cost per million tokens
Ships benchmark reference curves from the paper
Independent, reproducible research artifact
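The general idea of turning Prometheus counters into a live cost figure can be sketched as follows. This is not vllm-cost-meter's code. It assumes two scrapes of a vLLM `/metrics` endpoint taken a known interval apart and a generation-token counter named `vllm:generation_tokens_total`; metric names can differ between vLLM versions, so check your server's output. The sample scrape text, the interval, and the hourly price are hypothetical, and the parser ignores optional timestamps in the exposition format.

```python
from typing import Optional


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum a metric across all label sets in Prometheus text exposition."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]        # drop the {label="..."} part
        if name == metric_name:
            total += float(value)
    return total


def live_cost_per_million(prev_scrape: str, curr_scrape: str,
                          interval_s: float, gpu_hourly_usd: float,
                          metric: str = "vllm:generation_tokens_total",
                          ) -> Optional[float]:
    """USD per 1M output tokens over the interval, or None if undefined."""
    delta = sum_metric(curr_scrape, metric) - sum_metric(prev_scrape, metric)
    if delta <= 0 or interval_s <= 0:  # idle window or counter reset
        return None
    tokens_per_sec = delta / interval_s
    return gpu_hourly_usd / (tokens_per_sec * 3600) * 1_000_000


# Hypothetical scrapes taken 60 seconds apart, for illustration only.
SCRAPE_T0 = """\
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="example-model"} 1000000.0
"""
SCRAPE_T1 = """\
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="example-model"} 1090000.0
"""

cost = live_cost_per_million(SCRAPE_T0, SCRAPE_T1, interval_s=60.0,
                             gpu_hourly_usd=3.00)  # assumed price
print(f"live effective cost: ${cost:.2f} / 1M output tokens")
```

Returning `None` for an idle window is a design choice for this sketch: cost per token is undefined when no tokens were produced, and a real meter would more likely average over a longer window.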

Why this matters

Operators deserve the truth

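Operators size GPU fleets and quote prices off calculators that assume full utilization. Real traffic is bursty, so the assumed cost and the billed reality diverge, sometimes by more than an order of magnitude: the paper reports a 36.3× gap in effective cost between near-idle and saturated operation on the same hardware.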

A meter that reads live telemetry tells the truth about cost under your actual load. A calculator only tells you a best case you will rarely hit.

Open questions

Where the research goes next

01 How should latency SLOs be priced when they force lower batch sizes?
02 What does a fair, load-aware cost benchmark look like across model families?
03 Where does speculative decoding move the cost curve, and for whom?

Cite

Citation

@misc{patil2026beyond,
  title  = {Beyond Per-Token Pricing: A Concurrency-Aware Methodology
            for LLM Infrastructure Cost Estimation},
  author = {Patil, Chitral},
  year   = {2026},
  eprint = {2606.11690},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC}
}