Research · independent
LLM inference economics
Measuring what serving open-weight models actually costs under production-like load — not what a calculator assumes.
Featured paper
Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation
Public LLM cost calculators often reduce serving economics to a static token price or assumed utilization. This work studies how request rate, concurrency, latency SLOs, hardware, model architecture, and quantization interact to change the real effective cost of self-hosted LLM inference.
- arXiv: 2606.11690
- Pages: 26
- Figures: 9
- Tool: vllm-cost-meter
Figures as reported in the paper. See repository for benchmark methodology and reproduction.
Token price is not serving cost.
A per-token price is an average over a load profile you cannot see. It is a billing convenience, not a measurement of what a token costs to produce on your own hardware.
Load and concurrency shape the economics.
Real traffic is bursty. Idle GPUs still cost money, tight latency SLOs shrink batches, and both push effective cost far away from the fully-utilized best case.
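The arithmetic behind that gap is short enough to sketch. The snippet below is an illustration only: the function name, the $2.00 hourly GPU price, and the 2,000 tokens/s peak throughput are assumptions chosen for the example, not figures from the paper. It divides what the hardware costs per hour by the output tokens it actually produced in that hour.

```python
def effective_cost_per_million(
    gpu_hourly_usd: float,
    gpu_count: int,
    output_tokens_per_hour: float,
) -> float:
    """USD per million output tokens, given what the fleet actually produced.

    The GPUs are billed for the whole hour whether or not they were busy,
    so idle time raises the result.
    """
    if output_tokens_per_hour <= 0:
        raise ValueError("no tokens produced: effective cost is undefined")
    fleet_hourly_usd = gpu_hourly_usd * gpu_count
    return fleet_hourly_usd / output_tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers: one GPU at $2.00/hour, 2,000 output tokens/s at peak.
    peak_tokens_per_hour = 2_000 * 3_600
    for utilization in (1.0, 0.5, 0.1):
        cost = effective_cost_per_million(2.0, 1, peak_tokens_per_hour * utilization)
        print(f"utilization {utilization:.0%}: ${cost:.2f} per million output tokens")
```

Under these assumed numbers the same GPU costs about $0.28 per million output tokens when fully busy and about $2.78 when it is busy a tenth of the time. Nothing about the hardware or the model changed, only the load.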
Method
A cost function, stated honestly
C_eff = f(H, M, Q, λ, L)
- C_eff: effective cost per million output tokens
- H: hardware (GPU class, count)
- M: model architecture
- Q: quantization
- λ: request arrival rate
- L: latency SLO (TTFT / TPOT)
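To make the shape of this function concrete, here is a toy instance. It is not the paper's model. It assumes, purely for illustration, that time per output token (TPOT) grows linearly with the number of concurrent sequences, and it folds H, M, and Q into three hypothetical parameters: an hourly price, a base TPOT, and a per-sequence slope. All names and numbers are invented for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    gpu_hourly_usd: float           # stands in for H (hypothetical price)
    base_tpot_ms: float             # TPOT with one sequence; stands in for M and Q
    tpot_slope_ms: float            # assumed extra ms per token per added sequence
    arrival_rate_rps: float         # lambda
    output_tokens_per_request: int  # assumed fixed generation length
    tpot_slo_ms: float              # L, expressed as a TPOT limit


def tpot_ms(s: Scenario, batch: int) -> float:
    """Toy latency model: TPOT rises linearly with concurrent sequences."""
    return s.base_tpot_ms + s.tpot_slope_ms * (batch - 1)


def max_batch_under_slo(s: Scenario, limit: int = 1024) -> int:
    """Largest batch size whose modeled TPOT still meets the SLO."""
    batch = 0
    while batch < limit and tpot_ms(s, batch + 1) <= s.tpot_slo_ms:
        batch += 1
    return batch


def effective_cost(s: Scenario) -> float:
    """USD per million output tokens under the toy model."""
    batch = max_batch_under_slo(s)
    if batch == 0:
        raise ValueError("SLO cannot be met even with a single sequence")
    capacity_tps = batch * 1000.0 / tpot_ms(s, batch)
    demand_tps = s.arrival_rate_rps * s.output_tokens_per_request
    # The server cannot emit more tokens than traffic asks for, nor more
    # than the SLO-limited batch allows. Queueing beyond capacity is ignored.
    served_tps = min(capacity_tps, demand_tps)
    if served_tps <= 0:
        raise ValueError("no traffic: effective cost is undefined")
    return s.gpu_hourly_usd / (served_tps * 3600.0) * 1_000_000
```

With a $2.00 hourly price, a 20 ms base TPOT, a 1 ms slope, 200 output tokens per request, and a 50 ms SLO, this toy model gives about $2.78 per million output tokens at 1 request/s, because the GPU is mostly idle. Raising the arrival rate until the batch limit binds lowers the cost, and tightening the SLO to 30 ms at that saturated load raises it again, since fewer sequences fit under the limit.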
Tool
vllm-cost-meter
The companion artifact — a read-only meter that turns live vLLM telemetry into effective cost.
Objective live telemetry + effective cost-per-million-token meter for vLLM servers.
A read-only observer for running vLLM servers that ingests Prometheus metrics and surfaces live effective LLM serving cost against the operator's actual traffic.
- Reads vLLM Prometheus metrics
- Tracks throughput, request rate, TTFT, TPOT, E2E latency, prompt / generation lengths, batch state, and KV cache
- Computes a live effective cost per million tokens
- Ships benchmark reference curves from the paper
- Independent, reproducible research artifact
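The core idea of metering from telemetry can be sketched without the tool itself. The snippet below is not vllm-cost-meter's implementation. It is a minimal illustration that takes two Prometheus text-format scrapes as strings, reads a generation-token counter from each, and converts the difference into an effective cost. The counter name `vllm:generation_tokens_total` is an assumption about what the server exposes; check your own `/metrics` output. The function names and the hourly price parameter are hypothetical.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` in a Prometheus text-format scrape.

    Samples look like `name{labels} value [timestamp]` or `name value`.
    Summing across label sets merges per-model series into one total.
    """
    total = 0.0
    found = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            head, _, tail = line.rpartition("}")
            metric_name = head.split("{", 1)[0]
        else:
            metric_name, _, tail = line.partition(" ")
        if metric_name != name:
            continue
        fields = tail.split()
        if fields:
            total += float(fields[0])
            found = True
    if not found:
        raise KeyError(f"metric {name!r} not present in scrape")
    return total


def live_cost_per_million(
    before: str,
    after: str,
    interval_s: float,
    fleet_hourly_usd: float,
    counter: str = "vllm:generation_tokens_total",  # assumed metric name
) -> float:
    """USD per million output tokens over the window between two scrapes."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    produced = read_counter(after, counter) - read_counter(before, counter)
    if produced <= 0:
        # Idle window, or a counter reset that this sketch does not handle.
        return float("inf")
    tokens_per_hour = produced / interval_s * 3600.0
    return fleet_hourly_usd / tokens_per_hour * 1_000_000
```

A real meter would scrape over HTTP on a schedule, handle counter resets, and smooth over several windows. The sketch keeps scraping out of scope so that the cost arithmetic stays visible.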
Why this matters
Operators deserve the truth
Operators size GPU fleets and quote prices off calculators that assume full utilization. Real traffic is bursty, so the assumed cost and the billed reality diverge — sometimes by more than an order of magnitude.
A meter that reads live telemetry tells the truth about cost under your actual load. A calculator only tells you a best case you will rarely hit.
Open questions
Where the research goes next
- 01 How should latency SLOs be priced when they force lower batch sizes?
- 02 What does a fair, load-aware cost benchmark look like across model families?
- 03 Where does speculative decoding move the cost curve, and for whom?
Cite
Citation
@misc{patil2026beyond,
  title         = {Beyond Per-Token Pricing: A Concurrency-Aware Methodology
                   for LLM Infrastructure Cost Estimation},
  author        = {Patil, Chitral},
  year          = {2026},
  eprint        = {2606.11690},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC}
}