# Inference

Serving strategies, latency/throughput trade-offs, and cost controls.

## Topics

- Batch size; tensor and continuous batching
- KV cache management and paged attention
- Quantization (INT8/INT4), speculative decoding
- Token streaming and flow control
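To make one of these topics concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization, the simplest scheme behind the "Quantization (INT8/INT4)" item. The function names and the uniform symmetric scheme are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map floats to int8 via one scale.

    Illustrative sketch; real serving stacks typically use per-channel
    scales and calibrated clipping ranges.
    """
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

# Round-trip a small weight vector; error is bounded by half the scale.
w = np.array([0.5, -1.2, 0.03, 2.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Storing `q` instead of `w` cuts weight memory 4x versus float32; the trade-off is the per-element rounding error, bounded by half the scale for in-range values.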