Inference

Serving strategies, latency/throughput trade-offs, and cost controls.

Topics

  • Batch size, tensor/continuous batching
  • KV cache management and paged attention
  • Quantization (INT8/INT4), speculative decoding
  • Token streaming and flow control
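The KV cache topic above is largely a capacity-planning exercise: per-request cache size drives batch limits and, in turn, throughput. A minimal back-of-the-envelope sketch, assuming a hypothetical Llama-2-7B-like configuration (32 layers, 32 KV heads, head dim 128, FP16) purely for illustration:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache memory for one sequence.

    Factor of 2 accounts for storing both the K and V tensors per layer;
    dtype_bytes=2 assumes FP16/BF16 elements.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed example config: 32 layers, 32 KV heads, head_dim 128, FP16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_4k_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token)       # 524288 bytes (~0.5 MiB per generated token)
print(per_4k_request)  # 2147483648 bytes (2 GiB for a full 4k context)
```

At roughly 0.5 MiB per token under these assumptions, a single 4k-token request consumes 2 GiB of cache, which is why paged attention (allocating the cache in fixed-size blocks on demand rather than reserving the full context up front) matters for batch size.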