Inference

Serving strategies, latency/throughput trade-offs, and cost controls.

Topics

  • Batch size, tensor/continuous batching
  • KV cache management and paged attention
  • Quantization (INT8/INT4), speculative decoding
  • Token streaming and flow control
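The KV cache topic above is largely a capacity-planning exercise: per-request cache size drives batch limits and, in turn, throughput. A minimal back-of-the-envelope sketch, assuming a hypothetical Llama-2-7B-like configuration (32 layers, 32 KV heads, head dim 128, FP16) purely for illustration:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache memory for one sequence.

    Factor of 2 accounts for storing both the K and V tensors per layer;
    dtype_bytes=2 assumes FP16/BF16 elements.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed example config: 32 layers, 32 KV heads, head_dim 128, FP16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_4k_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token)       # 524288 bytes (~0.5 MiB per generated token)
print(per_4k_request)  # 2147483648 bytes (2 GiB for a full 4k context)
```

At roughly 0.5 MiB per token under these assumptions, a single 4k-token request consumes 2 GiB of cache, which is why paged attention (allocating the cache in fixed-size blocks on demand rather than reserving the full context up front) matters for batch size.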