Inference Systems

Making Neural Networks Smaller: Quantization and Pruning

Large language models are increasingly powerful but remain bottlenecked by memory, both for storing weights and for the KV cache that grows with context length. Reducing this footprint unlocks faster inference and makes deployment practical at scale. This talk traces the evolution of quantization and pruning methods from early network compression to today's frontier techniques. It highlights recurring challenges such as outliers, the quantization gap, and the tension between algorithmic compression and real-world speedups, and it covers recent work on KV cache compression. In short, it covers what everyone building and using AI needs to know about making models fit and run.

Speakers