
Model Optimization & Quantization

Definition

Techniques that make ML models faster and smaller for production deployment: quantization, pruning, graph optimization, and hardware-specific compilation.


Optimization Techniques

| Technique | Speedup | Size Reduction | Accuracy Impact |
|---|---|---|---|
| INT8 quantization | 2-4x | 4x smaller | < 1% loss |
| FP16 (half precision) | 1.5-2x | 2x smaller | Negligible |
| Pruning (structured) | 1.5-3x | 2-5x smaller | 1-3% loss |
| TensorRT optimization | 2-5x | None | Negligible |
| ONNX Runtime | 1.5-3x | None | None (same model) |
| Operator fusion | 1.2-1.5x | None | None |
| Dynamic batching | 2-8x (throughput) | None | None |
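
As a concrete example of the first row, here is a minimal sketch of post-training dynamic INT8 quantization in PyTorch; the two-layer model is a placeholder for your own network.

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your own trained module.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization converts Linear weights to INT8 ahead of time and
# quantizes activations on the fly; no calibration data is required.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```

Dynamic quantization is the lowest-effort entry point; static (calibrated) quantization typically recovers more speed on convolutional models.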

Quantization Types

| Type | Precision | When to Use |
|---|---|---|
| FP32 | Full precision | Training, reference |
| FP16 | Half precision | GPU inference (default) |
| INT8 | 8-bit integer | Production deployment, mobile |
| INT4 | 4-bit integer | Aggressive compression, LLMs |
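
A minimal FP16 sketch in PyTorch, assuming a CUDA-capable GPU is available; the model and input shapes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your own trained module.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model = model.half().cuda().eval()  # cast weights FP32 -> FP16

x = torch.randn(8, 512, dtype=torch.float16, device="cuda")
with torch.no_grad():
    out = model(x)                  # forward pass runs in FP16
print(out.dtype)  # torch.float16
```

For numerically sensitive layers, `torch.autocast("cuda", dtype=torch.float16)` is a safer alternative, since it keeps selected ops in FP32 automatically.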

Tools

| Tool | Platform | Best For |
|---|---|---|
| TensorRT | NVIDIA GPU | Fastest GPU inference |
| ONNX Runtime | Cross-platform | Portability, CPU+GPU |
| OpenVINO | Intel CPU/GPU | Intel hardware optimization |
| CoreML | Apple | iOS/macOS deployment |
| TFLite | Mobile | Android/iOS deployment |
| NCNN | Mobile | Fastest mobile inference |
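
A minimal sketch of serving an exported model with ONNX Runtime, with all graph optimizations (including operator fusion) enabled. The file name `model.onnx` and the input name `input` are placeholders for your own export.

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph-level optimizations, including operator fusion.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Falls back to CPU if the CUDA provider is unavailable.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to your exported model
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.randn(1, 512).astype(np.float32)
outputs = session.run(None, {"input": x})  # None -> return all outputs
print(outputs[0].shape)
```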

Key Takeaways

  • INT8 quantization is the single most impactful optimization: 2-4x faster, < 1% accuracy loss
  • TensorRT for NVIDIA GPUs, ONNX Runtime for cross-platform, CoreML/TFLite for mobile
  • Combine techniques: quantization + operator fusion + batching = 5-10x total speedup
  • Always validate accuracy after optimization; quantization can amplify edge-case errors (see the sketch below)
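
A minimal validation sketch along these lines: compare the optimized model against the FP32 baseline on a held-out set. The models and the random stand-in dataset here are placeholders for your own baseline, optimized model, and validation data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    """Top-1 accuracy of `model` over `loader`."""
    correct = total = 0
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

# Placeholder baseline and an INT8-quantized copy; substitute your own.
baseline = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
optimized = torch.ao.quantization.quantize_dynamic(
    baseline, {nn.Linear}, dtype=torch.qint8
)

# Random stand-in for a held-out validation set.
val_loader = DataLoader(
    TensorDataset(torch.randn(256, 512), torch.randint(0, 10, (256,))),
    batch_size=32,
)

base_acc = accuracy(baseline, val_loader)
opt_acc = accuracy(optimized, val_loader)
print(f"baseline {base_acc:.4f} -> optimized {opt_acc:.4f} "
      f"(drop {base_acc - opt_acc:.4f})")
```

Aggregate accuracy alone can hide regressions; it is worth also checking per-class metrics and known hard examples, since quantization error is rarely uniform.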