Quantization in LLMs: Making Models Faster & Smaller

Vatsal Bajpai

LLMs are revolutionizing AI applications, but their massive size creates deployment challenges. Quantization offers a powerful solution by reducing model size and accelerating inference without significant performance loss. Let's dive into this essential technique for production LLM deployments.

What is Quantization?

Quantization reduces the numerical precision of model weights and activations. Instead of storing values as 32-bit floating point (FP32), we convert them to lower-precision formats:

  • FP16 (16-bit floating point)
  • INT8 (8-bit integer)
  • INT4 (4-bit integer)

This creates smaller models that require less memory bandwidth and enable faster matrix multiplications.
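
To put these formats in perspective, here is a rough back-of-the-envelope calculation of weight storage for a 70B-parameter model at each precision (weights only; real deployments also need memory for activations and the KV cache):

# Approximate weight storage for a 70B-parameter model at each precision
PARAMS = 70e9
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{PARAMS * nbytes / 1e9:.0f} GB")

# FP32: ~280 GB, FP16: ~140 GB, INT8: ~70 GB, INT4: ~35 GB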

How Quantization Works: A Simple Example

Consider a weight matrix with FP32 values:

[1.2371, 0.8954, -0.3421, 2.1095]

Converting to INT8 requires:

  1. Determining value range
  2. Scaling to fit range (-127 to 127)
  3. Rounding to integers
# Original FP32 values
weights = [1.2371, 0.8954, -0.3421, 2.1095]

# Find min/max for scaling
scale = max(abs(min(weights)), abs(max(weights))) / 127
print(f"Scale factor: {scale}")

# Quantize to INT8
quantized = [round(w / scale) for w in weights]
print(f"Quantized INT8: {quantized}")

# Dequantize to recover approximate values
dequantized = [q * scale for q in quantized]
print(f"Dequantized: {dequantized}")

Output:

Scale factor: 0.0166
Quantized INT8: [74, 54, -21, 127]
Dequantized: [1.2292, 0.897, -0.3488, 2.1095]

The dequantized values are close to the originals, demonstrating how quantization maintains reasonable accuracy while using just 25% of the storage.

Quantization Techniques for LLMs

1. Post-Training Quantization (PTQ)

PTQ applies quantization after model training without further fine-tuning:

# Simplified PTQ implementation (symmetric, per-tensor)
import numpy as np

def post_training_quantize(model, bits=8):
    for layer in model.layers:
        if hasattr(layer, 'weights'):
            # Symmetric per-tensor scale for this layer
            weights = layer.weights.flatten()
            scale = max(abs(weights.min()), abs(weights.max())) / (2**(bits-1) - 1)
            
            # Quantize weights
            quantized = np.round(weights / scale).astype(np.int8)
            
            # Store quantized weights and scale
            layer.quantized_weights = quantized
            layer.scale = scale
    
    return model

Popular frameworks implement more sophisticated PTQ algorithms:

  • Zero-point adjustment for asymmetric ranges
  • Per-channel quantization for improved accuracy
  • Calibration with representative data
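
To make the first idea concrete, here is a minimal sketch of asymmetric quantization with a zero point, which handles weight ranges that are not centered on zero (the helper names are ours, not from any particular framework):

import numpy as np

def asymmetric_quantize(weights, bits=8):
    # Map [w_min, w_max] onto the unsigned integer range [0, 2**bits - 1]
    qmin, qmax = 0, 2**bits - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = int(round(-w_min / scale))  # integer that represents the real value 0.0
    q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.array([-0.3, 0.1, 0.8, 1.2], dtype=np.float32)
q, scale, zp = asymmetric_quantize(w)
print(q, dequantize(q, scale, zp))  # quantized codes and their recovered values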

2. Quantization-Aware Training (QAT)

QAT simulates quantization during training to minimize accuracy loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.bits = bits
        self.scale = None

    def fake_quantize(self):
        # Quantize then immediately dequantize ("fake quantization")
        if self.scale is None:
            self.scale = (self.weight.abs().max() / (2**(self.bits-1) - 1)).detach()
        quantized = torch.round(self.weight / self.scale)
        clamped = torch.clamp(quantized, -2**(self.bits-1), 2**(self.bits-1) - 1)
        return clamped * self.scale

    def forward(self, x):
        if self.training:
            # Recompute the dynamic scale from the current weights
            self.scale = (self.weight.abs().max() / (2**(self.bits-1) - 1)).detach()

            # Straight-through estimator: forward with quantized weights,
            # but let gradients flow to the full-precision weights
            w_q = self.weight + (self.fake_quantize() - self.weight).detach()
            return F.linear(x, w_q)
        else:
            # Use the (de)quantized weights during inference
            return F.linear(x, self.fake_quantize())
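
A minimal usage sketch (shapes and data are arbitrary, reusing the class and imports above) showing the train/eval switch:

layer = QuantizedLinear(in_features=16, out_features=8, bits=8)

layer.train()                 # training mode: forward pass simulates quantization
x = torch.randn(4, 16)
loss = layer(x).sum()
loss.backward()               # gradients reach the FP32 weights via the straight-through estimator

layer.eval()                  # inference mode: forward pass uses the quantized weights
with torch.no_grad():
    y = layer(x)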

3. Weight-Only vs. Activation Quantization

  • Weight-only: Quantizes just the model weights; easier to implement and often enough to cut memory use substantially
  • Weight + activation: Quantizes both weights and activations; greater savings and enables integer matrix math, but more complex because activation ranges must be handled at runtime or via calibration (see the sketch below)
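
A minimal sketch of the distinction, assuming NumPy and the same symmetric scheme as above: weights can be quantized once offline, while activations must be quantized on the fly for each batch (the helper below is ours, not from any framework):

import numpy as np

def symmetric_quantize(x, bits=8):
    # Symmetric per-tensor quantization to signed integers
    qmax = 2**(bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Weight-only: done once, offline
W = np.random.randn(8, 16).astype(np.float32)
W_q, w_scale = symmetric_quantize(W)

# Weight + activation: activations are quantized per batch at inference time
x = np.random.randn(4, 16).astype(np.float32)
x_q, x_scale = symmetric_quantize(x)

# Integer matrix multiply, then rescale back to floating point
y = (x_q.astype(np.int32) @ W_q.astype(np.int32).T) * (x_scale * w_scale)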

Advanced Quantization Methods for LLMs

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is an optimized, layer-by-layer quantization method that:

  1. Quantizes one layer at a time
  2. Uses Hessian information to minimize error
  3. Redistributes quantization errors to remaining weights

# Simplified GPTQ pseudocode (column by column within each layer)
def gptq(model, calibration_data):
    for layer in model.layers:
        # Approximate inverse Hessian of the layer inputs from calibration data
        H_inv = inverse_hessian(layer, calibration_data)
        W = layer.weights

        for j in range(W.shape[1]):
            # Quantize column j
            q, scale = quantize_column(W[:, j])

            # Quantization error, weighted by the inverse Hessian diagonal
            error = (W[:, j] - q * scale) / H_inv[j, j]

            # Redistribute the error onto the not-yet-quantized columns
            if j < W.shape[1] - 1:
                W[:, j+1:] -= error[:, None] * H_inv[j, j+1:]

            # Store the quantized column
            W[:, j] = q * scale

AWQ (Activation-aware Weight Quantization)

AWQ preserves important weights by:

  1. Identifying critical weight channels based on activation patterns
  2. Applying different quantization scales to those channels so they lose less precision

It is particularly effective for 4-bit quantization.
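
A highly simplified sketch of the activation-aware idea, assuming NumPy; the real algorithm searches for the per-channel scales that minimize output error on calibration data, but this shows the basic scale-and-compensate trick (function and parameter names are ours):

import numpy as np

def awq_style_quantize(W, act_samples, bits=4, top_frac=0.01, boost=2.0):
    # Rank input channels by average activation magnitude on calibration data
    channel_importance = np.abs(act_samples).mean(axis=0)       # shape: (in_features,)
    n_salient = max(1, int(top_frac * W.shape[1]))
    salient = np.argsort(channel_importance)[-n_salient:]

    # Scale the salient channels up before quantization...
    s = np.ones(W.shape[1])
    s[salient] = boost
    W_scaled = W * s

    # ...quantize symmetrically...
    qmax = 2**(bits - 1) - 1
    scale = np.abs(W_scaled).max() / qmax
    W_q = np.clip(np.round(W_scaled / scale), -qmax - 1, qmax)

    # ...and fold the inverse scaling back into the dequantized weights
    # (equivalently, the activations could be divided by s instead)
    return (W_q * scale) / s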

Performance Comparison

| Model       | Precision | Size   | Speedup (vs FP16) | Accuracy Drop |
|-------------|-----------|--------|-------------------|---------------|
| LLaMA-2-70B | FP16      | 140 GB | 1.0x              | Baseline      |
| LLaMA-2-70B | INT8      | 70 GB  | 1.7x              | -0.2%         |
| LLaMA-2-70B | INT4      | 35 GB  | 2.8x              | -0.8%         |
| LLaMA-2-70B | GPTQ-INT4 | 35 GB  | 2.8x              | -0.5%         |

Implementation Tools

Several frameworks make quantization more accessible:

  • 🤗 Transformers & bitsandbytes: Simple quantization API

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    load_in_8bit=True  # Enable INT8 quantization via bitsandbytes
)
  • ONNX Runtime: Optimized INT8 quantization for production

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_path,               # path to the FP32 ONNX model
    quantized_model_path,     # where to write the INT8 model
    weight_type=QuantType.QInt8
)
  • vLLM: Efficient inference engine with support for quantized models

from vllm import LLM

model = LLM(
    model="path/to/llama-2-70b-awq",  # a checkpoint already quantized with AWQ
    quantization="awq",               # load the 4-bit AWQ weights
    dtype="half"                      # use FP16 for the remaining operations
)

Practical Tips for Quantizing LLMs

  1. Start with weight-only quantization - Simpler implementation with good results
  2. Test different bit-widths - Try 8-bit first, then 4-bit if needed
  3. Compare PTQ techniques - AWQ, GPTQ, and standard methods have different trade-offs
  4. Evaluate on domain-specific tasks - Quantization effects vary by use case
  5. Benchmark latency & throughput - Verify real-world performance improvements
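
For the last tip, a minimal latency and throughput measurement sketch using 🤗 Transformers; the model path and prompt are placeholders, and batch size and sequence lengths should match your real workload:

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-quantized-model"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)

# Warm-up run, then timed generation
model.generate(**inputs, max_new_tokens=16)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed:.2f} s, {new_tokens / elapsed:.1f} tokens/s")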

Learn more about how Matter AI helps improve code quality across multiple languages in pull requests: https://docs.matterai.dev/product/code-quality

Are you looking for a way to improve your code review process? Learn more about how Matter AI helps teams solve code review challenges with AI: https://matterai.so
