Quantization in LLMs: Making Models Faster & Smaller
LLMs are revolutionizing AI applications, but their massive size creates deployment challenges. Quantization offers a powerful solution by reducing model size and accelerating inference without significant performance loss. Let's dive into this essential technique for production LLM deployments.
What is Quantization?
Quantization reduces the numerical precision of a model's weights and activations. Instead of storing values as 32-bit floating point (FP32), we convert them to lower-precision formats:
- FP16 (16-bit floating point)
- INT8 (8-bit integer)
- INT4 (4-bit integer)
This creates smaller models that require less memory bandwidth and enable faster matrix multiplications.
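To make the savings concrete, here is a quick back-of-the-envelope calculation of weight-memory footprint at different precisions. The 7B parameter count is just an illustrative assumption, and only weight storage is counted (activations, KV cache, and overhead are ignored):
import math

# Rough weight-memory footprint for a hypothetical 7B-parameter model
params = 7e9
bytes_per_value = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_value.items():
    gb = params * nbytes / 1e9
    print(f"{fmt}: ~{gb:.1f} GB")   # FP32 ~28.0 GB ... INT4 ~3.5 GB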
How Quantization Works: A Simple Example
Consider a weight matrix with FP32 values:
[1.2371, 0.8954, -0.3421, 2.1095]
Converting to INT8 requires:
- Determining value range
- Scaling to fit range (-127 to 127)
- Rounding to integers
# Original FP32 values
weights = [1.2371, 0.8954, -0.3421, 2.1095]
# Find the largest absolute value and derive the scale factor (symmetric quantization)
scale = max(abs(min(weights)), abs(max(weights))) / 127
print(f"Scale factor: {scale:.4f}")
# Quantize to INT8
quantized = [round(w / scale) for w in weights]
print(f"Quantized INT8: {quantized}")
# Dequantize to recover approximate values
dequantized = [round(q * scale, 4) for q in quantized]
print(f"Dequantized: {dequantized}")
Output:
Scale factor: 0.0166
Quantized INT8: [74, 54, -21, 127]
Dequantized: [1.2292, 0.897, -0.3488, 2.1095]
The dequantized values are close to the originals, demonstrating how quantization maintains reasonable accuracy while using just 25% of the storage.
Quantization Techniques for LLMs
1. Post-Training Quantization (PTQ)
PTQ applies quantization after model training without further fine-tuning:
# Simplified PTQ implementation
import numpy as np

def post_training_quantize(model, bits=8):
    for layer in model.layers:
        if hasattr(layer, 'weights'):
            # Determine the per-layer scaling factor (symmetric quantization)
            weights = layer.weights.flatten()
            scale = max(abs(weights.min()), abs(weights.max())) / (2**(bits-1) - 1)
            # Quantize weights to signed integers
            quantized = np.round(weights / scale).astype(np.int8)
            # Store quantized weights and the scale needed for dequantization
            layer.quantized_weights = quantized
            layer.scale = scale
    return model
Popular frameworks implement more sophisticated PTQ algorithms (a minimal per-channel, zero-point sketch follows this list):
- Zero-point adjustment for asymmetric ranges
- Per-channel quantization for improved accuracy
- Calibration with representative data
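As a rough illustration of the first two ideas, and not any particular framework's implementation, here is a minimal NumPy sketch of asymmetric, per-channel INT8 quantization with a zero point. The function names and array shapes are assumptions made for the example:
import numpy as np

def quantize_per_channel_asymmetric(W, bits=8):
    """Quantize each output channel (row) of W with its own scale and zero point."""
    qmin, qmax = 0, 2**bits - 1                      # unsigned range, e.g. 0..255
    w_min = W.min(axis=1, keepdims=True)             # per-row minimum
    w_max = W.max(axis=1, keepdims=True)             # per-row maximum
    scale = (w_max - w_min) / (qmax - qmin)          # per-row scale
    zero_point = np.round(qmin - w_min / scale)      # maps the real value 0.0 onto the integer grid
    q = np.clip(np.round(W / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_per_channel(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

W = np.random.randn(4, 8).astype(np.float32)          # toy weight matrix
q, scale, zp = quantize_per_channel_asymmetric(W)
print(np.abs(W - dequantize_per_channel(q, scale, zp)).max())  # small reconstruction error
Per-channel scales follow each row's actual range, which is why they typically lose less accuracy than a single per-tensor scale.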
2. Quantization-Aware Training (QAT)
QAT simulates quantization during training to minimize accuracy loss:
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.bits = bits
        self.scale = None

    def forward(self, x):
        if self.training:
            # Calculate a dynamic scale from the current weights
            self.scale = (self.weight.abs().max() / (2**(self.bits-1) - 1)).detach()
            # Simulate quantization and dequantization (fake quantization)
            dequantized = self.get_quantized_weights()
            # Straight-through estimator: gradients flow to the full-precision weights
            dequantized = self.weight + (dequantized - self.weight).detach()
            # Forward with simulated quantized weights
            return F.linear(x, dequantized)
        else:
            # Use the quantized weights during inference
            return F.linear(x, self.get_quantized_weights())

    def get_quantized_weights(self):
        if self.scale is None:
            self.scale = (self.weight.abs().max() / (2**(self.bits-1) - 1)).detach()
        quantized = torch.round(self.weight / self.scale)
        clamped = torch.clamp(quantized, -2**(self.bits-1), 2**(self.bits-1) - 1)
        return clamped * self.scale
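A quick usage sketch of the module above, assuming the imports from the previous block and a toy regression setup just to show fake-quantized training followed by quantized inference:
layer = QuantizedLinear(in_features=16, out_features=4, bits=8)
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)

x = torch.randn(32, 16)
target = torch.randn(32, 4)

layer.train()
loss = F.mse_loss(layer(x), target)    # forward pass uses fake-quantized weights
loss.backward()                        # gradients reach the FP32 weights via the STE
optimizer.step()

layer.eval()
with torch.no_grad():
    y = layer(x)                       # inference uses the quantized weights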
3. Weight-Only vs. Activation Quantization
- Weight-only: Quantizes just the model weights; easier to implement
- Weight + activation: Quantizes both weights and activations; greater savings but more complex, since activation ranges must be estimated at runtime or via calibration (a small sketch follows below)
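To illustrate the extra step activation quantization adds, here is a minimal sketch of dynamic per-tensor INT8 quantization of an activation batch. The shapes and function name are assumptions for the example:
import torch

def quantize_activations_dynamic(x, bits=8):
    """Pick a scale from the current batch's range and quantize on the fly."""
    qmax = 2**(bits - 1) - 1
    scale = x.abs().max() / qmax                  # per-tensor dynamic scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

x = torch.randn(4, 1024)                          # toy activation batch
q, scale = quantize_activations_dynamic(x)
x_hat = q.float() * scale                         # dequantize for the next FP operation
print((x - x_hat).abs().max())                    # small quantization error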
Advanced Quantization Methods for LLMs
GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ is an optimized, layer-by-layer quantization method that:
- Quantizes one layer at a time
- Uses Hessian information to minimize error
- Redistributes quantization errors to remaining weights
# Simplified GPTQ pseudocode (per layer, column by column)
# compute_hessian, cholesky_inverse and quantize_column are placeholders
def gptq(model, calibration_data):
    for layer in model.layers:
        # Approximate Hessian of the layer output w.r.t. its inputs (from calibration activations)
        H = compute_hessian(layer, calibration_data)
        Hinv = cholesky_inverse(H)           # Cholesky factor of the inverse Hessian
        W = layer.weights                    # shape: (out_features, in_features)
        for j in range(W.shape[1]):
            # Quantize one input column at a time
            q, scale = quantize_column(W[:, j])
            # Quantization error, weighted by the Hessian diagonal
            error = (W[:, j] - q * scale) / Hinv[j, j]
            # Redistribute the error onto the not-yet-quantized columns
            if j < W.shape[1] - 1:
                W[:, j+1:] -= np.outer(error, Hinv[j, j+1:])
            # Store the quantized column
            W[:, j] = q * scale
AWQ (Activation-aware Weight Quantization)
AWQ preserves important weights by (a rough sketch of the core idea follows this list):
- Identifying critical weights based on activation patterns
- Applying different quantization scales to different weight groups
- Being particularly effective for 4-bit quantization
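The following is only a rough NumPy sketch of the core AWQ intuition: scale up weight channels that see large activations before quantizing, and fold the inverse scale into the activations so the product is unchanged. It is not the actual AWQ algorithm or its scale-search procedure, and all names, shapes, and the alpha exponent are assumptions:
import numpy as np

def awq_style_scale_and_quantize(W, act_magnitude, bits=4, alpha=0.5):
    """W: (out, in) weights; act_magnitude: (in,) mean |activation| per input channel."""
    # Channels with larger activations get larger scales, so they keep more precision
    s = act_magnitude ** alpha
    s = s / s.mean()
    W_scaled = W * s                                  # scale weights up per input channel
    qmax = 2**(bits - 1) - 1
    scale = np.abs(W_scaled).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W_scaled / scale), -qmax - 1, qmax)
    # At inference, activations are divided by s, so (W * s) @ (x / s) == W @ x
    return q, scale, s

W = np.random.randn(8, 16).astype(np.float32)
act_magnitude = np.abs(np.random.randn(16)).astype(np.float32) + 0.1
q, scale, s = awq_style_scale_and_quantize(W, act_magnitude)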
Performance Comparison
Model | Precision | Size | Speedup | Accuracy Drop |
---|---|---|---|---|
LLaMA-2-70B | FP16 | 140 GB | 1.0x | Baseline |
LLaMA-2-70B | INT8 | 70 GB | 1.7x | -0.2% |
LLaMA-2-70B | INT4 | 35 GB | 2.8x | -0.8% |
LLaMA-2-70B | GPTQ-INT4 | 35 GB | 2.8x | -0.5% |
Implementation Tools
Several frameworks make quantization more accessible:
- 🤗 Transformers & bitsandbytes: Simple quantization API
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    device_map="auto",
    load_in_8bit=True  # Enable INT8 quantization via bitsandbytes
)
- ONNX Runtime: Optimized INT8 quantization for production
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_path,              # input ONNX model
    quantized_model_path,    # output path for the quantized model
    weight_type=QuantType.QInt8
)
- vLLM: High-throughput inference engine with support for quantized models (e.g., AWQ, GPTQ)
from vllm import LLM

model = LLM(
    model="meta-llama/Llama-2-70b",
    quantization="awq",  # Load AWQ 4-bit weights (the checkpoint must already be AWQ-quantized)
    dtype="half"         # Use FP16 for the remaining operations
)
Practical Tips for Quantizing LLMs
- Start with weight-only quantization - Simpler implementation with good results
- Test different bit-widths - Try 8-bit first, then 4-bit if needed
- Compare PTQ techniques - AWQ, GPTQ, and standard methods have different trade-offs
- Evaluate on domain-specific tasks - Quantization effects vary by use case
- Benchmark latency & throughput - Verify real-world performance improvements (a minimal timing sketch follows this list)
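As a starting point for the last tip, here is a minimal latency and throughput measurement sketch. The `generate` callable is a hypothetical placeholder for whatever inference call your stack exposes, not a specific library's API:
import time

def benchmark(generate, prompts, max_new_tokens=128, warmup=2, runs=10):
    """Measure average latency and rough token throughput for an inference callable."""
    for _ in range(warmup):                  # warm up caches / GPU kernels
        generate(prompts[0], max_new_tokens)
    start = time.perf_counter()
    for i in range(runs):
        generate(prompts[i % len(prompts)], max_new_tokens)
    elapsed = time.perf_counter() - start
    latency = elapsed / runs
    print(f"avg latency: {latency:.2f}s, ~{max_new_tokens / latency:.1f} tokens/s")

# Run once with the FP16 model and once with the quantized model, then compare.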
Learn more about how Matter AI helps improve code quality across multiple languages in pull requests: https://docs.matterai.dev/product/code-quality
Are you looking for a way to improve your code review process? Learn more about how Matter AI helps teams solve code review challenges with AI: https://matterai.so