Quantization in LLMs: Making Models Faster & Smaller
LLMs are revolutionizing AI applications, but their massive size creates deployment challenges. Quantization offers a powerful solution by reducing model size and accelerating inference without significant performance loss. Let's dive into this essential technique for production LLM deployments.
What is Quantization?
Quantization reduces the numerical precision of a model's weights and activations. Instead of storing values as 32-bit floating point (FP32), we convert them to lower-precision formats:
- FP16 (16-bit floating point)
- INT8 (8-bit integer)
- INT4 (4-bit integer)
This creates smaller models that require less memory bandwidth and enable faster matrix multiplications.
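To make the savings concrete, here is a quick back-of-the-envelope calculation of weight-memory footprint at different precisions. The 7B parameter count is just an illustrative assumption, and only weight storage is counted (activations, KV cache, and overhead are ignored):
import math

# Rough weight-memory footprint for a hypothetical 7B-parameter model
params = 7e9
bytes_per_value = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_value.items():
    gb = params * nbytes / 1e9
    print(f"{fmt}: ~{gb:.1f} GB")   # FP32 ~28.0 GB ... INT4 ~3.5 GB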
How Quantization Works: A Simple Example
Consider a weight matrix with FP32 values:
[1.2371, 0.8954, -0.3421, 2.1095]
Converting to INT8 requires:
- Determining value range
- Scaling to fit range (-127 to 127)
- Rounding to integers
# Original FP32 values
weights = [1.2371, 0.8954, -0.3421, 2.1095]
# Find the largest absolute value and derive the scale factor (symmetric quantization)
scale = max(abs(min(weights)), abs(max(weights))) / 127
print(f"Scale factor: {scale:.4f}")
# Quantize to INT8
quantized = [round(w / scale) for w in weights]
print(f"Quantized INT8: {quantized}")
# Dequantize to recover approximate values
dequantized = [round(q * scale, 4) for q in quantized]
print(f"Dequantized: {dequantized}")
Output:
Scale factor: 0.0166
Quantized INT8: [74, 54, -21, 127]
Dequantized: [1.2292, 0.897, -0.3488, 2.1095]
The dequantized values are close to the originals, demonstrating how quantization maintains reasonable accuracy while using just 25% of the storage.
Quantization Techniques for LLMs
1. Post-Training Quantization (PTQ)
PTQ applies quantization after model training without further fine-tuning:
# Simplified PTQ implementation
import numpy as np

def post_training_quantize(model, bits=8):
    for layer in model.layers:
        if hasattr(layer, 'weights'):
            # Determine the per-layer scaling factor (symmetric quantization)
            weights = layer.weights.flatten()
            scale = max(abs(weights.min()), abs(weights.max())) / (2**(bits-1) - 1)
            # Quantize weights to signed integers
            quantized = np.round(weights / scale).astype(np.int8)
            # Store quantized weights and the scale needed for dequantization
            layer.quantized_weights = quantized
            layer.scale = scale
    return model
Popular frameworks implement more sophisticated PTQ algorithms (a minimal per-channel, zero-point sketch follows this list):
- Zero-point adjustment for asymmetric ranges
- Per-channel quantization for improved accuracy
- Calibration with representative data
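As a rough illustration of the first two ideas, and not any particular framework's implementation, here is a minimal NumPy sketch of asymmetric, per-channel INT8 quantization with a zero point. The function names and array shapes are assumptions made for the example:
import numpy as np

def quantize_per_channel_asymmetric(W, bits=8):
    """Quantize each output channel (row) of W with its own scale and zero point."""
    qmin, qmax = 0, 2**bits - 1                      # unsigned range, e.g. 0..255
    w_min = W.min(axis=1, keepdims=True)             # per-row minimum
    w_max = W.max(axis=1, keepdims=True)             # per-row maximum
    scale = (w_max - w_min) / (qmax - qmin)          # per-row scale
    zero_point = np.round(qmin - w_min / scale)      # maps the real value 0.0 onto the integer grid
    q = np.clip(np.round(W / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_per_channel(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

W = np.random.randn(4, 8).astype(np.float32)          # toy weight matrix
q, scale, zp = quantize_per_channel_asymmetric(W)
print(np.abs(W - dequantize_per_channel(q, scale, zp)).max())  # small reconstruction error
Per-channel scales follow each row's actual range, which is why they typically lose less accuracy than a single per-tensor scale.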
2. Quantization-Aware Training (QAT)
QAT simulates quantization during training to minimize accuracy loss:
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.bits = bits
        self.scale = None

    def forward(self, x):
        if self.training:
            # Calculate a dynamic scale from the current weights
            self.scale = (self.weight.abs().max() / (2**(self.bits-1) - 1)).detach()
            # Simulate quantization and dequantization (fake quantization)
            dequantized = self.get_quantized_weights()
            # Straight-through estimator: gradients flow to the full-precision weights
            dequantized = self.weight + (dequantized - self.weight).detach()
            # Forward with simulated quantized weights
            return F.linear(x, dequantized)
        else:
            # Use the quantized weights during inference
            return F.linear(x, self.get_quantized_weights())

    def get_quantized_weights(self):
        if self.scale is None:
            self.scale = (self.weight.abs().max() / (2**(self.bits-1) - 1)).detach()
        quantized = torch.round(self.weight / self.scale)
        clamped = torch.clamp(quantized, -2**(self.bits-1), 2**(self.bits-1) - 1)
        return clamped * self.scale
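A quick usage sketch of the module above, assuming the imports from the previous block and a toy regression setup just to show fake-quantized training followed by quantized inference:
layer = QuantizedLinear(in_features=16, out_features=4, bits=8)
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)

x = torch.randn(32, 16)
target = torch.randn(32, 4)

layer.train()
loss = F.mse_loss(layer(x), target)    # forward pass uses fake-quantized weights
loss.backward()                        # gradients reach the FP32 weights via the STE
optimizer.step()

layer.eval()
with torch.no_grad():
    y = layer(x)                       # inference uses the quantized weights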
3. Weight-Only vs. Activation Quantization
- Weight-only: Quantizes just the model weights; easier to implement
- Weight + activation: Quantizes both weights and activations; greater savings but more complex, since activation ranges must be estimated at runtime or via calibration (a small sketch follows below)
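To illustrate the extra step activation quantization adds, here is a minimal sketch of dynamic per-tensor INT8 quantization of an activation batch. The shapes and function name are assumptions for the example:
import torch

def quantize_activations_dynamic(x, bits=8):
    """Pick a scale from the current batch's range and quantize on the fly."""
    qmax = 2**(bits - 1) - 1
    scale = x.abs().max() / qmax                  # per-tensor dynamic scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

x = torch.randn(4, 1024)                          # toy activation batch
q, scale = quantize_activations_dynamic(x)
x_hat = q.float() * scale                         # dequantize for the next FP operation
print((x - x_hat).abs().max())                    # small quantization error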
Advanced Quantization Methods for LLMs
GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ is an optimized, layer-by-layer quantization method that:
- Quantizes one layer at a time
- Uses Hessian information to minimize error
- Redistributes quantization errors to remaining weights
# Simplified GPTQ pseudocode (per layer, column by column)
# compute_hessian, cholesky_inverse and quantize_column are placeholders
def gptq(model, calibration_data):
    for layer in model.layers:
        # Approximate Hessian of the layer output w.r.t. its inputs (from calibration activations)
        H = compute_hessian(layer, calibration_data)
        Hinv = cholesky_inverse(H)           # Cholesky factor of the inverse Hessian
        W = layer.weights                    # shape: (out_features, in_features)
        for j in range(W.shape[1]):
            # Quantize one input column at a time
            q, scale = quantize_column(W[:, j])
            # Quantization error, weighted by the Hessian diagonal
            error = (W[:, j] - q * scale) / Hinv[j, j]
            # Redistribute the error onto the not-yet-quantized columns
            if j < W.shape[1] - 1:
                W[:, j+1:] -= np.outer(error, Hinv[j, j+1:])
            # Store the quantized column
            W[:, j] = q * scale
AWQ (Activation-aware Weight Quantization)
AWQ preserves important weights by (a rough sketch of the core idea follows this list):
- Identifying critical weights based on activation patterns
- Applying different quantization scales to different weight groups
- Being particularly effective for 4-bit quantization
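The following is only a rough NumPy sketch of the core AWQ intuition: scale up weight channels that see large activations before quantizing, and fold the inverse scale into the activations so the product is unchanged. It is not the actual AWQ algorithm or its scale-search procedure, and all names, shapes, and the alpha exponent are assumptions:
import numpy as np

def awq_style_scale_and_quantize(W, act_magnitude, bits=4, alpha=0.5):
    """W: (out, in) weights; act_magnitude: (in,) mean |activation| per input channel."""
    # Channels with larger activations get larger scales, so they keep more precision
    s = act_magnitude ** alpha
    s = s / s.mean()
    W_scaled = W * s                                  # scale weights up per input channel
    qmax = 2**(bits - 1) - 1
    scale = np.abs(W_scaled).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W_scaled / scale), -qmax - 1, qmax)
    # At inference, activations are divided by s, so (W * s) @ (x / s) == W @ x
    return q, scale, s

W = np.random.randn(8, 16).astype(np.float32)
act_magnitude = np.abs(np.random.randn(16)).astype(np.float32) + 0.1
q, scale, s = awq_style_scale_and_quantize(W, act_magnitude)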
Performance Comparison
Model | Precision | Size | Speedup | Accuracy Drop |
---|---|---|---|---|
LLaMA-2-70B | FP16 | 140 GB | 1.0x | Baseline |
LLaMA-2-70B | INT8 | 70 GB | 1.7x | -0.2% |
LLaMA-2-70B | INT4 | 35 GB | 2.8x | -0.8% |
LLaMA-2-70B | GPTQ-INT4 | 35 GB | 2.8x | -0.5% |
Implementation Tools
Several frameworks make quantization more accessible:
- 🤗 Transformers & bitsandbytes: Simple quantization API
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    device_map="auto",
    load_in_8bit=True  # Enable INT8 quantization via bitsandbytes
)
- ONNX Runtime: Optimized INT8 quantization for production
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_path,              # input ONNX model
    quantized_model_path,    # output path for the quantized model
    weight_type=QuantType.QInt8
)
- vLLM: High-throughput inference engine with support for quantized models (e.g., AWQ, GPTQ)
from vllm import LLM

model = LLM(
    model="meta-llama/Llama-2-70b",
    quantization="awq",  # Load AWQ 4-bit weights (the checkpoint must already be AWQ-quantized)
    dtype="half"         # Use FP16 for the remaining operations
)
Practical Tips for Quantizing LLMs
- Start with weight-only quantization - Simpler implementation with good results
- Test different bit-widths - Try 8-bit first, then 4-bit if needed
- Compare PTQ techniques - AWQ, GPTQ, and standard methods have different trade-offs
- Evaluate on domain-specific tasks - Quantization effects vary by use case
- Benchmark latency & throughput - Verify real-world performance improvements (a minimal timing sketch follows this list)
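As a starting point for the last tip, here is a minimal latency and throughput measurement sketch. The `generate` callable is a hypothetical placeholder for whatever inference call your stack exposes, not a specific library's API:
import time

def benchmark(generate, prompts, max_new_tokens=128, warmup=2, runs=10):
    """Measure average latency and rough token throughput for an inference callable."""
    for _ in range(warmup):                  # warm up caches / GPU kernels
        generate(prompts[0], max_new_tokens)
    start = time.perf_counter()
    for i in range(runs):
        generate(prompts[i % len(prompts)], max_new_tokens)
    elapsed = time.perf_counter() - start
    latency = elapsed / runs
    print(f"avg latency: {latency:.2f}s, ~{max_new_tokens / latency:.1f} tokens/s")

# Run once with the FP16 model and once with the quantized model, then compare.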
Learn more about how Matter AI helps improve code quality across multiple languages in pull requests: https://docs.matterai.dev/product/code-quality
Are you looking for a way to improve your code review process? Learn more about how Matter AI helps teams solve code review challenges with AI: https://matterai.so