Size Matters: LLM Quantization Techniques (Part 1)
Hello World!
Table of Contents
- Introduction
- Types of Quantization
- Common Data Types of the Parameters
- Quantization Techniques
- Further Reading
TL;DR This blog post introduces LLM quantization techniques, explaining how they reduce model size and improve efficiency. It covers types of quantization (PTQ and QAT), common data types for parameters, and various quantization methods (GGUF, GPTQ, AWQ, EXL2) with their pros and cons. The post emphasizes that the choice of technique depends on the specific use case.
Introduction
This is part 1 of a three-part series covering LLM[^1] quantization. Today, we'll explore the most common quantization techniques along with their pros and cons. In the upcoming posts, we'll implement these techniques in practice and, finally, understand how quantization works in depth.
Note that the quantization techniques mentioned here are the most common and up-to-date ones as of 3rd July 2024; they are subject to change in this rapidly advancing field.
LLMs, though powerful, require significant compute and memory, which makes them difficult to deploy on resource-constrained[^2] devices. Quantization refers to the process of converting a model from a higher-precision (higher-memory) format into a lower-precision one.
Think of it like compressing a file or, in the case of an LLM, rounding off the parameters[^3] (weights) of the model, which drastically reduces their size. In turn, this makes the model smaller and more power-efficient.
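To make that concrete, here's a minimal sketch of the "rounding off" idea using absmax int8 quantization on a toy weight matrix. This is purely illustrative (NumPy only); real quantizers work per-channel or per-group and are considerably more sophisticated.

```python
import numpy as np

# Pretend these are the weights of one layer, stored in FP32 (4 bytes each).
weights = np.random.randn(4, 4).astype(np.float32)

# Absmax quantization: map the largest magnitude onto the int8 limit (127).
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)  # 1 byte per value

# To use the weights again, we "dequantize" them back to floats.
dequantized = quantized.astype(np.float32) * scale

print("max rounding error:", np.abs(weights - dequantized).max())
print("size before:", weights.nbytes, "bytes | after:", quantized.nbytes, "bytes")
```

Going from 4-byte floats to 1-byte integers cuts storage by 4x, while the dequantized weights differ from the originals only by a small rounding error.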
Types of Quantization
There are primarily two types of quantization techniques: PTQ (Post-Training Quantization) and QAT (Quantization-Aware Training). PTQ applies quantization to an already trained model, while QAT integrates quantization directly into the training process.
Aspect | PTQ | QAT |
---|---|---|
When Applied | After training | During training |
Implementation | Simpler | More complex |
Accuracy | Good for minor precision reduction | Better for aggressive quantization |
Time to Apply | Quick | Longer (requires training) |
Flexibility | Can be applied to any model | Needs to be incorporated from the start |
It's worth noting that while the techniques we'll be discussing later (GGUF, GPTQ, AWQ, EXL2) are primarily PTQ methods, some of these can be adapted or combined with QAT principles for improved results.
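As a rough illustration of the difference: PTQ simply applies a quantize-dequantize step (like the one above) to a finished model, whereas QAT inserts that step into the forward pass during training so the model learns to tolerate the rounding. Below is a minimal, hedged PyTorch sketch of QAT-style "fake quantization" using a straight-through estimator; the class names (`FakeQuant`, `QATLinear`) are made up for illustration and don't correspond to any particular library's API.

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in the forward pass, identity gradient in the
    backward pass (the classic straight-through estimator used for QAT)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 127.0 + 1e-8   # absmax scale for int8
        return torch.round(w / scale) * scale  # simulate int8 rounding

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                     # treat rounding as identity

class QATLinear(nn.Module):
    """A toy linear layer whose weights "see" quantization noise during training."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w_q = FakeQuant.apply(self.weight)     # fake-quantized weights
        return x @ w_q.t()

# Train QATLinear like any other module; gradients still reach the FP weights.
layer = QATLinear(16, 8)
out = layer(torch.randn(2, 16))
out.sum().backward()
print(layer.weight.grad.shape)  # torch.Size([8, 16])
```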
Common Data Types of the Parameters
- FP32: Full precision, which stores the parameters as floating-point[^4] numbers using 32 bits[^5].
- FP16: Half precision, which stores the parameters as floating-point numbers using 16 bits.
- BF16: A drop-in replacement for FP32 that also stores the parameters as floating-point numbers using 16 bits, while keeping FP32's dynamic range (commonly used in deep learning tasks).
- Int8: Quantized, which stores the parameters as integers using 8 bits.
- Int4: Quantized, which stores the parameters as integers using 4 bits.
A lot of models store the bulk of their weights as 16-bit (2-byte) floating-point numbers, so for a model like Llama2-7B[^6]: 7,000,000,000 parameters * 2 bytes = 14,000,000,000 bytes = 14 GB of memory.
Precision | Bits | Bytes (bits / 8) | Size (bytes x 7B params) | Accuracy | Example Value |
---|---|---|---|---|---|
FP32 | 32 | 32 / 8 = 4 | 4 x 7 = 28 GB | Highest | 3.1415927 |
FP16 | 16 | 16 / 8 = 2 | 2 x 7 = 14 GB | High | 3.141 |
BF16 | 16 | 16 / 8 = 2 | 2 x 7 = 14 GB | High (FP32's range, fewer mantissa bits) | 3.140625 |
Int8 (Q8) | 8 | 8 / 8 = 1 | 1 x 7 = 7 GB | Low | 3 |
Int4 (Q4) | 4 | 4 / 8 = 0.5 | 0.5 x 7 = 3.5 GB | Lowest | 3 |
Note how the quantized versions take up less storage and memory, at the cost of accuracy.
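To double-check the table, here's a tiny back-of-the-envelope calculator for the weight memory of a 7-billion-parameter model. It counts weights only and ignores activations, the KV cache, and other runtime overhead.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "Int8": 1, "Int4": 0.5}

params = 7_000_000_000  # e.g. Llama2-7B

for precision, nbytes in BYTES_PER_PARAM.items():
    size_gb = params * nbytes / 1e9
    print(f"{precision:>5}: {size_gb:5.1f} GB")
```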
Quantization Techniques
Just as we have different image formats like PNG, JPEG, and WEBP for representing images, we have different quantization techniques. Note that no technique is strictly superior to another; the right choice depends entirely on the use case.
Method | Focus | Library | Quantization Range | Ideal Usage |
---|---|---|---|---|
GGUF (formerly GGML) | CPU inference[^7] + GPU offloading[^8] | llama.cpp | 2-bit to 8-bit | CPU-only systems, limited VRAM[^9] scenarios |
GPTQ | GPU inference | AutoGPTQ | 2, 3, 4, 8-bit | Balances performance and VRAM usage on modest GPUs |
AWQ | Model quality | AutoAWQ | 4-bit | Systems with ample VRAM |
EXL2 | Fast inference | ExLlamaV2 | Variable bit widths | Maximum speed with sufficient VRAM |
NF4 and Int8 Quantization | Ease of use | bitsandbytes | 4, 8-bit | Prototyping, limited hardware (see the sketch below) |
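As a quick taste of the "ease of use" row, here's a hedged sketch of loading a model with 4-bit NF4 quantization through bitsandbytes and Hugging Face Transformers. It assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and that you have enough GPU memory; the model ID is just an example (swap in any causal LM you have access to).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example only; may require accepting a license

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```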
llama.cpp (GGUF) is the most popular inference library thanks to its wide variety of quantization levels and its support for Metal on Apple Silicon. We'll cover setting up and running inference with llama.cpp in another blog.
Exercise for the reader
- Research GGUF, GPTQ, AWQ, EXL2, and NF4. We'll cover these in a future blog
- Play around with LLMs using LM Studio
- Try running the BF16 google/gemma-2-9b-it: GitHub Link
Further Reading
If you're interested in running quantized large language models yourself, we recommend reading our blog post on Google Colab and its alternatives.
Footnotes:
[^1]: Large Language Models, like ChatGPT.
[^2]: Edge devices like mobile phones.
[^3]: Parameters refer to the weights of the model and are directly related to its memory consumption.
[^4]: Floating points refer to decimal numbers.
[^5]: Binary format, i.e. 0s and 1s (1 byte = 8 bits).
[^6]: A model by Meta; the 7B denotes 7 billion parameters.
[^7]: Inference is the process of getting a response from the LLM.
[^8]: GPU offloading is the process of transferring large computational workloads to the GPU because they run faster there than on the CPU.
[^9]: Video RAM, a type of memory on a graphics processing unit (GPU).