Size Matters: LLM Quantization Techniques (Part 1)
Hello World!
Table of Contents
- Introduction
- Types of Quantization
- Common Data Types of the Parameters
- Quantization Techniques
- Further Reading
TL;DR This blog post introduces LLM quantization techniques, explaining how they reduce model size and improve efficiency. It covers types of quantization (PTQ and QAT), common data types for parameters, and various quantization methods (GGUF, GPTQ, AWQ, EXL2) with their pros and cons. The post emphasizes that the choice of technique depends on the specific use case.
Introduction
This is part 1 of a three-part series covering LLM[^1] quantization. Today, we'll explore the most common quantization techniques along with their pros and cons. In the upcoming posts, we'll implement these techniques in practice and, finally, understand how quantization works in depth.
Note that the quantization techniques mentioned here are the most common and up-to-date ones as of 3rd July 2024; they are subject to change in this rapidly advancing field.
LLMs, though powerful, require significant compute and memory, which makes them difficult to deploy on resource-constrained[^2] devices. Quantization refers to the process of converting a model from a higher-precision (higher-memory) format into a lower-precision one.
Think of it like compressing a file or, in the case of an LLM, rounding off the parameters[^3] (weights) of the model, which drastically reduces their size. In turn, this makes the model smaller and more power-efficient.
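To make that concrete, here's a minimal sketch of the "rounding off" idea using absmax int8 quantization on a toy weight matrix. This is purely illustrative (NumPy only); real quantizers work per-channel or per-group and are considerably more sophisticated.

```python
import numpy as np

# Pretend these are the weights of one layer, stored in FP32 (4 bytes each).
weights = np.random.randn(4, 4).astype(np.float32)

# Absmax quantization: map the largest magnitude onto the int8 limit (127).
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)  # 1 byte per value

# To use the weights again, we "dequantize" them back to floats.
dequantized = quantized.astype(np.float32) * scale

print("max rounding error:", np.abs(weights - dequantized).max())
print("size before:", weights.nbytes, "bytes | after:", quantized.nbytes, "bytes")
```

Going from 4-byte floats to 1-byte integers cuts storage by 4x, while the dequantized weights differ from the originals only by a small rounding error.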
Types of Quantization
There are primarily two types of quantization techniques: PTQ (Post-Training Quantization) and QAT (Quantization-Aware Training). PTQ applies quantization to an already trained model, while QAT integrates quantization directly into the training process.
Aspect | PTQ | QAT |
---|---|---|
When Applied | After training | During training |
Implementation | Simpler | More complex |
Accuracy | Good for minor precision reduction | Better for aggressive quantization |
Time to Apply | Quick | Longer (requires training) |
Flexibility | Can be applied to any model | Needs to be incorporated from the start |
It's worth noting that while the techniques we'll be discussing later (GGUF, GPTQ, AWQ, EXL2) are primarily PTQ methods, some of these can be adapted or combined with QAT principles for improved results.
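As a rough illustration of the difference: PTQ simply applies a quantize-dequantize step (like the one above) to a finished model, whereas QAT inserts that step into the forward pass during training so the model learns to tolerate the rounding. Below is a minimal, hedged PyTorch sketch of QAT-style "fake quantization" using a straight-through estimator; the class names (`FakeQuant`, `QATLinear`) are made up for illustration and don't correspond to any particular library's API.

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in the forward pass, identity gradient in the
    backward pass (the classic straight-through estimator used for QAT)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 127.0 + 1e-8   # absmax scale for int8
        return torch.round(w / scale) * scale  # simulate int8 rounding

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                     # treat rounding as identity

class QATLinear(nn.Module):
    """A toy linear layer whose weights "see" quantization noise during training."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w_q = FakeQuant.apply(self.weight)     # fake-quantized weights
        return x @ w_q.t()

# Train QATLinear like any other module; gradients still reach the FP weights.
layer = QATLinear(16, 8)
out = layer(torch.randn(2, 16))
out.sum().backward()
print(layer.weight.grad.shape)  # torch.Size([8, 16])
```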
Common Data Types of the Parameters
- FP32: Full precision, which stores the parameters as floating-point[^4] numbers using 32 bits[^5].
- FP16: Half precision, which stores the parameters as floating-point numbers using 16 bits.
- BF16: A drop-in replacement for FP32 that also stores the parameters as floating-point numbers using 16 bits, while keeping FP32's dynamic range (commonly used in deep learning tasks).
- Int8: Quantized, which stores the parameters as integers using 8 bits.
- Int4: Quantized, which stores the parameters as integers using 4 bits.
A lot of models store the bulk of their weights as 16-bit (2-byte) floating-point numbers, so for a model like Llama2-7B[^6]: 7,000,000,000 parameters * 2 bytes = 14,000,000,000 bytes = 14 GB of memory.
Precision | Bits | Bytes (bits / 8) | Size (bytes x 7B params) | Accuracy | Example Value |
---|---|---|---|---|---|
FP32 | 32 | 32 / 8 = 4 | 4 x 7 = 28 GB | Highest | 3.1415927 |
FP16 | 16 | 16 / 8 = 2 | 2 x 7 = 14 GB | High | 3.141 |
BF16 | 16 | 16 / 8 = 2 | 2 x 7 = 14 GB | High (FP32's range, fewer mantissa bits) | 3.140625 |
Int8 (Q8) | 8 | 8 / 8 = 1 | 1 x 7 = 7 GB | Low | 3 |
Int4 (Q4) | 4 | 4 / 8 = 0.5 | 0.5 x 7 = 3.5 GB | Lowest | 3 |
Note how the quantized versions take up less storage and memory, at the cost of accuracy.
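To double-check the table, here's a tiny back-of-the-envelope calculator for the weight memory of a 7-billion-parameter model. It counts weights only and ignores activations, the KV cache, and other runtime overhead.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "Int8": 1, "Int4": 0.5}

params = 7_000_000_000  # e.g. Llama2-7B

for precision, nbytes in BYTES_PER_PARAM.items():
    size_gb = params * nbytes / 1e9
    print(f"{precision:>5}: {size_gb:5.1f} GB")
```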
Quantization Techniques
Just as we have different image formats like PNG, JPEG, and WEBP for representing images, we have different quantization techniques. Note that no technique is strictly superior to another; the right choice depends entirely on the use case.
Method | Focus | Library | Quantization Range | Ideal Usage |
---|---|---|---|---|
GGUF (formerly GGML) | CPU inference[^7] + GPU offloading[^8] | llama.cpp | 2-bit to 8-bit | CPU-only systems, limited VRAM[^9] scenarios |
GPTQ | GPU inference | AutoGPTQ | 2, 3, 4, 8-bit | Balances performance and VRAM usage on modest GPUs |
AWQ | Model quality | AutoAWQ | 4-bit | Systems with ample VRAM |
EXL2 | Fast inference | ExLlamaV2 | Variable bit widths | Maximum speed with sufficient VRAM |
NF4 and Int8 Quantization | Ease of use | bitsandbytes | 4, 8-bit | Prototyping, limited hardware (see the sketch below) |
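As a quick taste of the "ease of use" row, here's a hedged sketch of loading a model with 4-bit NF4 quantization through bitsandbytes and Hugging Face Transformers. It assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and that you have enough GPU memory; the model ID is just an example (swap in any causal LM you have access to).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example only; may require accepting a license

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```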
llama.cpp (GGUF) is the most popular inference library thanks to its wide variety of quantization levels and its support for Metal on Apple Silicon. We'll cover setting up and running inference with llama.cpp in another blog.
Exercise for the reader
- Research GGUF, GPTQ, AWQ, EXL2, and NF4. We'll cover these in a future blog
- Play around with LLMs using LM Studio
- Try running the BF16 google/gemma-2-9b-it: GitHub Link
Further Reading
If you're interested in running quantized large language models yourself, we recommend reading our blog post on Google Colab and its alternatives.
Footnotes:
[^1]: Large Language Models, like ChatGPT.
[^2]: Edge devices like mobile phones.
[^3]: Parameters refer to the weights of the model and are directly related to its memory consumption.
[^4]: Floating points refer to decimal numbers.
[^5]: Binary format, i.e. 0s and 1s (1 byte = 8 bits).
[^6]: A model by Meta; the 7B denotes 7 billion parameters.
[^7]: Inference is the process of getting a response from the LLM.
[^8]: GPU offloading is the process of transferring large computational workloads to the GPU because they run faster there than on the CPU.
[^9]: Video RAM, a type of memory on a graphics processing unit (GPU).