Noumaan

Size Matters: LLM Quantization Techniques (Part 1)

Hello World!

Table of Contents

  1. Introduction
  2. Types of Quantization
  3. Common Data Types of the Parameters
  4. Quantization Techniques
  5. Further Reading

TL;DR This blog post introduces LLM quantization techniques, explaining how they reduce model size and improve efficiency. It covers types of quantization (PTQ and QAT), common data types for parameters, and various quantization methods (GGUF, GPTQ, AWQ, EXL2) with their pros and cons. The post emphasizes that the choice of technique depends on the specific use case.

Introduction

This is part 1 of a three-part series covering LLM[1] quantization. Today, we'll explore the most common quantization techniques along with their pros and cons. In the upcoming posts, we'll implement these techniques in practice and, finally, understand how quantization works in depth.

Note that the quantization techniques mentioned here are the most common and up-to-date ones as of 3rd July 2024; in this rapidly advancing field, they are subject to change.

LLMs, though powerful, require a significant amount of compute and memory, which makes them difficult to deploy on resource-constrained[2] devices. Quantization is the process of converting a model from a higher-precision number format into a lower-precision one, reducing its memory footprint.

Think of it like compressing a file or, in the case of an LLM, rounding off the model's parameters[3] (weights), which drastically reduces their size. In turn, this makes the model smaller and more power-efficient.
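To make the "rounding off" idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain NumPy. This is a toy illustration of the general principle, not the exact scheme any particular library uses: each FP32 weight is divided by a scale, rounded to an 8-bit integer for storage, and multiplied back by the scale at inference time.

```python
import numpy as np

# Toy FP32 weights, as they might appear in one layer of a trained model
weights = np.array([0.8721, -1.3342, 0.0015, 2.4517], dtype=np.float32)

# Symmetric int8 quantization: map the largest magnitude to 127
scale = np.abs(weights).max() / 127
q_weights = np.round(weights / scale).astype(np.int8)  # 1 byte each instead of 4

# Dequantize to approximate the original values when running the model
deq_weights = q_weights.astype(np.float32) * scale

print(q_weights)    # roughly [ 45 -69   0 127]
print(deq_weights)  # close to the originals, but with a small rounding error
```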


Types of Quantization

There are primarily two types of quantization: PTQ (Post-Training Quantization) and QAT (Quantization-Aware Training). PTQ applies quantization to an already trained model, while QAT integrates quantization directly into the training process.

| Aspect | PTQ | QAT |
| --- | --- | --- |
| When Applied | After training | During training |
| Implementation | Simpler | More complex |
| Accuracy | Good for minor precision reduction | Better for aggressive quantization |
| Time to Apply | Quick | Longer (requires training) |
| Flexibility | Can be applied to any model | Needs to be incorporated from the start |

It's worth noting that while the techniques we'll be discussing later (GGUF, GPTQ, AWQ, EXL2) are primarily PTQ methods, some of these can be adapted or combined with QAT principles for improved results.
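As a rough, library-agnostic illustration of the difference, the NumPy sketch below (the helper names `fake_quantize` and `qat_forward` are made up for this example) quantizes a trained weight matrix once for PTQ, and applies the same rounding inside the forward pass for QAT so that training can compensate for the error.

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Round weights onto the integer grid that `bits` bits can represent."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

trained_weights = np.random.randn(4, 4).astype(np.float32)

# PTQ: quantize the already-trained weights once, after training has finished
ptq_weights = fake_quantize(trained_weights, bits=4)

# QAT: apply the same rounding during the forward pass while training,
# so the loss "sees" the quantization error and the optimizer can adapt to it
def qat_forward(x, w, bits=4):
    return x @ fake_quantize(w, bits=bits)
```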

Common Data Types of the Parameters

  1. FP32: Full precision, which stores the parameters as floating-point[4] numbers in 32 bits[5].
  2. FP16: Half precision, which stores the parameters as floating-point numbers in 16 bits.
  3. BF16: A drop-in replacement for FP32 that also stores the parameters as floating-point numbers in 16 bits (most commonly used in deep learning tasks).
  4. Int8: Quantized, which stores the parameters as integers in 8 bits.
  5. Int4: Quantized, which stores the parameters as integers in 4 bits.

A lot of models use 16-bit (2-byte) floating-point numbers for the bulk of their weights, so for a model like Llama2-7B[6]: 7,000,000,000 parameters × 2 bytes = 14,000,000,000 bytes = 14 gigabytes of memory.
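That back-of-the-envelope arithmetic is easy to wrap in a small helper (a hypothetical function written just for this post):

```python
def model_size_gb(num_parameters: int, bits_per_parameter: int) -> float:
    """Approximate weight memory: parameters x bytes per parameter."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9  # using 1 GB = 10^9 bytes, as in the text

print(model_size_gb(7_000_000_000, 16))  # FP16 Llama2-7B -> 14.0 GB
print(model_size_gb(7_000_000_000, 4))   # Int4          ->  3.5 GB
```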

| Precision | Bits | Bytes (bits / 8) | Size (bytes × 7B parameters) | Accuracy | Example Value |
| --- | --- | --- | --- | --- | --- |
| FP32 | 32 | 32 / 8 = 4 | 4 × 7 = 28 GB | Highest | 3.1415927 |
| FP16 | 16 | 16 / 8 = 2 | 2 × 7 = 14 GB | Moderate | 3.141 |
| BF16 | 16 | 16 / 8 = 2 | 2 × 7 = 14 GB | High | 3.140625 |
| Int8 (Q8) | 8 | 8 / 8 = 1 | 1 × 7 = 7 GB | Low | 3 |
| Int4 (Q4) | 4 | 4 / 8 = 0.5 | 0.5 × 7 = 3.5 GB | Lowest | 3 |

Note how the quantized versions require less memory at the cost of accuracy.
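If you want to see how the floating-point formats round the same number, a quick PyTorch snippet (assuming torch is installed) casts π into each dtype and prints the value that is actually stored; the integer formats simply drop the fractional part.

```python
import math
import torch

pi = math.pi  # 3.141592653589793 as a regular Python float

# Casting to each dtype and reading it back shows what that format can hold
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    stored = torch.tensor(pi, dtype=dtype)
    print(dtype, stored.item())

# Integer storage keeps only the whole-number part
print(int(pi))  # 3
```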

Quantization Techniques

Just as we have different image formats like PNG, JPEG, and WEBP for representing images, we have different quantization techniques. Note that no technique is universally superior to another; the right choice depends entirely on the use case.

| Method | Focus | Implementing Library | Quantization Range | Ideal Usage |
| --- | --- | --- | --- | --- |
| GGUF (formerly GGML) | CPU inference[7] + GPU offloading[8] | llama.cpp | 2-bit to 8-bit | CPU-only systems, limited-VRAM[9] scenarios |
| GPTQ | GPU inference | AutoGPTQ | 2, 3, 4, 8-bit | Balances performance and VRAM usage on modest GPUs |
| AWQ | Model quality | AutoAWQ | 4-bit | Suitable for systems with ample VRAM |
| EXL2 | Fast inference | ExLlamaV2 | Varying bit rates | Maximum speed with sufficient VRAM |
| NF4 and Int8 Quantization | Ease of use | bitsandbytes | 4, 8-bit | Prototyping, limited hardware |
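The bitsandbytes row is the quickest to try from Python. Below is a hedged sketch of loading a model in 4-bit NF4 through the transformers integration; the model id is a placeholder, and the exact arguments can differ between library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM on the Hub

# 4-bit NF4 quantization handled by bitsandbytes under the hood
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the layers on whatever GPU(s) are available
)

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```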

llama.cpp (GGUF) is the most popular inference library thanks to its wide variety of quantization types and its support for Metal on Apple Silicon. We'll cover setting up and running inference with llama.cpp in another post.
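Until that post is out, here is a minimal sketch using the llama-cpp-python bindings; the GGUF file path is a placeholder for a model you have already downloaded, and the parameters are only illustrative.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/gemma-2-9b-it-Q4_K_M.gguf",  # placeholder GGUF file
    n_ctx=2048,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```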

Exercise for the reader

  1. Research GGUF, GPTQ, AWQ, EXL2, and NF4. We'll cover these in a future post
  2. Play around with LLMs using LM Studio
  3. Try running the BF16 google/gemma-2-9b-it: GitHub Link

Further Reading

If you're interested in running quantized large language models yourself, we recommend reading our blog post on Google Colab and its alternatives.


Footnotes:

  1. Large Language Models like ChatGPT

  2. Edge devices like mobile phones

  3. Parameters refer to the weights of the model and are directly related to its memory consumption.

  4. Floating points refer to decimal numbers

  5. Binary format, i.e., 0s and 1s (1 byte = 8 bits)

  6. A model by Meta; the 7B denotes its 7 billion parameters

  7. Inference is the process of getting a response from the LLM.

  8. GPU offloading is the process of transferring large computational workloads to a GPU because they run faster there than on the CPU.

  9. Video RAM is a type of memory in a graphics processing unit (GPU)

#guide #llms