Quantization formats have multiplied to the point where picking a model download feels like decoding a serial number. GGUF, GPTQ, AWQ, EXL2, Q4_K_M, Q5_K_S – none of these names tell you what you actually need to know.
This guide skips the theory and focuses on the practical decision: which format to download for your hardware and tool.
—
What quantization does to a model
LLM weights are stored as numbers. At 16-bit precision (FP16 or BF16), each weight takes 2 bytes; at FP32, it takes 4. A 7B parameter model at 16-bit precision requires around 14GB just for the weights (about 28GB at FP32), before accounting for context and computation overhead.
Quantization reduces the number of bits used to represent each weight. Q4 uses 4 bits per weight instead of 16. This cuts the 7B model down to roughly 4GB – small enough to fit in many consumer GPUs and run on a MacBook.
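The arithmetic behind those numbers is simple enough to sketch. The function name below is made up for illustration, and the estimate covers the weights only; real files come out somewhat larger because quantized formats also store scaling metadata and keep some tensors at higher precision.

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # A 7B-parameter model at several bit depths.
    for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = estimate_weight_size_gb(7e9, bits)
        print(f"{label:>10}: {size:.1f} GB")
```

For 7B parameters this gives 28.0, 14.0, 7.0, and 3.5 GB respectively, which lines up with the "roughly 4GB" figure once format overhead is added.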
The tradeoff: lower bit depth means slightly lower quality. The model loses some precision in its weights, which can show up as less coherent output on complex tasks. For most everyday use, Q4 and Q5 are hard to tell apart from full precision. For demanding tasks like long-form reasoning or complex code, the gap is more noticeable but still modest.
The three things quantization affects:
- Model size on disk and in memory – dramatically reduced
- Inference speed – faster because less data moves through memory per token
- Output quality – slightly reduced, with the penalty increasing at lower bit depths
—
GGUF, GPTQ, AWQ: what each format actually is
GGUF is the format used by llama.cpp and everything built on it – Ollama, LM Studio, Jan, and most local AI tools. It supports CPU inference, GPU offloading (running part of the model on GPU and part in system RAM), and Apple Silicon. GGUF files contain everything needed to run the model in a single file.
This is the format most people should use. If you are running Ollama or LM Studio, you are using GGUF whether you know it or not.
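If you are ever unsure whether a downloaded file really is GGUF, the file header tells you. A minimal sketch, assuming the standard little-endian layout where the file starts with the four bytes GGUF followed by a 32-bit version number; the function names are illustrative and not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"


def gguf_version_from_header(header: bytes):
    """Return the GGUF version from the first 8 bytes, or None if not GGUF."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumption: little-endian unsigned 32-bit version field after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def read_gguf_version(path: str):
    """Read just the first 8 bytes of a file and check them."""
    with open(path, "rb") as f:
        return gguf_version_from_header(f.read(8))
```

Calling read_gguf_version on a model file returns a small integer for a GGUF file and None for anything else, such as a safetensors checkpoint.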
GPTQ is a GPU-oriented quantization format, most often run on NVIDIA hardware through the transformers and auto-gptq libraries. For GPU-only use it can produce somewhat smaller files than GGUF at similar quality, though the difference depends on the model and settings. The limitation: it expects a CUDA-capable GPU and has no practical CPU fallback or Apple Silicon support. If the model does not fit entirely in GPU VRAM, GPTQ becomes awkward.
AWQ (Activation-aware Weight Quantization) is a newer GPU format that generally produces better quality than GPTQ at the same bit depth, particularly for instruction-following tasks. Supported by vLLM, text-generation-webui, and other GPU-focused inference frameworks. Same limitation as GPTQ: NVIDIA GPU required, no CPU fallback.
EXL2 is a format used with ExLlamaV2, notable for very fast inference on NVIDIA GPUs. Niche but worth knowing if you are running a high-throughput local server on a powerful GPU.
Summary:
- Running Ollama, LM Studio, or Jan on any hardware: use GGUF
- Running a GPU inference server on NVIDIA hardware: GPTQ or AWQ (prefer AWQ for quality)
- Maximum throughput on NVIDIA: ExLlamaV2 with EXL2
- CPU or Apple Silicon: GGUF only
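The summary above boils down to a few branches. Here is a rough sketch of it as a function; the name and parameters are invented for illustration, and it encodes this guide's recommendations rather than any tool's actual behavior.

```python
def choose_format(
    has_nvidia_gpu: bool,
    dedicated_gpu_server: bool = False,
    max_throughput: bool = False,
) -> str:
    """Map a setup to the format this guide recommends."""
    # CPU, Apple Silicon, or a desktop tool like Ollama or LM Studio: GGUF.
    if not has_nvidia_gpu or not dedicated_gpu_server:
        return "GGUF"
    # Dedicated NVIDIA server chasing throughput: EXL2 via ExLlamaV2.
    if max_throughput:
        return "EXL2"
    # Otherwise a GPU format; the guide prefers AWQ over GPTQ for quality.
    return "AWQ"
```

For example, choose_format(has_nvidia_gpu=False) returns "GGUF", and so does any setup that is not a dedicated GPU server.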
—
Q4 vs Q5 vs Q8: practical quality differences
Within GGUF there are many quantization levels. The naming can be confusing because Ollama and llama.cpp have evolved their own shorthand. The ones you will encounter most:
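Q4_K_M – 4-bit quantization using K-quants (a block-wise scheme that spends extra bits on the tensors most sensitive to quantization). The “M” stands for medium: the S, M, and L suffixes mark small, medium, and large mixes, where larger mixes keep more tensors at higher precision. This is the most widely used GGUF variant and the right default for most use cases. Good quality, reasonable size, runs fast.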
Q5_K_M – 5-bit K-quants. Somewhat better quality than Q4_K_M, most visible on tasks that stress reasoning and instruction following. Roughly 15-20% larger file size. Use this if you have the VRAM or RAM to spare and quality matters for your use case.
Q8_0 – 8-bit quantization. Near-indistinguishable from full precision in most tests. File size is roughly half of FP16. Use this when you want maximum quality, have the memory for it, and are willing to trade some speed for fidelity. Good for embedding models where quality consistency matters.
Q2_K and Q3_K – aggressively quantized for maximum compression. Noticeable quality degradation, especially on complex tasks. Useful only on extremely memory-constrained hardware. Avoid unless you have no other option.
IQ2 and IQ3 – newer I-quant variants at 2-3 bit depth, usually produced with an importance matrix that guides which weights to preserve. Better quality than Q2_K/Q3_K at similar sizes. Still not great for demanding tasks, but a step up when memory is the hard constraint.
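Since these tags usually appear inside the filename, a small parser can decode them. This is a sketch under assumptions: the regex covers only the common naming pattern, the family labels are informal community terms, and the example filenames are hypothetical.

```python
import re

# Matches tags such as Q4_K_M, Q8_0, IQ3_XXS, F16, BF16 (case-insensitive).
QUANT_TAG = re.compile(
    r"(?<![A-Za-z0-9])(I?Q\d(?:_[A-Z0-9]+)*|BF16|F16|F32)(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def describe_quant(filename: str):
    """Return tag, nominal bit depth, and family, or None if no tag is found."""
    match = QUANT_TAG.search(filename)
    if match is None:
        return None
    tag = match.group(1).upper()
    if tag in {"F32", "F16", "BF16"}:
        return {"tag": tag, "bits": int(tag[-2:]), "family": "unquantized"}
    bits = int(re.search(r"\d", tag).group())
    if tag.startswith("IQ"):
        family = "I-quant"
    elif "_K" in tag:
        family = "K-quant"
    else:
        family = "legacy"
    return {"tag": tag, "bits": bits, "family": family}
```

For a hypothetical file named llama-2-7b.Q4_K_M.gguf this returns the tag Q4_K_M, 4 bits, and the K-quant family. Unusual names (extra suffixes after the tag, for instance) may need a stricter pattern.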
Practical guidance:
- Default choice: Q4_K_M
- Have extra memory and care about quality: Q5_K_M
- Maximum quality, memory available: Q8_0
- Memory is tight: Q4_K_S (slightly smaller than Q4_K_M, small quality cost)
- Embedding models: Q8_0 or full precision if it fits
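To turn that guidance into numbers, you can check which level fits your memory budget. The sketch below picks the highest-quality level that fits. The bits-per-weight values are approximate figures for typical models, and the 1.5GB overhead allowance for context and runtime is a placeholder assumption, so treat the output as a starting point rather than a guarantee.

```python
# Approximate effective bits per weight, ordered from highest quality down.
# Assumption: rough typical values; actual files vary by model architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q4_K_S": 4.6,
}


def pick_quant(n_params: float, memory_budget_gb: float, overhead_gb: float = 1.5):
    """Return the highest-quality level whose weights plus overhead fit."""
    for tag, bits in APPROX_BITS_PER_WEIGHT.items():
        size_gb = n_params * bits / 8 / 1e9
        if size_gb + overhead_gb <= memory_budget_gb:
            return tag
    return None  # nothing in the table fits; consider a smaller model
```

With these assumptions, a 7B model gets Q8_0 on a 12GB budget, Q5_K_M on 8GB, and Q4_K_M on 6GB.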
—
Tool compatibility narrows your choices quickly
Before worrying about quality tradeoffs, check what your tool supports. This filters the decision faster than anything else.
Ollama: GGUF only. The ollama pull command downloads the default tag the maintainer chose, usually a Q4 variant such as Q4_K_M. Many models in the Ollama library also publish other quantizations as separate tags, so check the model's tag list first. If the quantization you want is not listed, you can import a GGUF file manually.
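A manual import is short: a Modelfile with a FROM line pointing at the GGUF file, then ollama create. The helper below only builds the Modelfile text and the commands as strings, without running anything; the helper names, model name, and file path are placeholders for illustration.

```python
import shlex


def modelfile_for_gguf(gguf_path: str) -> str:
    """Minimal Modelfile contents that point Ollama at a local GGUF file."""
    return f"FROM {gguf_path}\n"


def ollama_import_commands(model_name: str, modelfile_path: str = "Modelfile"):
    """Shell commands to register the model with Ollama and then run it."""
    return [
        shlex.join(["ollama", "create", model_name, "-f", modelfile_path]),
        shlex.join(["ollama", "run", model_name]),
    ]


if __name__ == "__main__":
    # Hypothetical path and model name.
    print(modelfile_for_gguf("./my-model.Q5_K_M.gguf"), end="")
    for command in ollama_import_commands("my-model-q5"):
        print(command)
```

Save the first line of output as a file named Modelfile next to the GGUF, then run the two printed commands in a terminal.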
LM Studio: GGUF only. The model browser lets you select from available quantization levels per model. Easiest way to experiment with different quant levels without command-line work.
text-generation-webui: Supports GGUF, GPTQ, AWQ, and EXL2 depending on which backend you enable. The most flexible option if you want to compare formats.
vLLM: Primarily AWQ and GPTQ for quantized inference. Designed for high-throughput server deployments, not personal use.
Jan: GGUF only, similar to Ollama in format support.
Tabby (coding assistant server): GGUF with llama.cpp backend, also supports GPTQ on NVIDIA.
If you are not sure which format to pick, ask: what tool am I using? The answer is almost always GGUF.
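The compatibility notes above fit in a small lookup table. This sketch simply transcribes the list from this section; it is not pulled from any tool's documentation, and support changes between releases, so verify against the version you run.

```python
# Formats each tool accepts, as described in this section.
TOOL_FORMATS = {
    "ollama": {"GGUF"},
    "lm studio": {"GGUF"},
    "jan": {"GGUF"},
    "text-generation-webui": {"GGUF", "GPTQ", "AWQ", "EXL2"},
    "vllm": {"AWQ", "GPTQ"},
    "tabby": {"GGUF", "GPTQ"},
}


def supported_formats(tool: str) -> set:
    """Formats a tool can load; empty set if the tool is not in the table."""
    return set(TOOL_FORMATS.get(tool.strip().lower(), set()))


def common_formats(*tools: str) -> set:
    """Formats that every listed tool can load."""
    if not tools:
        return set()
    return set.intersection(*(supported_formats(tool) for tool in tools))
```

For instance, common_formats("Ollama", "text-generation-webui") gives {"GGUF"}, while Ollama and vLLM share no format at all.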
—
Quick decision guide
For the vast majority of local AI setups:
Use GGUF. Unless you are running a dedicated NVIDIA GPU server with specific throughput requirements, GGUF is correct.
Start with Q4_K_M. It is the community default for a reason – good balance across quality, size, and speed.
Try Q5_K_M if quality matters. If you are doing tasks where you notice the model losing coherence or making more mistakes, stepping up to Q5 often helps and the size increase is manageable.
Use Q8_0 for embedding models. Retrieval depends on stable similarity scores, so embeddings benefit more from extra precision than text generation does.
Ignore GPTQ and AWQ unless you are specifically building a GPU inference server on NVIDIA hardware.
What quantization level are you running, and have you noticed quality differences between Q4 and Q5 on specific task types? Curious whether others have found particular models where the quant level makes a bigger than expected difference.