GPU vs CPU for local AI: when the hardware gap actually matters

Tomas

A GPU is faster than a CPU for local AI. That part is not in dispute. What matters is by how much, for which models, and whether the difference is large enough to affect your actual experience.

For some setups the gap is the difference between usable and unusable. For others it is the difference between fast and very fast, which may not be worth the cost of upgrading.

—

Token generation speed: real numbers

These are approximate real-world speeds using Ollama on common hardware in 2026, with Q4 quantization unless noted.

Llama 3.2 3B (Q4):

CPU (modern 8-core, 32GB RAM): 18-28 tokens/sec
RTX 4060 (8GB VRAM): 80-100 tokens/sec
RTX 4090 (24GB VRAM): 150-180 tokens/sec
M3 Pro 18GB: 55-70 tokens/sec

Llama 3.1 8B (Q4):

CPU (modern 8-core, 32GB RAM): 8-14 tokens/sec
RTX 4060 (8GB VRAM): 55-70 tokens/sec
RTX 4090 (24GB VRAM): 100-120 tokens/sec
M3 Pro 18GB: 40-55 tokens/sec

Mistral 7B (Q4):

CPU (modern 8-core, 32GB RAM): 10-15 tokens/sec
RTX 4060 (8GB VRAM): 60-75 tokens/sec
RTX 4090 (24GB VRAM): 110-130 tokens/sec
M3 Max 36GB: 70-90 tokens/sec

Llama 3.1 70B (Q4) – viable at interactive speeds only with a GPU or Apple Silicon:

CPU only: 1-3 tokens/sec (unusable for chat)
RTX 4090 (24GB VRAM, partial offload): 15-25 tokens/sec
Dual RTX 3090 (48GB VRAM): 30-45 tokens/sec
M3 Max 128GB: 20-35 tokens/sec

The CPU numbers are not a joke. A modern multi-core CPU running a small quantized model produces perfectly readable output. At 10-15 tokens/sec, a 200-token response arrives in 13-20 seconds. Slow for conversation, but usable for many tasks.
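The arithmetic behind that estimate is just tokens divided by generation rate. A minimal sketch follows; the function name is made up for illustration, and it ignores prompt processing time (the wait before the first token), which adds to the total on long prompts.

```python
def response_time_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to stream `tokens` tokens at a steady generation rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tokens / tokens_per_sec


# A 200-token response at the CPU speeds quoted above for 7-8B models.
for rate in (10, 15):
    print(f"{rate} tokens/sec -> {response_time_seconds(200, rate):.1f} s")
```

Plug in your own measured rate and a typical response length to see where you land.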

—

The crossover point: where CPU becomes impractical

The crossover is not at a specific model size. It depends on what you are doing with the model.

For interactive chat (back-and-forth conversation), the minimum comfortable speed is around 15-20 tokens/sec. Below that, reading the response as it streams becomes annoying and the rhythm of conversation breaks. By the numbers above, CPU is already borderline at 7-8B (8-15 tokens/sec) and falls well short for models above roughly 13B parameters on most hardware.

For coding assistance (where you submit a prompt and wait), 8-10 tokens/sec is tolerable. You submit the task, do something else briefly, come back to the output. CPU stays viable up to around 20-30B with this workflow.

For batch processing (summarizing a folder of documents, generating embeddings, processing a queue overnight), tokens per second barely matters. What matters is total throughput over hours. CPU is perfectly adequate for batch jobs – you set it running and wait.

The pattern: interactive use is the first to need a GPU. Everything else is more forgiving.
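These thresholds can be written down as a small lookup. This is a sketch, not a rule: the values are the lower ends of the ranges given above, and the names `MIN_TOKENS_PER_SEC` and `fast_enough` are invented for illustration.

```python
# Minimum tolerable generation speed per use case, taken from the lower
# bounds in the text. Batch is 0 because per-request speed is not the
# constraint there; total throughput over hours is.
MIN_TOKENS_PER_SEC = {
    "chat": 15.0,
    "coding": 8.0,
    "batch": 0.0,
}


def fast_enough(use_case: str, tokens_per_sec: float) -> bool:
    """True if a measured speed meets the threshold for the use case."""
    return tokens_per_sec >= MIN_TOKENS_PER_SEC[use_case]


# 12 tokens/sec: too slow for chat, fine for submit-and-wait coding help.
print(fast_enough("chat", 12), fast_enough("coding", 12))
```

The same machine can pass for one workflow and fail for another, which is why the crossover depends on the task and not on model size alone.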

—

Apple Silicon: the special case

M-series Macs do not fit neatly into the GPU vs CPU framing because the architecture is different.

Apple Silicon uses unified memory shared between CPU and GPU. This means a model that is too large to fit in a discrete GPU’s VRAM can still run entirely in fast unified memory on an M-series chip. A 36GB M3 Max can run a 32B parameter model at Q4 with no VRAM bottleneck because the 36GB is available to both compute units simultaneously.

Practical result: M-series Macs punch significantly above their GPU-equivalent spec for local AI. An M3 Max with 36GB unified memory runs models that would otherwise require a high-end discrete GPU setup, and does so with comparable or better throughput, at lower power consumption and without the multi-GPU complexity.

Where M-series loses: raw token generation speed on models that fit comfortably in VRAM. An RTX 4090 running a 7B model at Q4 will outrun an M3 Max on that specific task. The M-series advantage appears at larger model sizes where the discrete GPU runs out of VRAM and has to partially offload to system RAM, which is much slower.
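Whether a model fits in VRAM or unified memory can be estimated before downloading anything. The sketch below rests on two assumptions that are mine, not measured: an effective 4.5 bits per weight for Q4 (quantized formats store scaling data alongside the 4-bit weights), and a flat 2GB allowance for the KV cache and runtime buffers. Real usage varies with context length and quantization variant, and the OS and other apps take a share of unified memory.

```python
def estimated_model_gb(
    params_billions: float,
    bits_per_weight: float = 4.5,  # assumed effective size of a Q4 weight
    overhead_gb: float = 2.0,      # assumed KV cache + runtime allowance
) -> float:
    """Rough memory footprint of a quantized model, in GB."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billions: float, memory_gb: float) -> bool:
    """True if the estimated footprint fits in the given memory pool."""
    return estimated_model_gb(params_billions) <= memory_gb


print(estimated_model_gb(32))  # 32B at Q4 under these assumptions
print(fits(32, 36))            # 36GB unified memory
print(fits(70, 24))            # 24GB VRAM: needs partial offload
```

Under these assumptions a 32B model needs about 20GB, which fits in 36GB of unified memory, while a 70B model at roughly 41GB does not fit in a single 24GB card.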

Recommendation by use case:

Primarily running 7-13B models: a discrete GPU with sufficient VRAM is faster
Running 30B+ models locally: M3 Max or M2 Ultra with 32GB+ unified memory is often the better option
Mixed use across model sizes: M-series handles the range more gracefully

—

When GPU matters less than you expect

Four situations where the GPU upgrade does not change much:

Small models for simple tasks. A 3B model for classification, tagging, or short extraction runs fast enough on CPU that adding a GPU cuts response time from 4 seconds to 1.5 seconds. That is not nothing, but it is not transformative either.

Batch jobs running overnight. If you are processing 500 documents and you have 8 hours, total throughput matters more than per-request speed. CPU handles this fine.

Background agents with long intervals. An agent that checks something every 30 minutes and generates a short summary does not need fast inference. The bottleneck is the schedule, not the hardware.

Embedding generation. Embedding models are smaller than generation models. nomic-embed-text runs at adequate speed on CPU for indexing up to tens of thousands of documents. You do not need a GPU for the embedding part of a RAG pipeline unless the corpus is very large.
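The overnight batch case above comes down to a token budget. A rough sketch, with a made-up function name: it counts only generated tokens and ignores prompt processing time, which can be a large share of the total when the inputs are long documents, so treat the result as an upper bound.

```python
def tokens_per_document_budget(
    documents: int, hours: float, tokens_per_sec: float
) -> float:
    """Generated tokens available per document within the time window."""
    total_tokens = hours * 3600 * tokens_per_sec
    return total_tokens / documents


# 500 documents, 8 hours, CPU generating at 10 tokens/sec.
print(tokens_per_document_budget(500, 8, 10))  # 576.0
```

At 10 tokens/sec that leaves room for a summary of several hundred tokens per document, which is why per-request speed matters so little here.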

The real GPU value case: interactive chat and coding assistance with 7B+ models, where latency directly affects how natural the interaction feels. That is where the hardware difference changes the experience.

—

Practical upgrade decision

Before buying hardware, answer these questions:

What model sizes do you actually want to run? Models up to 7B run acceptably on CPU. Above 13B, GPU or Apple Silicon becomes important for interactive use.
What is your primary use case? Batch processing and non-interactive workflows tolerate CPU speeds. Conversational use does not.
What VRAM does the GPU you are considering have? The GPU speed advantage disappears if the model does not fit in VRAM and has to offload to RAM. An RTX 4060 with 8GB runs 7B models well but struggles with 13B. An RTX 4090 with 24GB handles up to 30B comfortably.
Are you on Apple Silicon already? If you have an M3 Pro with 18GB or better, you likely do not need a discrete GPU upgrade – the unified memory architecture handles mid-size models well.

The honest answer for most people: if you are running 7B models for chat or coding, a mid-range GPU (RTX 4060 Ti 16GB or similar) makes the experience noticeably better and costs less than you might expect. If you are running batch jobs or small models, the CPU you already have is probably sufficient.
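Before deciding, it helps to measure what your current machine does. The sketch below assumes a local Ollama server on its default port and uses the `eval_count` and `eval_duration` (nanoseconds) fields that its `/api/generate` endpoint returns with a non-streaming request. The helper names and the example model tag are illustrative.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is
    the time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Example (needs a running Ollama server with the model already pulled):
# print(measure("llama3.1:8b", "Explain unified memory in two sentences."))
```

Run it a few times and discard the first result, since the first call includes loading the model into memory.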

What hardware are you running locally and which models? Curious whether others have found the CPU more capable than expected for specific tasks.