Running AI Models Locally with Ollama: A Practical Getting-Started Guide

Tomas

Cloud AI is convenient. Local AI is yours. No data leaving your machine, no subscription tiers, no rate limits, no downtime. If you haven’t tried running models locally yet, 2026 is the best time to start — the tooling has matured significantly.

This is a practical guide to getting up and running with Ollama, the easiest on-ramp to local models.

—

Why Local AI?

Privacy — your prompts, documents, and conversations stay on your hardware
Cost — no per-token billing once you’ve got the hardware
Offline use — works without internet
Customisation — fine-tune and modify models without platform restrictions
Speed — on modern hardware, inference can be surprisingly fast

—

What You’ll Need

Minimum (usable):

A modern CPU, 8GB RAM
You can run 7B parameter models at reasonable speed

Better:

A GPU with 8GB+ VRAM (NVIDIA or Apple Silicon)
Models around 13B fit at 4-bit quantisation; larger ones (up to roughly 34B) run only with some layers offloaded to the CPU, at reduced speed

Best:

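24GB+ VRAM, or Apple Silicon M3/M4 with plenty of unified memory
Models around 30B fit on a single 24GB card; 70B models need roughly 40GB at 4-bit quantisation, so expect CPU offload on a 24GB card or plan for a Mac with 64GB+ of unified memory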
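
As a rough way to sanity-check these tiers, the memory needed just for a model's weights is about parameters times bits per weight, divided by 8. The sketch below ignores the context cache and runtime overhead, so treat the numbers as a lower bound. `estimate_gb` is a hypothetical helper for illustration, not part of Ollama.

```shell
# Rough weight size in GB: parameters (in billions) x bits per weight / 8.
# This is a lower bound; real usage adds context cache and runtime overhead.
estimate_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_gb 7 4     # 7B model at 4-bit
estimate_gb 70 4    # 70B model at 4-bit
estimate_gb 70 16   # 70B model at 16-bit (unquantised)
```

Comparing the result with your available RAM or VRAM gives a quick first answer to whether a model is worth pulling at all.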

—

Step 1: Install Ollama

Ollama is available for macOS, Linux, and Windows.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com

Once installed, the Ollama daemon runs in the background and exposes a local API at http://localhost:11434.
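
A quick way to confirm the daemon is listening is to request its version endpoint. This is a minimal sketch: `ollama_up` is a hypothetical helper name, and it assumes `curl` is installed and the default port has not been changed.

```shell
# Return success if an Ollama server answers at the given base URL.
# Defaults to the local address mentioned above.
ollama_up() {
  curl -fsS --max-time 2 "${1:-http://localhost:11434}/api/version" >/dev/null 2>&1
}

if ollama_up; then
  echo "Ollama is reachable"
else
  echo "Ollama is not reachable; is the daemon running?"
fi
```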

—

Step 2: Pull a Model

# A solid general-purpose 3B model, fast on most hardware
ollama pull llama3.2

# Excellent for coding
ollama pull qwen2.5-coder:7b

# 12B model with a long context window, if you have the VRAM
ollama pull mistral-nemo

# For vision/multimodal tasks
ollama pull llava

The first pull downloads the model weights (roughly 2–8GB each for the models above). After that, they’re cached locally.
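
Because pulls are cached, a setup script can skip models that are already present. The sketch below parses the output of `ollama list`, assuming its first row is a header and its first column is the model name with an optional `:tag` suffix. `has_model` and the `OLLAMA_CMD` override are hypothetical names introduced for illustration.

```shell
# Succeed if the named model appears in `ollama list`.
# Matches either the full name (qwen2.5-coder:7b) or the name without its tag (llama3.2).
has_model() {
  "${OLLAMA_CMD:-ollama}" list | awk -v want="$1" '
    NR > 1 {
      split($1, parts, ":")
      if ($1 == want || parts[1] == want) found = 1
    }
    END { exit !found }'
}
```

A typical use would be `has_model llama3.2 || ollama pull llama3.2`.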

—

Step 3: Chat

ollama run llama3.2

That’s it. You’re now running inference locally.
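
`ollama run` also accepts the prompt as an argument, which makes one-off questions scriptable instead of interactive. A small wrapper, as a sketch: `ask`, `LOCAL_MODEL`, and `OLLAMA_CMD` are hypothetical names for this example, not Ollama settings.

```shell
# Send a single prompt to a local model and print the reply.
# LOCAL_MODEL picks the model (default: llama3.2).
ask() {
  "${OLLAMA_CMD:-ollama}" run "${LOCAL_MODEL:-llama3.2}" "$1"
}
```

For example: `ask "Explain what a quantised model is in two sentences."`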

—

Step 4: Use the API

Ollama exposes an OpenAI-compatible API, which means most tools that work with OpenAI will work with Ollama by changing one URL:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

This means you can point OpenClaw, LangChain, or any OpenAI-SDK-compatible tool at Ollama with minimal config changes.
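
For tools built on the official OpenAI SDKs, the switch can often be made through environment variables rather than code. A minimal sketch, assuming the tool reads `OPENAI_BASE_URL` and `OPENAI_API_KEY` (the official Python and Node SDKs do; other tools may have their own setting for the base URL). The key value here is a placeholder: the SDK insists on one being set, but Ollama does not check it.

```shell
# Point OpenAI-SDK-based tools at the local Ollama server.
export OPENAI_BASE_URL="http://localhost:11434/v1"
# Placeholder value; required by the SDK, ignored by Ollama.
export OPENAI_API_KEY="ollama"
```

Remember to also change the model name in the tool's config to one you have pulled, such as `llama3.2`.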

—

Recommended Starting Models (Feb 2026)

Use Case	Model	Size
General chat	Llama 3.2	3B
Coding	Qwen 2.5 Coder	7B
Long context	Mistral Nemo	12B
Fast/lightweight	Gemma 3	4B
Vision	LLaVA	7B

—

Useful Tools Built on Ollama

Open WebUI — a polished web UI for Ollama (self-hosted ChatGPT-style)
Obsidian + Ollama plugins — AI writing assistance inside your notes
OpenClaw + Ollama — run your personal AI agent entirely offline

—

What hardware are you running local models on? Happy to discuss model selection, quantisation levels, or performance tuning in the replies.