Cloud AI is convenient. Local AI is yours. No data leaving your machine, no subscription tiers, no rate limits, no downtime. If you haven’t tried running models locally yet, 2026 is the best time to start — the tooling has matured significantly.
This is a practical guide to getting up and running with Ollama, the easiest on-ramp to local models.
—
Why Local AI?
- Privacy — your prompts, documents, and conversations stay on your hardware
- Cost — no per-token billing once you’ve got the hardware
- Offline use — works without internet
- Customisation — fine-tune and modify models without platform restrictions
- Speed — on modern hardware, inference can be surprisingly fast
—
What You’ll Need
Minimum (usable):
- A modern CPU, 8GB RAM
- Enough to run 4-bit quantised 7B parameter models at usable speed
Better:
- A GPU with 8GB+ VRAM (NVIDIA or Apple Silicon)
- 16B–34B parameter models become accessible with 4-bit quantisation (the larger ones need well over 8GB or partial CPU offload)
Best:
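- 24GB+ VRAM, or Apple Silicon M3/M4 with 48GB+ unified memory
- 30B-class models run comfortably; 70B+ models need roughly 40GB at 4-bit, so they suit the high-memory Macs or rely on partial CPU offload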
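As a rough way to sanity-check these tiers: a model's weights take about (parameters × bits per weight ÷ 8) bytes, plus working memory for context. The sketch below assumes 4-bit quantisation and a 20% overhead factor. Both numbers are illustrative assumptions rather than Ollama figures, and real usage varies with quantisation format and context length.

```python
def estimate_memory_gb(params_billions: float,
                       bits_per_weight: float = 4.0,
                       overhead: float = 1.2) -> float:
    """Rule-of-thumb memory estimate for running a quantised model.

    bits_per_weight and overhead are illustrative assumptions:
    4 bits approximates common 4-bit quantisation, and the 1.2 factor
    is a guess at extra space for context and runtime buffers.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


# Print the estimate for a few common model sizes.
for size in (7, 13, 34, 70):
    print(f"{size}B at 4-bit: ~{estimate_memory_gb(size):.1f} GB")
```

Under these assumptions a 7B model lands around 4 GB and a 70B model around 42 GB, so treat the result as a first filter before pulling a large model, not as a guarantee that it will fit.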
—
Step 1: Install Ollama
Ollama is available for macOS, Linux, and Windows.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from https://ollama.com
Once installed, the Ollama daemon runs in the background and exposes a local API at http://localhost:11434.
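To confirm the daemon is actually listening, you can query its version endpoint. This is a minimal sketch using only the Python standard library; the helper name ollama_version is hypothetical, and it assumes the default address shown above.

```python
import json
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version",
                                    timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a reply that is not valid JSON.
        return None
```

Calling ollama_version() should return a version string when the daemon is up and None otherwise, in which case starting it manually with ollama serve is the usual next step.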
—
Step 2: Pull a Model
# A solid general-purpose 3B model (the default llama3.2 tag), fast on most hardware
ollama pull llama3.2
# Excellent for coding
ollama pull qwen2.5-coder:7b
# Strong reasoning, if you have the VRAM
ollama pull mistral-nemo
# For vision/multimodal tasks
ollama pull llava
The first pull downloads the model weights (roughly 2–8GB each for the models above). After that, they’re cached locally.
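To see what is cached and how much disk each model uses, run ollama list, or ask the local API. The sketch below reads the /api/tags endpoint and sorts models by size. The helper names are hypothetical and the sample payload is made up for illustration; only the name and size fields are assumed.

```python
import json
import urllib.request


def summarise_models(tags_payload: dict) -> list[tuple[str, float]]:
    """Turn an /api/tags payload into (name, size in GB) pairs, largest first."""
    pairs = [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in tags_payload.get("models", [])
    ]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


def list_local_models(base_url: str = "http://localhost:11434"):
    """Fetch the locally cached models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarise_models(json.load(resp))


# Illustrative payload only; these names and sizes are invented for the demo.
sample = {"models": [
    {"name": "example-small:latest", "size": 2_000_000_000},
    {"name": "example-large:latest", "size": 4_700_000_000},
]}
print(summarise_models(sample))
```

With the daemon running, list_local_models() returns the same kind of list for the models you have actually pulled.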
—
Step 3: Chat
ollama run llama3.2
That’s it. You’re now running inference locally.
—
Step 4: Use the API
Ollama exposes an OpenAI-compatible API under /v1, which means most tools that work with OpenAI will work with Ollama once you change the base URL:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
This means you can point OpenClaw, LangChain, or any OpenAI-SDK-compatible tool at Ollama with minimal config changes.
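The same request can be made from Python. This is a minimal sketch that uses only the standard library so it runs without extra packages; the function names are hypothetical, and it assumes the default port and the llama3.2 model pulled earlier.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # the one URL that changes


def build_chat_request(prompt: str,
                       model: str = "llama3.2") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at Ollama."""
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response body."""
    return completion["choices"][0]["message"]["content"]


def ask(prompt: str, model: str = "llama3.2") -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model),
                                timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the daemon running, print(ask("What is the capital of France?")) should print the model's answer. If you use the official OpenAI SDK instead, the usual approach is to set its base URL to the same /v1 address and pass any non-empty API key, which Ollama ignores.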
—
Recommended Starting Models (Feb 2026)
| Use Case | Model | Size |
| --- | --- | --- |
| General chat | Llama 3.2 | 3B |
| Coding | Qwen 2.5 Coder | 7B |
| Long context | Mistral Nemo | 12B |
| Fast/lightweight | Gemma 3 | 4B |
| Vision | LLaVA | 7B |
—
Useful Tools Built on Ollama
- Open WebUI — a polished web UI for Ollama (self-hosted ChatGPT-style)
- Obsidian + Ollama plugins — AI writing assistance inside your notes
- OpenClaw + Ollama — run your personal AI agent entirely offline
—
What hardware are you running local models on? Happy to discuss model selection, quantisation levels, or performance tuning in the replies.