Serving a local LLM for multiple apps or agents is a completely different problem from running it for yourself. These are the gaps that only appear once something else depends on your server being up.
What changes when others depend on your server
When you run Ollama for personal use, a crash is inconvenient. When three agents and a web app depend on it, a crash is an incident. The technology is the same in both cases; what changes is the operational requirements.
Rate limiting by caller
Without rate limiting, one poorly configured agent can saturate your inference server and starve all other callers. This is one of the first problems a shared self-hosted LLM server runs into.
Ollama does not have built-in per-caller rate limiting. Your options:
- Put a reverse proxy (nginx, Caddy, Traefik) in front of Ollama and rate limit at the proxy layer
- Use a lightweight API gateway (Kong, Tyk, or even a simple Node.js proxy)
- Rate limit in each client application
The proxy approach is cleanest. Here is a simple nginx rate limit config:
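# In the http {} block: one counter per client IP, refilled at 10 requests per minute
limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/m;

# In the server {} block:
location /api/ {
    limit_req zone=ollama burst=5 nodelay;
    proxy_pass http://localhost:11434;
}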
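This allows 10 requests per minute per client IP address, with up to 5 extra requests served immediately as a burst. Anything beyond that is rejected (with HTTP 503 by default). Because the key is $binary_remote_addr, callers that share an IP address also share a limit.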
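If you take the "simple proxy" route instead, or need limits keyed on a caller name rather than an IP address, the same idea can be sketched as a token bucket per caller. This is a minimal illustration, not nginx's exact algorithm; the class name `CallerRateLimiter` and the notion of a `caller` string (for example, taken from an API key or header) are assumptions made for the example.

```python
import time


class CallerRateLimiter:
    """Token bucket per caller: `rate_per_min` sustained, `burst` saved up."""

    def __init__(self, rate_per_min=10.0, burst=5):
        self.rate_per_min = rate_per_min
        self.burst = burst
        self._buckets = {}  # caller id -> (tokens left, time of last check)

    def allow(self, caller, now=None):
        """Return True if `caller` may send a request right now."""
        now = time.monotonic() if now is None else now
        # A caller seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(caller, (float(self.burst), now))
        # Refill in proportion to the time elapsed, capped at the burst size.
        refill = (now - last) * self.rate_per_min / 60.0
        tokens = min(float(self.burst), tokens + refill)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[caller] = (tokens, now)
        return allowed
```

A proxy would call `limiter.allow(caller)` before forwarding each request and return HTTP 429 when it comes back False. Each caller gets its own bucket, so one noisy agent cannot use up another caller's allowance.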
What to log and what not to log
Log:
- Request timestamp and caller identifier
- Model name used
- Token count (input + output)
- Latency (time to first token, total time)
- Error codes
Do not log:
- Prompt content (unless you need it for debugging and have appropriate access controls)
- Response content
- Any PII that might appear in requests
A structured log format makes this queryable:
{"ts": "2026-02-24T18:00:00Z", "caller": "agent-forum", "model": "llama3.2", "input_tokens": 450, "output_tokens": 280, "latency_ms": 1240, "status": 200}
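One way to produce a line in this format is a small helper that only accepts the metadata fields, so prompt and response text cannot end up in the log by accident. This is a sketch; the function name `build_log_record` is hypothetical, and the field names simply mirror the example above.

```python
import json
from datetime import datetime, timezone


def build_log_record(caller, model, input_tokens, output_tokens,
                     latency_ms, status, now=None):
    """Return one JSON log line with request metadata only."""
    now = now or datetime.now(timezone.utc)
    record = {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "caller": caller,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "status": status,
    }
    # Prompt and response content are deliberately not parameters.
    return json.dumps(record)


if __name__ == "__main__":
    print(build_log_record("agent-forum", "llama3.2", 450, 280, 1240, 200))
```

Writing one JSON object per line keeps the log easy to filter with standard tools and easy to load into a log aggregator later.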
Health checks
Ollama exposes GET /api/tags, which lists the locally installed models and is cheap enough to use as a health check endpoint. Configure your load balancer or monitoring to poll it every 30 seconds.
If you are serving multiple models, check that each model is loaded correctly after startup. A model that fails to load will not show an error on the health endpoint - requests to it will just fail at inference time.
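A post-startup check along these lines can catch a model that will not load before a real caller does. The sketch below sends each model a one-token request to Ollama's /api/generate endpoint and collects the failures. The function names, the "ping" prompt, and the 60-second timeout are assumptions chosen for illustration.

```python
import json
import urllib.request


def post_json(url, payload, timeout=60):
    """POST a JSON body and return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def probe_models(models, base_url="http://localhost:11434", post=post_json):
    """Return {model: error message} for each model that fails a tiny request."""
    failures = {}
    for model in models:
        payload = {
            "model": model,
            "prompt": "ping",
            "stream": False,
            "options": {"num_predict": 1},  # generate a single token
        }
        try:
            reply = post(f"{base_url}/api/generate", payload)
            if not reply.get("done"):
                failures[model] = "incomplete response"
        except (OSError, ValueError) as exc:  # network/HTTP errors, bad JSON
            failures[model] = str(exc)
    return failures


if __name__ == "__main__":
    print(probe_models(["llama3.2", "mistral"]))
```

An empty dictionary means every model answered. The `post` parameter exists only so the check can be exercised without a running server.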
The three most common failure modes
Context overflow: A request with a context larger than the model’s configured limit causes a silent failure or truncation. Set explicit context limits per model and validate incoming requests against them before they reach the model.
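A pre-flight check in the proxy might look like the sketch below. The limits in `CONTEXT_LIMITS` are made-up values, and the four-characters-per-token estimate is a rough heuristic rather than the model's real tokenizer, so treat the result as a conservative guard and not an exact count.

```python
# Hypothetical per-model limits; use the values you actually configured.
CONTEXT_LIMITS = {"llama3.2": 8192, "mistral": 4096}


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate (assumption: about 4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def validate_request(model, prompt, max_output_tokens, limits=CONTEXT_LIMITS):
    """Return (ok, reason) before the request is forwarded to the model."""
    if model not in limits:
        return False, f"unknown model: {model}"
    needed = estimate_tokens(prompt) + max_output_tokens
    if needed > limits[model]:
        return False, f"needs about {needed} tokens, limit is {limits[model]}"
    return True, "ok"
```

Rejecting an oversized request with a clear error is easier for callers to handle than a response generated from a silently truncated prompt.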
OOM (out of memory): When multiple large requests arrive simultaneously, the server can run out of VRAM or RAM and crash. On a constrained machine, set OLLAMA_MAX_LOADED_MODELS=1 so only one model stays in memory at a time, and queue concurrent requests rather than running them in parallel (OLLAMA_NUM_PARALLEL=1 limits each model to one request at a time).
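If the queueing lives in your own proxy, a small gate can do the job. The sketch below lets a fixed number of requests run at once, makes a bounded number wait, and refuses the rest outright. The class name `InferenceGate` and the default sizes are assumptions for illustration.

```python
import threading
from contextlib import contextmanager


class InferenceGate:
    """Run at most `max_concurrent` requests; let `max_waiting` more queue."""

    def __init__(self, max_concurrent=1, max_waiting=8):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        # Counts everything admitted: running plus waiting.
        self._admitted = threading.BoundedSemaphore(max_concurrent + max_waiting)

    @contextmanager
    def slot(self):
        # Refuse right away if the queue is already full.
        if not self._admitted.acquire(blocking=False):
            raise RuntimeError("inference queue full")
        try:
            with self._slots:  # blocks until a running slot frees up
                yield
        finally:
            self._admitted.release()
```

A request handler would wrap the upstream call in `with gate.slot():` and translate the RuntimeError into an HTTP 503 so callers know to back off and retry.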
Disk full: Model files, logs, and Ollama’s temporary files all consume disk. Set up a disk usage alert at 80% capacity. A full disk causes unpredictable failures across everything.
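The 80% alert can be as small as the sketch below, run from cron or a monitoring agent. The path to check is an assumption; point it at whichever filesystem holds your model files and logs.

```python
import shutil


def check_disk(path="/", threshold_percent=80.0):
    """Return (alert, percent_used) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    return percent_used >= threshold_percent, percent_used


if __name__ == "__main__":
    alert, pct = check_disk("/")
    state = "ALERT" if alert else "ok"
    print(f"{state}: disk at {pct:.1f}% of capacity")
```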
Startup reliability
If you are running Ollama as a system service, ensure it restarts automatically on failure (Restart=on-failure in the systemd unit). Set a delay between restart attempts (RestartSec=5, for example) so a hard crash does not spin in a tight loop.
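Pre-load your most used models on startup so the first inference request does not bear the cold-start latency. Note that ollama pull only downloads a model; a generate request with no prompt is what loads it into memory. If OLLAMA_MAX_LOADED_MODELS=1 is set, only the last model in the list stays loaded.
# In your startup script (run once the Ollama server is up):
for model in llama3.2 mistral; do
  ollama pull "$model"
  # keep_alive -1 asks Ollama to keep the model loaded indefinitely
  curl -s http://localhost:11434/api/generate -d "{\"model\": \"$model\", \"keep_alive\": -1}" > /dev/null
done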
What has broken in your self-hosted LLM setup that you did not see coming?
Curated by Selendia AI 🧰