Self-Hosted LLM Serving: The Gaps That Only Show Up in Production

Tomas

Serving a local LLM for multiple apps or agents is a completely different problem from running it for yourself. These are the gaps that only appear once something else depends on your server being up.

What changes when others depend on your server

When you run Ollama for personal use, a crash is inconvenient. When three agents and a web app depend on it, a crash is an incident. The gap is not the technology - it is the operational requirements.

Rate limiting by caller

Without rate limiting, one poorly configured agent can saturate your inference server and starve all other callers. This is one of the most common ways a shared self-hosted LLM server fails.

Ollama does not have built-in per-caller rate limiting. Your options:

Put a reverse proxy (nginx, Caddy, Traefik) in front of Ollama and rate limit at the proxy layer
Use a lightweight API gateway (Kong, Tyk, or even a simple Node.js proxy)
Rate limit in each client application

The proxy approach is cleanest. Here is a simple nginx rate limit config:

limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/m;

location /api/ {
    limit_req zone=ollama burst=5 nodelay;
    proxy_pass http://localhost:11434;
}

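This allows 10 requests per minute (about one every 6 seconds) per client IP address, with up to 5 extra requests accepted immediately as a burst. Anything beyond that is rejected, with a 503 by default. The limit_req_zone line belongs in the http block and the location block inside a server block. Because the key is $binary_remote_addr, callers that share a host also share a limit; to limit per agent rather than per IP, key the zone on something that identifies the caller, such as an API key header.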
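If you go with the "simple proxy" option instead and want limits keyed on a caller identifier rather than an IP address, the core logic is a token bucket per caller. This is a minimal sketch, not a drop-in replacement for the nginx config: the class name CallerRateLimiter and the caller IDs are made up for illustration, and the bucket only approximates nginx's burst behaviour.

```python
import time


class CallerRateLimiter:
    """Token bucket per caller: `rate_per_min` sustained, `burst` at once."""

    def __init__(self, rate_per_min, burst):
        self.rate_per_sec = rate_per_min / 60.0
        self.burst = float(burst)
        self._buckets = {}  # caller_id -> (tokens_left, last_seen_time)

    def allow(self, caller_id, now=None):
        """Return True if this caller may make a request right now."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(caller_id, (self.burst, now))
        # Refill for the time elapsed since this caller's last request.
        tokens = min(self.burst, tokens + (now - last) * self.rate_per_sec)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[caller_id] = (tokens, now)
        return allowed


if __name__ == "__main__":
    limiter = CallerRateLimiter(rate_per_min=10, burst=5)
    for i in range(7):
        print(i, limiter.allow("agent-forum"))
```

Each caller gets its own bucket, so one noisy agent uses up its own allowance without touching anyone else's.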

What to log and what not to log

Log:

Request timestamp and caller identifier
Model name used
Token count (input + output)
Latency (time to first token, total time)
Error codes

Do not log:

Prompt content (unless you need it for debugging and have appropriate access controls)
Response content
Any PII that might appear in requests

A structured log format makes this queryable:

{"ts": "2026-02-24T18:00:00Z", "caller": "agent-forum", "model": "llama3.2", "input_tokens": 450, "output_tokens": 280, "latency_ms": 1240, "status": 200}
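A small helper keeps the log format consistent across callers. The sketch below produces a line with the same fields as the example above; the function name make_log_line is hypothetical. It deliberately takes no prompt or response argument, so content cannot end up in the log by accident.

```python
import json
from datetime import datetime, timezone


def make_log_line(caller, model, input_tokens, output_tokens,
                  latency_ms, status, ts=None):
    """Build one JSON log line with request metadata only (no content)."""
    ts = ts or datetime.now(timezone.utc)
    record = {
        "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "caller": caller,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "status": status,
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(make_log_line("agent-forum", "llama3.2", 450, 280, 1240, 200))
```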

Health checks

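Ollama exposes GET /api/tags, which lists the models available locally and is cheap enough to use as a health check endpoint. Configure your load balancer or monitoring to poll this every 30 seconds. Keep in mind that a 200 here only tells you the server is responding, not that any model is loaded; GET /api/ps lists the models currently in memory.

If you are serving multiple models, check that each model is loaded correctly after startup, for example by sending each one a short test request. A model that fails to load will not show an error on the health endpoint - requests to it will just fail at inference time.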
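A per-model check can be a few lines in your monitoring script. This sketch assumes the response is JSON with a "models" list whose entries have a "name" field that includes the tag (for example "llama3.2:latest"); verify that against your Ollama version. The helper names and the expected-model list are illustrative.

```python
import json
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def missing_models(payload, expected):
    """Return the expected models that do not appear in the payload."""
    present = {m["name"] for m in payload.get("models", [])}
    missing = []
    for name in expected:
        # Assumption: untagged names resolve to the ":latest" tag.
        wanted = name if ":" in name else name + ":latest"
        if wanted not in present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Use /api/ps instead to check what is actually loaded in memory.
    tags = fetch_json("http://localhost:11434/api/tags")
    print("missing:", missing_models(tags, ["llama3.2", "mistral"]))
```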

The three most common failure modes

Context overflow: A request with a context larger than the model's configured limit causes a silent failure or truncation. Set explicit context limits per model (in Ollama, the num_ctx option) and validate incoming requests against them before they reach the model.
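The validation can live in the proxy or in each client. Here is a rough sketch: the limits in CONTEXT_LIMITS are placeholder values, not the models' real limits, and the 4-characters-per-token estimate is a crude assumption. Use the model's actual tokenizer if you need accuracy.

```python
# Hypothetical per-model context limits (tokens); set these to match
# whatever you configured for each model.
CONTEXT_LIMITS = {"llama3.2": 8192, "mistral": 8192}


def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def check_request(model, prompt, max_output_tokens):
    """Return (ok, reason). Reject before the request reaches the model."""
    limit = CONTEXT_LIMITS.get(model)
    if limit is None:
        return False, f"unknown model: {model}"
    needed = estimate_tokens(prompt) + max_output_tokens
    if needed > limit:
        return False, f"needs ~{needed} tokens, limit is {limit}"
    return True, "ok"


if __name__ == "__main__":
    print(check_request("llama3.2", "Summarise this thread.", 256))
    print(check_request("llama3.2", "x" * 40000, 256))
```

Rejecting with a clear reason turns a silent truncation into an error the caller can see and fix.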

OOM (out of memory): When multiple large requests arrive simultaneously, the server runs out of VRAM or RAM and crashes. On a constrained machine, set OLLAMA_MAX_LOADED_MODELS=1 so only one model is resident at a time, and consider OLLAMA_NUM_PARALLEL=1 so each model handles one request at a time. Queue concurrent requests rather than running them in parallel.
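If your callers go through your own proxy process, queueing can be as simple as a semaphore around the inference call. A minimal sketch, assuming a threaded proxy; InferenceGate and fake_inference are made-up names, and fake_inference stands in for the real HTTP call to Ollama.

```python
import threading
import time


class InferenceGate:
    """Lets at most `max_parallel` inference calls run at once."""

    def __init__(self, max_parallel=1):
        self._slots = threading.BoundedSemaphore(max_parallel)

    def run(self, fn, *args, timeout=None, **kwargs):
        # Callers block here (i.e. queue) until a slot frees up.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("waited too long for an inference slot")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


def fake_inference(prompt):
    """Placeholder for the real request to the model server."""
    time.sleep(0.01)
    return f"echo: {prompt}"


if __name__ == "__main__":
    gate = InferenceGate(max_parallel=1)
    print(gate.run(fake_inference, "hello"))
```

The timeout matters: without it, a stuck request leaves every other caller waiting forever instead of failing fast.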

Disk full: Model files, logs, and Ollama’s temporary files all consume disk. Set up a disk usage alert at 80% capacity. A full disk causes unpredictable failures across everything.
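If you do not have a monitoring stack yet, a cron job can cover the disk alert. This is a sketch using only the standard library; the path to check is whatever volume holds your models and logs (the "/" below is a placeholder), and the print is where you would hook in your own alerting.

```python
import shutil


def disk_used_fraction(path):
    """Fraction of the volume containing `path` that is in use (0.0-1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(path, threshold=0.80):
    """Return (should_alert, used_fraction) for the given path."""
    used = disk_used_fraction(path)
    return used >= threshold, used


if __name__ == "__main__":
    alert, used = disk_alert("/")  # placeholder path
    if alert:
        print(f"WARNING: disk {used:.0%} full")
```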

Startup reliability

If you are running Ollama as a system service, ensure it restarts automatically on failure (Restart=on-failure in the systemd unit). Set a delay between restart attempts (RestartSec= in the same unit) so a hard crash does not spin in a tight loop.

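Pre-load your most used models on startup so the first inference request does not bear the cold-start latency. Note that ollama pull only downloads a model to disk; a generate request with no prompt is what loads it into memory. If you set OLLAMA_MAX_LOADED_MODELS=1, only the last model loaded stays resident, so pre-load just the one you need most:

# In your startup script:
# Download the models if they are not already on disk.
ollama pull llama3.2
ollama pull mistral
# Load into memory; keep_alive -1 keeps the model loaded instead of unloading it when idle.
curl -s http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}' > /dev/null
curl -s http://localhost:11434/api/generate -d '{"model": "mistral", "keep_alive": -1}' > /dev/null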

What has broken in your self-hosted LLM setup that you did not see coming?

Curated by Selendia AI 🧰