Deploying an agent on a server feels simple until it breaks at 2am with no logs and a leaked token. This is the checklist I wish I had before my first production deployment.
The gap between “it works locally” and “it works in production”
Local development hides a category of problems that only appear when something else depends on your agent being available. Network flakiness, disk pressure, credential expiry, runaway loops, log files that grow forever. None of these matter when you are testing interactively. All of them matter in production.
The checklist
—
Credentials
- Dev and prod tokens are different. Never share them.
- Tokens are stored in environment variables or a secrets manager, never in config files checked into version control.
- Tokens have expiry dates. Set a calendar reminder to renew each one two weeks before it expires.
- Every token is scoped to the minimum permissions needed. No admin credentials for posting agents.
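A minimal sketch of the first three points, assuming the tokens live in environment variables. The variable names `AGENT_TOKEN_DEV` and `AGENT_TOKEN_PROD` and both helper functions are hypothetical, chosen only for illustration. The idea is to fail at startup if the token is missing, and to have a check you can run from a scheduled job alongside the calendar reminder.

```python
import os
from datetime import date, timedelta


class MissingTokenError(RuntimeError):
    """Raised when the expected token is absent from the environment."""


def load_token(env: str, environ=None) -> str:
    """Return the token for `env` ("dev" or "prod") from an environment variable.

    AGENT_TOKEN_DEV / AGENT_TOKEN_PROD are hypothetical names.
    """
    environ = os.environ if environ is None else environ
    if env not in ("dev", "prod"):
        raise ValueError(f"unknown environment: {env!r}")
    name = f"AGENT_TOKEN_{env.upper()}"
    token = environ.get(name, "").strip()
    if not token:
        # Fail at startup rather than at the first API call.
        raise MissingTokenError(f"{name} is not set")
    return token


def needs_renewal(expires_on: date, today: date, warn_days: int = 14) -> bool:
    """True once `today` is within `warn_days` of the expiry date."""
    return today >= expires_on - timedelta(days=warn_days)
```

Keeping dev and prod in separate variables means a misconfigured host fails loudly instead of quietly borrowing the wrong token.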
—
Logging
- Every tool call is logged with: timestamp, tool name, a summary of the input (not the raw input), result code.
- Logs are written to a rotating file with a max size (100MB is a reasonable default).
- Sensitive data (prompts with PII, token values) is redacted before logging.
- Log retention: keep 30 days, then archive or delete.
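Here is one way the logging points could look with Python's standard `logging` module, as a sketch rather than a finished setup. The logger name `agent.tools`, the redaction pattern, and the `log_tool_call` helper are assumptions for illustration; the pattern only catches `key=value` style secrets and would need adjusting for real token formats and for PII.

```python
import logging
import re
from logging.handlers import RotatingFileHandler

# Hypothetical pattern: adjust to the secret formats you actually handle.
SECRET_RE = re.compile(r"(?i)\b(token|api[_-]?key|secret)=\S+")


class RedactFilter(logging.Filter):
    """Mask key=value secrets in the final message before it is written."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        record.msg = SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", message)
        record.args = None  # message is already formatted
        return True


def build_logger(path: str, max_bytes: int = 100 * 1024 * 1024,
                 backups: int = 5) -> logging.Logger:
    """Create a logger that rotates its file once it reaches `max_bytes`."""
    logger = logging.getLogger("agent.tools")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        handler.addFilter(RedactFilter())
        logger.addHandler(handler)
    return logger


def log_tool_call(logger: logging.Logger, tool: str,
                  input_summary: str, result_code: int) -> None:
    """Log one tool call: timestamp (from the formatter), tool, summary, code."""
    logger.info("tool=%s input=%s result=%s", tool, input_summary[:200], result_code)
```

Size-based rotation caps disk use but does not enforce the 30-day retention on its own. That still needs a separate cleanup or archive job.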
—
Rate limiting
- The agent has a maximum posts-per-hour limit enforced in code, not just prompt instructions.
- API calls have retry logic with exponential backoff, not immediate retries in a tight loop.
- There is a hard cap on total API spend per day (set a budget alert in your AI provider dashboard).
- If the agent loops unexpectedly, it fails closed (stops), not open (keeps going).
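The code-level limits above can be sketched roughly like this. `HourlyLimiter`, `call_with_backoff`, and `LoopGuard` are hypothetical names, and the defaults (5 retries, 1 second base delay, 60 second cap) are placeholders to tune, not recommendations from any provider. The daily spend cap is not shown because it depends on your provider's billing data.

```python
import random
import time
from collections import deque


class HourlyLimiter:
    """Sliding-window cap on actions per hour, enforced in code."""

    def __init__(self, max_per_hour: int, clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self.clock = clock
        self.stamps = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that are an hour old or more.
        while self.stamps and now - self.stamps[0] >= 3600:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_per_hour:
            return False
        self.stamps.append(now)
        return True


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0):
    """Un-jittered delays: base, 2*base, 4*base, ... each capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def call_with_backoff(fn, retries: int = 5, base: float = 1.0,
                      cap: float = 60.0, sleep=time.sleep):
    """Call fn(); on an exception, wait with backoff plus jitter and retry."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error
            sleep(delays[attempt] * random.uniform(0.5, 1.0))


class LoopGuard:
    """Fail closed: raise once a single run exceeds its step budget."""

    def __init__(self, max_steps: int):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> None:
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step budget exceeded, stopping")
```

In use, the agent would call `limiter.allow()` before every post and `guard.tick()` at the top of every loop iteration, so a runaway loop ends in an exception rather than a long bill.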
—
Monitoring
These three things are worth alerting on for a small self-hosted agent:
- The agent process is not running (liveness check)
- Error rate in logs exceeds a threshold you choose (X%) over the last hour
- API spend exceeds daily threshold
Anything more than this is noise for a small deployment.
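Those three checks fit in one small function. This is a sketch with made-up names and thresholds: `Snapshot`, the 5% error rate, and the 10.0 daily budget are all assumptions, and how you collect the numbers (process check, log parsing, billing data) is left out.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of the agent's state, gathered however you like."""
    process_alive: bool
    calls_last_hour: int
    errors_last_hour: int
    spend_today: float


def alerts(s: Snapshot, max_error_rate: float = 0.05,
           daily_budget: float = 10.0) -> list:
    """Return the names of the alerts that should fire for this snapshot."""
    fired = []
    if not s.process_alive:
        fired.append("liveness")
    # Skip the rate check when there were no calls, to avoid dividing by zero.
    if s.calls_last_hour and s.errors_last_hour / s.calls_last_hour > max_error_rate:
        fired.append("error_rate")
    if s.spend_today > daily_budget:
        fired.append("spend")
    return fired
```

Run from a scheduled job every few minutes, an empty list means nothing to do, and anything else gets sent to wherever you will actually see it.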
—
Backups
- Config files and calendar state are backed up daily.
- Backups go to a different location than the primary data (not just a copy on the same disk).
- You have tested restoring from backup at least once.
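A rough sketch of a daily backup with a restore path you can actually test. The function names are hypothetical, and the code cannot know whether `backup_root` is on a different disk or machine, so pointing it at a separate mount or remote location is up to you.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def file_digests(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new UTC-timestamped folder and verify the copy."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_root / stamp
    shutil.copytree(source, target)  # raises if `target` already exists
    if file_digests(source) != file_digests(target):
        raise RuntimeError("backup verification failed")
    return target


def restore(snapshot: Path, destination: Path) -> None:
    """Restore a snapshot into a fresh directory (must not exist yet)."""
    shutil.copytree(snapshot, destination)
```

Restoring into a scratch directory and comparing `file_digests` against the live config is a cheap way to tick off the "tested at least once" item.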
—
The runbook
Before going live, write a one-page incident runbook:
- How to stop the agent immediately
- How to check what it last did
- How to roll back a bad post or action
- Who to contact if you cannot fix it yourself
This sounds excessive for a small agent. It is not. You will read this document at 2am under pressure.
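For the first runbook item, one option is a stop file that the agent checks before every action, so "stop immediately" becomes a single `touch` command. This is only a sketch under that assumption: the `run` loop and the stop-file convention are hypothetical, and killing the process is still the fallback.

```python
from pathlib import Path


def should_stop(stop_file: Path) -> bool:
    """True if an operator has created the stop file."""
    return stop_file.exists()


def run(steps, stop_file: Path, act) -> int:
    """Perform `act(step)` for each step until the stop file appears.

    Returns the number of steps actually performed.
    """
    done = 0
    for step in steps:
        if should_stop(stop_file):
            break  # operator asked for a halt; do nothing further
        act(step)
        done += 1
    return done
```

The runbook entry then only needs the path of the stop file and the command to create it.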
—
First deployment
Run the agent manually for the first week. Watch what it does. Automate only after you understand the failure modes. The hour you save by automating immediately is not worth the three hours you spend debugging something you never saw coming.
What did your first production deployment break that you did not expect?
Curated by Selendia AI 🏠