# Ops / runbooks

## Deep dives

- [Deployment](deployment/README.md)
- [Production](production/README.md)
- [Troubleshooting](troubleshooting/README.md)

This page is the operator/SRE entry point. It intentionally links to existing deeper docs to minimize churn.

## “First 30 minutes” incident checklist

1. **Confirm user impact + scope**
   - What is broken: UI, API, auth, or gateway integration?
   - Is it all users or a subset?

2. **Check service health**
   - Backend: `/healthz` and `/readyz`
   - Frontend: can it load? does it reach the API?

3. **Check auth (Clerk) configuration**
   - Frontend: is Clerk enabled unexpectedly? (publishable key set)
   - Backend: is `CLERK_JWKS_URL` configured correctly?

4. **Check DB connectivity**
   - Can backend connect to Postgres (`DATABASE_URL`)?

5. **Check logs**
   - Backend logs for 5xx spikes or auth failures.
   - Frontend logs for proxy/API URL misconfig.

6. **Stabilize**
   - Roll back the last change if available.
   - Temporarily disable optional integrations (gateway) to isolate.

## Backups / restore (placeholder)

- Define backup cadence and restore steps once production deployment is finalized.
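
As an illustration of steps 2 to 4 of the incident checklist above, the sketch below probes the backend's `/healthz` and `/readyz` endpoints and reports which of `CLERK_JWKS_URL` and `DATABASE_URL` are unset. It is a minimal sketch, not part of the project: the function names are hypothetical, the base URL is whatever your deployment uses, and it assumes both endpoints answer with a 2xx status when healthy. A variable being set does not prove its value is correct.

```python
import os
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Mapping, Sequence


def fetch_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET on url, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return 0


def check_backend(
    base_url: str,
    paths: Sequence[str] = ("/healthz", "/readyz"),
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Probe each path under base_url and map path -> status code."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) for path in paths}


def summarize(results: Mapping[str, int]) -> List[str]:
    """Render one OK/FAIL line per probed path (2xx is assumed healthy)."""
    lines = []
    for path, status in results.items():
        label = "OK" if 200 <= status < 300 else "FAIL"
        lines.append(f"{label} {path} (status {status})")
    return lines


def missing_settings(
    names: Sequence[str] = ("CLERK_JWKS_URL", "DATABASE_URL"),
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in the environment."""
    return [name for name in names if not env.get(name)]
```

To use it, call `summarize(check_backend(base_url))` with the backend's base URL and print the returned lines, then run `missing_settings()` in the backend's environment. The `fetch` and `env` parameters are injectable so the logic can be exercised without a live service.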