Incident Response Flow
This map shows how a failed service moves from detection through resolution to its changelog entry. Read it when a service goes down, when a Discord ops alert fires, or when debugging why an automated alert did not fire within 24 hours (G-FAILED-SERVICE-MTTR requires remediation within 24h). Documenting the full flow keeps silent failures from persisting.
Diagram
flowchart LR
    A[Service enters failed state] --> B{Daily cron check}
    B -->|> 24h in failed| C[Discord #ops alert]
    B -->|< 24h| B
    C --> D[SESSION-AUDIT.md OPEN ISSUES]
    D --> E{Henry triages}
    E -->|investigate| F["/gsd:debug skill"]
    E -->|simple fix known| G[Direct fix]
    F --> H[Root cause identified]
    G --> H
    H --> I{Fixable?}
    I -->|yes| J[Apply fix]
    I -->|no - disable| K[systemctl disable unit]
    I -->|no - archive| L[Archive per feedback_archive_not_delete]
    J --> M[Smoke test / verify]
    K --> M
    L --> M
    M -->|pass| N[CHANGELOG.md entry]
    M -->|fail| F
    N --> O[Close OPEN ISSUE in SESSION-AUDIT.md]
    O --> P[Post-tool audit log via on-post-tool.sh]
    P --> Q[Resolved]
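The transitions in the diagram can also be written down as a lookup table, which makes it easy to check that every path ends in Resolved. The sketch below is only an illustration transcribed from the flowchart: the names FLOW and walk are hypothetical and are not part of any existing tooling.

```python
# Transition table transcribed from the flowchart. Keys are node labels;
# inner keys are edge labels, with "" marking an unconditional edge.
FLOW = {
    "Service enters failed state": {"": "Daily cron check"},
    "Daily cron check": {
        "> 24h in failed": "Discord #ops alert",
        "< 24h": "Daily cron check",
    },
    "Discord #ops alert": {"": "SESSION-AUDIT.md OPEN ISSUES"},
    "SESSION-AUDIT.md OPEN ISSUES": {"": "Henry triages"},
    "Henry triages": {
        "investigate": "/gsd:debug skill",
        "simple fix known": "Direct fix",
    },
    "/gsd:debug skill": {"": "Root cause identified"},
    "Direct fix": {"": "Root cause identified"},
    "Root cause identified": {"": "Fixable?"},
    "Fixable?": {
        "yes": "Apply fix",
        "no - disable": "systemctl disable unit",
        "no - archive": "Archive per feedback_archive_not_delete",
    },
    "Apply fix": {"": "Smoke test / verify"},
    "systemctl disable unit": {"": "Smoke test / verify"},
    "Archive per feedback_archive_not_delete": {"": "Smoke test / verify"},
    "Smoke test / verify": {
        "pass": "CHANGELOG.md entry",
        "fail": "/gsd:debug skill",
    },
    "CHANGELOG.md entry": {"": "Close OPEN ISSUE in SESSION-AUDIT.md"},
    "Close OPEN ISSUE in SESSION-AUDIT.md": {
        "": "Post-tool audit log via on-post-tool.sh"
    },
    "Post-tool audit log via on-post-tool.sh": {"": "Resolved"},
    "Resolved": {},
}


def walk(start, decisions):
    """Follow the flow from `start`, consuming one decision per branching node.

    Stops at a terminal node or when the decisions run out at a branch.
    An unknown decision label raises KeyError.
    """
    path = [start]
    remaining = iter(decisions)
    node = start
    while FLOW[node]:
        edges = FLOW[node]
        if "" in edges:
            node = edges[""]
        else:
            choice = next(remaining, None)
            if choice is None:
                break
            node = edges[choice]
        path.append(node)
    return path


if __name__ == "__main__":
    decisions = ["> 24h in failed", "simple fix known", "yes", "pass"]
    for step in walk("Service enters failed state", decisions):
        print(step)
```

A failed smoke test shows up as a second visit to the /gsd:debug skill node in the returned path, which matches the fail edge in the diagram.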
How to read this
- The daily cron job runs systemctl --user list-units --state=failed; any service that has been failed for more than 24h triggers a Discord #ops alert (G-FAILED-SERVICE-MTTR gate).
- SESSION-AUDIT.md OPEN ISSUES is the primary tracking surface: every incident must have an entry here so it survives context loss between sessions.
- The /gsd:debug skill uses the scientific-method debugging workflow with persistent session state; invoke it for any multi-step investigation to avoid losing context mid-debug.
- Three resolution paths: (a) fix and verify, (b) explicitly disable the unit, (c) archive per feedback_archive_not_delete.md. You may NOT leave a failed service indefinitely without choosing one of these.
- The on-post-tool.sh hook writes every tool use to the audit log at /home/opsadmin/.openclaw/logs/claude-code-audit.log, so the incident timeline can be reconstructed from this log.
- CHANGELOG.md must be touched within 14 days (G-GOVERNANCE-LOG-FRESHNESS). Every incident resolution provides a natural entry point.
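The two time-based gates in the list above (24h for failed services, 14 days for CHANGELOG.md) reduce to simple threshold checks. The following is a minimal sketch and not the actual cron job: the function names and sample unit names are hypothetical, the parser assumes the listing was produced with the --plain --no-legend flags added to the command above, and how the real job learns when a unit entered the failed state is not specified here.

```python
from datetime import datetime, timedelta, timezone

MTTR_LIMIT = timedelta(hours=24)      # threshold from G-FAILED-SERVICE-MTTR
CHANGELOG_LIMIT = timedelta(days=14)  # threshold from G-GOVERNANCE-LOG-FRESHNESS


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from the output of
    `systemctl --user list-units --state=failed --plain --no-legend`.

    Assumes the unit name is the first whitespace-separated column.
    """
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def overdue_units(failed_since: dict[str, datetime], now: datetime) -> list[str]:
    """Return, sorted by name, the units failed for longer than MTTR_LIMIT."""
    return sorted(
        unit for unit, since in failed_since.items() if now - since > MTTR_LIMIT
    )


def changelog_is_stale(last_touched: datetime, now: datetime) -> bool:
    """True once CHANGELOG.md has gone 14 days or more without a touch.

    Treating exactly 14 days as stale is an assumption (the stricter reading).
    """
    return now - last_touched >= CHANGELOG_LIMIT


if __name__ == "__main__":
    # Hypothetical sample data; real timestamps would have to come from systemd.
    now = datetime(2025, 1, 3, 12, 0, tzinfo=timezone.utc)
    failed_since = {
        "example-a.service": now - timedelta(hours=30),
        "example-b.service": now - timedelta(hours=2),
    }
    print("alert on:", overdue_units(failed_since, now))
    print("changelog stale:", changelog_is_stale(now - timedelta(days=20), now))
```

Keeping the threshold logic separate from the systemctl call means the comparison can be tested without a running user session.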
Related
- governance-gates-network — shows G-FAILED-SERVICE-MTTR in the full gates network alongside G-GOVERNANCE-LOG-FRESHNESS
- ports-topology — service ports help identify which unit is in failed state
- cost-flow — cost-monitor Discord alerts follow the same response flow
- auth-chain-map — webhook handler failures (failed auth) are a common incident type
See also
- CLAUDE.md — G-FAILED-SERVICE-MTTR gate (Amendment §A1 cascade-failure gates)
- CLAUDE.md — Hook Lifecycle (on-post-tool.sh audit logger, on-stop.sh bridge writer)
- SESSION-AUDIT.md — OPEN ISSUES section where incidents are tracked
- CHANGELOG.md — where resolution entries are written (must stay < 14 days stale)
- AUDIT-LOG.md — fine-grained tool-use audit trail per session