Incident Response Flow

This map shows how a failed service moves from detection through resolution to a changelog entry. Read it when a service goes down, when a Discord #ops alert fires, or when debugging why the automated alert did not fire for a service that has been failed for more than 24 hours (G-FAILED-SERVICE-MTTR requires remediation within 24h). Documenting every step keeps a failure from persisting silently.

Diagram

flowchart LR
    A[Service enters failed state] --> B{Daily cron check}
    B -->|> 24h in failed| C[Discord #ops alert]
    B -->|< 24h| B
    C --> D[SESSION-AUDIT.md OPEN ISSUES]
    D --> E{Henry triages}
    E -->|investigate| F["/gsd:debug skill"]
    E -->|simple fix known| G[Direct fix]
    F --> H[Root cause identified]
    G --> H
    H --> I{Fixable?}
    I -->|yes| J[Apply fix]
    I -->|no - disable| K[systemctl disable unit]
    I -->|no - archive| L[Archive per feedback_archive_not_delete]
    J --> M[Smoke test / verify]
    K --> M
    L --> M
    M -->|pass| N[CHANGELOG.md entry]
    M -->|fail| F
    N --> O[Close OPEN ISSUE in SESSION-AUDIT.md]
    O --> P[Post-tool audit log via on-post-tool.sh]
    P --> Q[Resolved]
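
The first decision node in the diagram (the daily cron check with its 24-hour threshold) can be sketched in Python as below. This is a minimal illustration, not the actual cron job: the function names, the sample unit names, and the use of the `--plain --no-legend` flags to simplify parsing are all assumptions. How the real check learns when a unit entered the failed state is not stated in this document, so the sketch takes those timestamps as input.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from G-FAILED-SERVICE-MTTR (24h).
MTTR_LIMIT = timedelta(hours=24)


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from the text printed by
    `systemctl --user list-units --state=failed --plain --no-legend`.
    (Adding --plain and --no-legend is an assumption made for easy parsing.)
    The unit name is the first whitespace-separated column of each line."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def overdue_units(failed_since: dict[str, datetime], now: datetime) -> list[str]:
    """Return, sorted by name, the units failed for strictly more than MTTR_LIMIT.
    `failed_since` maps unit name to the time it entered the failed state;
    where that time comes from is left open (hypothetical input)."""
    return sorted(
        unit for unit, since in failed_since.items() if now - since > MTTR_LIMIT
    )


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
    failed_since = {
        "cost-monitor.service": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
        "webhook-handler.service": datetime(2024, 1, 3, 6, 0, tzinfo=timezone.utc),
    }
    print(overdue_units(failed_since, now))  # ['cost-monitor.service']
```

Only units returned by `overdue_units` would proceed to the Discord #ops alert step; everything else loops back to the next daily check.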

How to read this

  • A daily cron job runs systemctl --user list-units --state=failed; any service that has stayed failed for more than 24h triggers a Discord #ops alert (G-FAILED-SERVICE-MTTR gate).
  • SESSION-AUDIT.md OPEN ISSUES is the primary tracking surface — every incident must have an entry here so it survives context loss between sessions.
  • /gsd:debug skill uses the scientific-method debugging workflow with persistent session state — invoke for any multi-step investigation to avoid losing context mid-debug.
  • Three resolution paths: (a) fix and verify, (b) explicitly disable the unit, (c) archive per feedback_archive_not_delete.md — you may NOT leave a failed service indefinitely without choosing one of these.
  • on-post-tool.sh hook writes every tool-use to the audit log at /home/opsadmin/.openclaw/logs/claude-code-audit.log — the incident timeline is reconstructable from this log.
  • CHANGELOG.md must be updated at least once every 14 days (G-GOVERNANCE-LOG-FRESHNESS). Every incident resolution is a natural occasion to add an entry.

Related maps

  • governance-gates-network — shows G-FAILED-SERVICE-MTTR in the full gates network alongside G-GOVERNANCE-LOG-FRESHNESS
  • ports-topology — service ports help identify which unit is in failed state
  • cost-flow — cost-monitor Discord alerts follow the same response flow
  • auth-chain-map — webhook handler failures (failed auth) are a common incident type

See also

  • CLAUDE.md — G-FAILED-SERVICE-MTTR gate (Amendment §A1 cascade-failure gates)
  • CLAUDE.md — Hook Lifecycle (on-post-tool.sh audit logger, on-stop.sh bridge writer)
  • SESSION-AUDIT.md — OPEN ISSUES section where incidents are tracked
  • CHANGELOG.md — where resolution entries are written (must stay < 14 days stale)
  • AUDIT-LOG.md — fine-grained tool-use audit trail per session
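
The 14-day freshness rule for CHANGELOG.md (G-GOVERNANCE-LOG-FRESHNESS) can be illustrated with a short Python sketch. It assumes that "touched" means the file's modification time; the real gate may measure something else, such as the latest commit or the newest dated entry. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Limit taken from G-GOVERNANCE-LOG-FRESHNESS: the log must stay < 14 days stale.
FRESHNESS_LIMIT = timedelta(days=14)


def is_stale(last_touched: datetime, now: datetime,
             limit: timedelta = FRESHNESS_LIMIT) -> bool:
    """True once the age reaches the limit, i.e. the log is no longer
    strictly less than 14 days stale."""
    return now - last_touched >= limit


def changelog_age(path: Path, now: datetime) -> timedelta:
    """Age of the file based on its mtime (assumption: mtime stands in for
    'touched'; the actual gate may use a different signal)."""
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return now - mtime


if __name__ == "__main__":
    changelog = Path("CHANGELOG.md")
    if changelog.exists():
        now = datetime.now(timezone.utc)
        age = changelog_age(changelog, now)
        print("stale" if age >= FRESHNESS_LIMIT else "fresh", age)
```

Running a check like this at the end of an incident is one way to confirm that the resolution entry also satisfied the freshness gate.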