Incident Response Flow

This map traces a failed service from detection through resolution to its changelog entry. Read it when a service goes down, when a Discord ops alert fires, or when debugging why the automated alert did not fire after a service spent more than 24 hours in the failed state (G-FAILED-SERVICE-MTTR requires remediation within 24h). Documenting the full flow keeps silent failures from persisting.

Diagram

flowchart LR
    A[Service enters failed state] --> B{Daily cron check}
    B -->|> 24h in failed| C[Discord #ops alert]
    B -->|< 24h| B
    C --> D[SESSION-AUDIT.md OPEN ISSUES]
    D --> E{Henry triages}
    E -->|investigate| F["/gsd:debug skill"]
    E -->|simple fix known| G[Direct fix]
    F --> H[Root cause identified]
    G --> H
    H --> I{Fixable?}
    I -->|yes| J[Apply fix]
    I -->|no - disable| K[systemctl disable unit]
    I -->|no - archive| L[Archive per feedback_archive_not_delete]
    J --> M[Smoke test / verify]
    K --> M
    L --> M
    M -->|pass| N[CHANGELOG.md entry]
    M -->|fail| F
    N --> O[Close OPEN ISSUE in SESSION-AUDIT.md]
    O --> P[Post-tool audit log via on-post-tool.sh]
    P --> Q[Resolved]
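
The detection branch of the diagram (daily check, alert only once a unit has been failed for more than 24h) could be sketched in Python as follows. This is a minimal sketch, not the actual cron job: the function names, the first_seen map used to measure how long a unit has been failed, and the column-based parsing are all assumptions made for illustration. Posting to Discord is left out.

```python
from datetime import datetime, timedelta

ALERT_AFTER = timedelta(hours=24)  # threshold taken from the "> 24h in failed" edge


def parse_failed_units(listing):
    """Extract unit names from `systemctl --user list-units --state=failed` output."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        # Without --plain, systemctl may prefix failed rows with a marker glyph.
        if fields and fields[0] in ("●", "×", "*"):
            fields = fields[1:]
        # Keep rows whose first column looks like a unit name and whose
        # state columns contain "failed"; this skips the header and footer.
        if fields and "." in fields[0] and "failed" in fields[1:4]:
            units.append(fields[0])
    return units


def overdue_units(failed_now, first_seen, now):
    """Update first_seen in place; return units failed for longer than ALERT_AFTER.

    first_seen is a hypothetical {unit: datetime} map kept between runs.
    """
    # Forget units that recovered, so a later failure restarts the clock.
    for unit in list(first_seen):
        if unit not in failed_now:
            del first_seen[unit]
    overdue = []
    for unit in failed_now:
        started = first_seen.setdefault(unit, now)
        if now - started > ALERT_AFTER:
            overdue.append(unit)
    return overdue
```

In a real job, the listing would come from running systemctl --user list-units --state=failed, and first_seen would need to be persisted between runs (for example in a small JSON file), since each cron invocation starts with no memory of the previous one.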

How to read this

A daily cron job checks systemctl --user list-units --state=failed; any service that has been failed for more than 24h triggers a Discord #ops alert (the G-FAILED-SERVICE-MTTR gate).
SESSION-AUDIT.md OPEN ISSUES is the primary tracking surface: every incident must have an entry here so that it survives context loss between sessions.
The /gsd:debug skill applies a scientific-method debugging workflow with persistent session state. Invoke it for any multi-step investigation to avoid losing context mid-debug.
There are three resolution paths: (a) fix and verify, (b) explicitly disable the unit, or (c) archive per feedback_archive_not_delete.md. A failed service may NOT be left indefinitely without choosing one of these.
The on-post-tool.sh hook writes every tool use to the audit log at /home/opsadmin/.openclaw/logs/claude-code-audit.log, so the incident timeline can be reconstructed from this log.
CHANGELOG.md must be touched within 14 days (G-GOVERNANCE-LOG-FRESHNESS); every incident resolution is a natural occasion for an entry.
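
As an illustration of the 14-day rule, the check below treats "touched" as the file's modification time. That interpretation is an assumption: the real G-GOVERNANCE-LOG-FRESHNESS gate may look at commit history or something else, and the function names here are hypothetical.

```python
import os
import time

MAX_AGE_DAYS = 14  # window stated for G-GOVERNANCE-LOG-FRESHNESS
SECONDS_PER_DAY = 86400


def days_since_modified(path, now=None):
    """Return the age of the file's last modification, in days."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / SECONDS_PER_DAY


def changelog_is_fresh(path, now=None):
    """True if the file was modified within the last MAX_AGE_DAYS days."""
    return days_since_modified(path, now) <= MAX_AGE_DAYS
```

Calling changelog_is_fresh("CHANGELOG.md") from the same daily job would flag a stale changelog before the window lapses; the now parameter exists only so the check can be exercised with a fixed clock.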

See also

governance-gates-network: shows G-FAILED-SERVICE-MTTR in the full gates network alongside G-GOVERNANCE-LOG-FRESHNESS
ports-topology: service ports help identify which unit is in failed state
cost-flow: cost-monitor Discord alerts follow the same response flow
auth-chain-map: webhook handler failures (failed auth) are a common incident type
