ROLLBACK-TRIGGERS — Messaging Vendor Phase 2 Rollback Trigger Definitions v1
Purpose: This document defines the 7 automated rollback triggers that run every 5 minutes during the Phase 2 split test. If any trigger fires for a vendor, that vendor’s send group is halted immediately. The SalesMsg control group continues. This document is the canonical authority; the messaging-rollback-monitor.timer systemd unit enforces it.
How the rollback monitor works
The messaging-rollback-monitor.timer runs every 5 minutes throughout the Phase 2 72-hour reply window. It queries the live data in messaging_outbound_messages, messaging_delivery_events, messaging_inbound_messages, and messaging_suppression_events, then evaluates each trigger condition per active vendor.
If a trigger fires:
- The vendor’s remaining queued messages are marked status = 'halted_rollback' in messaging_outbound_messages
- A rollback report is written to workspace/reports/rollback-triggers/<date>-trigger-N-<vendor>.md
- An alert is posted to Discord #ops
- The Phase 2 scorecard marks the vendor with HALTED status
Halted groups do NOT invalidate the rest of the run. Other vendor groups continue. Partial data from the halted group is included in the final scorecard with caveats.
Rollback does NOT mean “vendor is permanently eliminated.” It means the automated safeguard fired and Henry must review before any further sends on that vendor.
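The first halt action above can be sketched as a single UPDATE. This is a minimal illustration, not the monitor's actual code: it uses sqlite3 as a stand-in for whatever database backs messaging_outbound_messages, and the vendor column and the 'queued' status value are assumptions that do not appear in this document.

```python
import sqlite3


def halt_vendor_queue(conn: sqlite3.Connection, vendor: str,
                      halted_status: str = "halted_rollback") -> int:
    """Mark a vendor's still-queued messages as halted and return how many were changed.

    Assumed schema (hypothetical): messaging_outbound_messages has a `vendor`
    column and uses status = 'queued' for messages that have not been sent yet.
    """
    cur = conn.execute(
        "UPDATE messaging_outbound_messages "
        "SET status = ? "
        "WHERE vendor = ? AND status = 'queued'",
        (halted_status, vendor),
    )
    conn.commit()
    # rowcount is the number of queued rows that were halted, which is the
    # "number of sends halted" figure the rollback report needs.
    return cur.rowcount
```

Only rows still in the queue are touched, so messages already sent keep their status and stay available as partial data for the scorecard.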
The 7 rollback triggers
Trigger 1: Delivery rate below floor
Trigger ID: RT1
Condition: delivered / attempted < 0.85 for a vendor group, measured over any 30-minute rolling window with at least 50 sends attempted
Why it fires: Delivery below 85% over a sustained window indicates carrier filtering, 10DLC compliance rejection, or number reputation issues. These are systemic problems, not statistical noise.
Action: Halt remaining sends for that vendor. Continue control + other treatment groups.
Evidence to capture: delivery rate time series (5-min intervals) from messaging_delivery_events
False positive risk: Low. A genuine outage would also show up in the vendor’s status page.
Recovery path: Contact vendor support. Investigate carrier filtering. Confirm 10DLC campaign status. Do not resume without G2 re-verification.
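As a rough sketch of the RT1 condition, the check below evaluates one vendor's sends over the trailing 30 minutes. The input shape (a list of timestamp and delivered-flag pairs) and the function name are illustrative assumptions; the real monitor reads these values from messaging_delivery_events.

```python
from datetime import datetime, timedelta


def rt1_fires(sends: list[tuple[datetime, bool]], now: datetime,
              window: timedelta = timedelta(minutes=30),
              floor: float = 0.85, min_attempted: int = 50) -> bool:
    """Return True when delivered / attempted < floor inside the rolling window.

    `sends` holds (attempted_at, was_delivered) pairs for a single vendor
    (hypothetical input shape).
    """
    recent = [delivered for ts, delivered in sends if now - window <= ts <= now]
    if len(recent) < min_attempted:
        # Too few attempts in the window to separate a real problem from noise.
        return False
    return sum(recent) / len(recent) < floor
```

The minimum-volume guard matters: without it, a vendor with 3 sends and 1 failure would show a 67% delivery rate and halt on noise.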
Trigger 2: Opt-out rate spike
Trigger ID: RT2
Condition: stop_events / delivered > 1.5 * control_stop_rate for a vendor group, measured over a 1-hour rolling window with at least 200 delivered messages
Why it fires: An opt-out rate 1.5x the control indicates the vendor is doing something that annoys recipients differently from SalesMsg: different formatting, different from-number presentation, or delivery of duplicate messages.
Action: Halt remaining sends for that vendor. Preserve evidence for Henry review.
Evidence to capture: opt-out rate time series, sample of messages that triggered opt-outs, from-number used
False positive risk: Medium. Seasonal or campaign-level variation can cause opt-out spikes. Review before declaring a vendor problem.
Recovery path: Audit message content, from-number, and timing differences between control and the halted group.
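The RT2 comparison can be sketched as follows, assuming the 1-hour window counts have already been aggregated per vendor. The function and parameter names are illustrative, not taken from the monitor.

```python
def rt2_fires(stop_events: int, delivered: int, control_stop_rate: float,
              multiplier: float = 1.5, min_delivered: int = 200) -> bool:
    """Return True when the vendor's opt-out rate exceeds multiplier x the control rate.

    All inputs are assumed to cover the same 1-hour rolling window.
    """
    if delivered < min_delivered:
        # Below the minimum sample size the rate is too noisy to act on.
        return False
    return stop_events / delivered > multiplier * control_stop_rate
```

For example, with a control opt-out rate of 2%, the threshold is 3%: 7 opt-outs in 200 delivered messages (3.5%) fires, while 5 in 200 (2.5%) does not. Note that a control rate of zero makes any single opt-out fire the trigger, which is one more reason to review RT2 firings before blaming the vendor.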
Trigger 3: Duplicate send rate above threshold
Trigger ID: RT3
Condition: duplicate_sends / total_sends > 0.001 (0.1%) for a vendor group in any 30-minute window
Why it fires: Duplicate sends to the same contact within the cooldown window indicate a bug in the compliance gate or dedup logic. Even 0.1% of 5000 sends is 5 contacts receiving a duplicate, which is unacceptable.
Action: IMMEDIATE halt for that vendor. This is a compliance issue, not just a performance issue.
Evidence to capture: List of phone numbers that received duplicates, gate log entries for those sends, processed_webhook_events entries
False positive risk: Very low. Duplicates at >0.1% are almost always a code bug.
Recovery path: Identify the dedup failure point (G9 gate regression?). Fix the bug. Full gate check re-run required before resuming.
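A minimal sketch of the RT3 check, under a simplifying assumption: the input is the list of destination phone numbers for one vendor's sends in the current 30-minute window, and every repeat of a number counts as a duplicate. The real check would also need the cooldown window, which is not modelled here.

```python
from collections import Counter


def duplicate_rate(phones: list[str]) -> float:
    """Fraction of sends that repeat a number already sent to in the same window."""
    if not phones:
        return 0.0
    counts = Counter(phones)
    # Each number's first send is legitimate; every send after that is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(phones)


def rt3_fires(phones: list[str], threshold: float = 0.001) -> bool:
    """Return True when duplicate_sends / total_sends > threshold (0.1%)."""
    return duplicate_rate(phones) > threshold
```

Because the condition is a strict greater-than, exactly 5 duplicates in 5000 sends (0.1%) sits on the threshold and does not fire in this sketch; 6 duplicates does.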
Trigger 4: STOP keyword not processed
Trigger ID: RT4
Condition: Any contact who replied STOP to the vendor’s number receives a subsequent outbound message from that vendor within the Phase 2 window
Why it fires: Sending to a STOP-opted contact is a TCPA violation. This is the most serious trigger.
Action: IMMEDIATE halt for that vendor. Alert Henry directly (Discord DM, not just #ops).
Evidence to capture: The specific message IDs involved, timeline of STOP reply vs subsequent send, suppression table state at time of send
False positive risk: Extremely low if G6 gate check passed. If RT4 fires, G6 was either not run or has regressed.
Recovery path: Full audit of suppression processing pipeline. Vendor cannot resume until STOP processing is verified working end-to-end. Legal review may be warranted depending on the specific contact.
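The RT4 condition amounts to a join between STOP replies and later outbound sends. The sketch below assumes both have already been loaded for a single vendor as simple tuples; the tuple layouts and the function name are hypothetical.

```python
from datetime import datetime


def rt4_violations(stops: list[tuple[str, datetime]],
                   sends: list[tuple[str, str, datetime]]) -> list[str]:
    """Return message IDs sent to a contact after that contact replied STOP.

    stops: (phone, stop_received_at) pairs for one vendor (assumed shape).
    sends: (message_id, phone, sent_at) triples for the same vendor (assumed shape).
    """
    # Keep the earliest STOP per phone; any send after it is a violation.
    first_stop: dict[str, datetime] = {}
    for phone, ts in stops:
        if phone not in first_stop or ts < first_stop[phone]:
            first_stop[phone] = ts
    return [
        message_id
        for message_id, phone, sent_at in sends
        if phone in first_stop and sent_at > first_stop[phone]
    ]
```

A non-empty result means the trigger fires, and the returned message IDs are the first piece of evidence to capture.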
Trigger 5: Webhook delivery lag sustained above threshold
Trigger ID: RT5
Condition: Median time from send to DLR receipt > 5 minutes, sustained over a 30-minute window with at least 50 DLR events
Why it fires: Webhook lag > 5 minutes sustained means the vendor’s delivery reporting is too slow to support real-time compliance decisions (cooldown checks, STOP processing). It also inflates the reply window measurement.
Action: Pause new sends for that vendor. Investigate before resuming.
Evidence to capture: DLR latency histogram from webhook_audit_log, vendor status page at time of trigger
False positive risk: Medium. Vendor outages cause temporary lag spikes. Check vendor status page before halting.
Recovery path: Vendor webhook health check. If vendor confirms outage and it resolves, can resume after 30-minute clean window.
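A sketch of the RT5 check, assuming the send-to-DLR latencies for the trailing 30 minutes have already been extracted as a list of seconds. The function name and input shape are illustrative.

```python
from statistics import median


def rt5_fires(latencies_seconds: list[float],
              threshold_seconds: float = 300.0, min_events: int = 50) -> bool:
    """Return True when the median send-to-DLR latency exceeds the threshold."""
    if len(latencies_seconds) < min_events:
        # Fewer than 50 DLR events in the window: not enough data to judge.
        return False
    return median(latencies_seconds) > threshold_seconds
```

Using the median rather than the mean means a minority of very slow receipts will not fire the trigger on its own; more than half of the DLRs in the window must be slower than 5 minutes.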
Trigger 6: Wrong contact context delivered
Trigger ID: RT6
Condition: Any outbound message body does not contain the expected template variable substitutions (detected by pattern match on message body in messaging_outbound_messages.body)
Why it fires: Template rendering failures send contacts raw variable placeholders like {property_type} or blank fields. This is a brand/quality issue at any volume, and at 5000 sends it becomes a significant problem.
Action: Halt remaining sends for that vendor.
Evidence to capture: Sample of malformed message bodies, template name and variable map used, vendor API response at time of send
False positive risk: Low. Template rendering failures are deterministic.
Recovery path: Identify whether the failure is in the template variable population code or the vendor API. Fix and re-test with --dry-run=plan before resuming.
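One way to detect unrendered placeholders is a regular expression over the message body. The pattern below is an assumption about the template syntax, based only on the {property_type} example above; it does not detect fields that rendered as blank, which would need a per-template check.

```python
import re

# Matches single-brace placeholders such as {property_type}.
# Assumed syntax: an identifier wrapped in one pair of braces.
PLACEHOLDER = re.compile(r"\{\s*[A-Za-z_]\w*\s*\}")


def find_unrendered(body: str) -> list[str]:
    """Return every raw template placeholder left in an outbound message body."""
    return PLACEHOLDER.findall(body)


def rt6_fires(bodies: list[str]) -> bool:
    """Return True if any message body still contains a raw placeholder."""
    return any(find_unrendered(body) for body in bodies)
```

Returning the matched placeholders, not just a boolean, gives the report the sample of malformed bodies and the variable names involved.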
Trigger 7: Vendor API sustained outage
Trigger ID: RT7
Condition: Vendor API returns non-200 responses for >= 10 consecutive send attempts spanning at least 5 minutes
Why it fires: A sustained API outage during the send window means the vendor group is effectively not participating in the split test. Continued attempts would artificially inflate the failure rate in the scorecard.
Action: Pause sends for that vendor. Mark unsent contacts in the queue as status = 'halted_outage' (distinct from halted_rollback).
Evidence to capture: API error codes + response bodies from messaging_outbound_messages.metadata, vendor status page
False positive risk: Low if 10 consecutive failures across 5 minutes.
Recovery path: Monitor vendor status page. Resume sends if outage resolves within the 72h window. If outage extends > 4 hours, the vendor’s Phase 2 participation is considered void and Henry decides whether to re-run.
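The RT7 condition combines a count and a duration, which is easy to get subtly wrong. The sketch below assumes a chronological list of send attempts with their HTTP status codes; the input shape and function name are hypothetical.

```python
from datetime import datetime, timedelta


def rt7_fires(attempts: list[tuple[datetime, int]],
              min_failures: int = 10,
              min_span: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when >= min_failures consecutive non-200 responses span >= min_span.

    attempts: chronological (attempted_at, http_status) pairs for one vendor
    (assumed shape).
    """
    run_start = None
    run_length = 0
    for attempted_at, status in attempts:
        if status == 200:
            # Any success breaks the streak.
            run_start, run_length = None, 0
            continue
        if run_length == 0:
            run_start = attempted_at
        run_length += 1
        if run_length >= min_failures and attempted_at - run_start >= min_span:
            return True
    return False
```

Both conditions must hold for the same unbroken streak: ten failures in a 90-second burst do not fire, and neither does a long outage interrupted by a single success before the tenth failure.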
Trigger summary matrix
| # | Trigger | Threshold | Scope | Severity | Action |
|---|---|---|---|---|---|
| RT1 | Delivery below floor | < 85% over 30min (N >= 50) | Per-vendor group | High | Halt vendor |
| RT2 | Opt-out spike | > 1.5x control rate over 1h (N >= 200) | Per-vendor group | High | Halt vendor |
| RT3 | Duplicate send | > 0.1% in 30min | Per-vendor group | Critical | Immediate halt |
| RT4 | STOP not honored | Any subsequent send to STOP contact | Per-vendor group | Critical | Immediate halt + Henry DM |
| RT5 | Webhook lag | Median DLR > 5min over 30min (N >= 50) | Per-vendor group | Medium | Pause + investigate |
| RT6 | Wrong context | Any malformed template render | Per-vendor group | High | Halt vendor |
| RT7 | API outage | >= 10 consecutive failures spanning >= 5min | Per-vendor group | Medium | Pause + monitor |
Report format
Each trigger firing writes a report to:
workspace/reports/rollback-triggers/<YYYY-MM-DD>-trigger-<N>-<vendor>.md
Report must contain:
- Trigger ID and name
- Timestamp of firing
- Data that triggered it (exact metric values, threshold, evidence rows)
- Action taken (halted / paused)
- Number of sends halted (unsent contacts in queue)
- Recovery steps taken or required
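The report path and required fields can be sketched as below. The field names in the rendered report are illustrative choices, not a mandated layout, and "example-vendor" in the usage note is a placeholder.

```python
from datetime import date, datetime


def report_path(fired_on: date, trigger_number: int, vendor: str) -> str:
    """Build the report path using the <YYYY-MM-DD>-trigger-<N>-<vendor>.md pattern."""
    return (
        "workspace/reports/rollback-triggers/"
        f"{fired_on.isoformat()}-trigger-{trigger_number}-{vendor}.md"
    )


def render_report(trigger_id: str, trigger_name: str, fired_at: datetime,
                  trigger_data: str, action: str, sends_halted: int,
                  recovery: str) -> str:
    """Render the six required report items as a markdown bullet list."""
    lines = [
        f"- Trigger: {trigger_id} ({trigger_name})",
        f"- Fired at: {fired_at.isoformat()}",
        f"- Triggering data: {trigger_data}",
        f"- Action taken: {action}",
        f"- Sends halted: {sends_halted}",
        f"- Recovery: {recovery}",
    ]
    return "\n".join(lines) + "\n"
```

For instance, report_path(date(2026, 4, 24), 4, "example-vendor") yields workspace/reports/rollback-triggers/2026-04-24-trigger-4-example-vendor.md.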
Re-enabling a halted vendor
Before a vendor halted by an automated trigger can be re-enabled, all of the following are required:
- Written root cause analysis (what caused the trigger to fire)
- Fix implemented and tested
- Gate check re-run for the specific gate that corresponds to the trigger (e.g. RT4 requires G6 re-run)
- Henry approval to re-enable
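A vendor halted within an active Phase 2 run cannot be re-enabled for that run. Its partial data is used as-is in the scorecard, and re-enabling requires a new run_id. Vendors that were paused rather than halted (RT5, RT7) follow the recovery path listed under the trigger that paused them.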
Change log
| Date | Change | Author |
|---|---|---|
| 2026-04-24 | Initial document created from plan Section I | Claude Code |
Version: v1
Owner: Henry Hill
Last updated: 2026-04-24
Sourced from: messaging-vendor-phase-0-1-2026-04-23 plan Section I