SPLIT-TEST-PROTOCOL — Messaging Vendor Randomized A/B/C/D Test v1
Purpose: This document defines the protocol for the Phase 2 controlled split test that determines which vendor(s) replace or supplement SalesMsg for RERI’s high-volume SMS campaigns. It is the canonical authority for group assignment, invariants, statistical planning, and result interpretation. The bulk_sms_split_test skill implements this protocol.
Why we split-test
RERI sends 700K+ outbound SMS segments per month at $0.007-$0.008/segment blended, and sent.dm offers per-contact pricing that wins at high drip ratios (breakeven vs Telnyx: 3.75x drip per contact per month). A 65% cost reduction is available at current volume.
But the decision cannot be made on price alone. Delivery rate, reply rate, qualified reply rate, and vendor reliability all affect actual ROI. This split test produces the evidence to make the decision correctly.
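The breakeven figure above can be turned into a quick saving estimate. The sketch below is illustrative only: it assumes per-segment cost grows linearly with messages per contact, per-contact cost is flat, and the two are equal exactly at the 3.75x breakeven. The function name is hypothetical. Under those assumptions the fractional saving is 1 - breakeven / drip ratio.

```python
def per_contact_saving(drip_ratio: float, breakeven_ratio: float = 3.75) -> float:
    """Fractional saving of per-contact pricing relative to per-segment pricing.

    Assumes per-segment cost scales linearly with messages per contact
    (drip_ratio), per-contact cost is flat, and both costs are equal
    exactly at breakeven_ratio.
    """
    if drip_ratio <= 0:
        raise ValueError("drip_ratio must be positive")
    return 1 - breakeven_ratio / drip_ratio


# 10.8x is the March 2026 drip ratio quoted under Campaign invariants.
saving_at_current_drip = per_contact_saving(10.8)
```

At 10.8x this evaluates to roughly 0.65, which is consistent with the 65% figure, although the document does not state how that figure was derived. Below 3.75x the result is negative, meaning per-segment pricing is cheaper.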
Pre-conditions for running the split test
All of the following must be true before the split test script accepts --live mode:
- Phase 1 smoke test evidence packet exists for every finalist vendor (workspace/reports/vendor-evidence/<vendor>-phase1-<date>.md with 11 PASS markers)
- All 11 gate checks (see NO-SEND-GATE-v1.md) are PASS for every finalist vendor
- Henry has issued Approval C (written approval in SESSION-AUDIT.md or Discord #ops)
- The split test run ID is registered in messaging_campaigns before any sends
- Population query has been reviewed and produces the expected N per group (dry-run first)
- Rollback monitor timer is active before first send (messaging-rollback-monitor.timer enabled)
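The pre-conditions above amount to an all-or-nothing gate in front of --live. A minimal sketch of such a gate follows; the class, field, and function names are hypothetical, and how each flag is actually determined (file checks, database lookups, timer status) is left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LivePreconditions:
    """One flag per pre-condition; all field names are hypothetical."""
    phase1_evidence_for_all_finalists: bool
    all_gate_checks_pass: bool
    approval_c_issued: bool
    run_id_registered: bool
    population_query_reviewed: bool
    rollback_timer_enabled: bool


def unmet_preconditions(pre: LivePreconditions) -> list[str]:
    """Return the names of pre-conditions that are not satisfied."""
    return [f.name for f in fields(pre) if not getattr(pre, f.name)]


def may_run_live(pre: LivePreconditions) -> bool:
    """Live mode is allowed only when every pre-condition holds."""
    return not unmet_preconditions(pre)
```

Returning the list of unmet items, rather than a bare boolean, makes it easy to report exactly what is blocking a live run.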
Group structure
Groups are finalists only. The exact vendor-to-group mapping is set at Approval C time, based on Phase 1 evidence. Expected finalist count: 2-4. Exact group count adjusts accordingly.
Base structure (4 finalists):
| Group | Vendor | Role | Target N |
|---|---|---|---|
| 0 | SalesMsg | Control | 2500-5000 |
| 1 | Finalist 1 | Treatment A | 2500-5000 |
| 2 | Finalist 2 | Treatment B | 2500-5000 |
| 3 | Finalist 3 | Treatment C | 2500-5000 |
| 4 | Finalist 4 (conditional) | Treatment D | 2500-5000 |
| 5 | (no vendor) | Holdout | 500 |
If 2 finalists advance: groups 0 (control), 1 (treatment A), 2 (treatment B), 3 (holdout). N ~3500/group.
If 3 finalists advance: groups 0-3 + holdout (group 4). N ~2800/group.
Holdout group: receives no send. Used to measure baseline reply/interest rate from contacts who were not messaged. Holdout N is fixed at 500 regardless of total group count.
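The group numbering for 2, 3, or 4 finalists follows one pattern: control is group 0, treatments follow in order, and the holdout takes the last index. A small sketch of that mapping, with a hypothetical function name and role labels:

```python
def group_layout(n_finalists: int) -> dict[int, str]:
    """Map group index to role for a given number of finalist vendors."""
    if not 2 <= n_finalists <= 4:
        raise ValueError("expected finalist count is 2-4")
    layout = {0: "control"}
    for i in range(1, n_finalists + 1):
        # Treatment letters A, B, C, D follow the finalist order.
        layout[i] = "treatment_" + chr(ord("A") + i - 1)
    layout[n_finalists + 1] = "holdout"
    return layout
```

For example, group_layout(3) yields groups 0-3 for sends and group 4 for the holdout, matching the three-finalist case described above.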
Group assignment algorithm
Assignments are deterministic and pre-computed before any sends. A contact assigned to Group 1 remains in Group 1 for the entire run window. No re-assignment.
group = hash(phone_e164 + run_id) % N_groups
- phone_e164: contact’s E.164 phone number (normalized)
- run_id: UUID generated at run registration, stored in messaging_campaigns
- N_groups: total number of groups including holdout (e.g. 5 for 3 finalists + control + holdout)
- Hash function: SHA-256, take first 4 bytes as unsigned int
Why phone + run_id: using phone alone would assign the same contacts to the same groups across all runs, creating a learning effect. Including run_id randomizes assignments per run.
Dry-run check: before any live sends, run --dry-run=plan to output the full group assignment table. Review for: expected N per group (within 10% of target), no duplicate assignments, no suppressed contacts in send groups.
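The assignment rule can be sketched in a few lines. This is not the bulk_sms_split_test implementation: the function names are hypothetical, and the UTF-8 encoding and big-endian byte order are assumptions, since the protocol only specifies SHA-256 and the first 4 bytes as an unsigned int.

```python
import hashlib
from collections import Counter


def assign_group(phone_e164: str, run_id: str, n_groups: int) -> int:
    """Deterministically map a contact to a group index in [0, n_groups)."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    # Plain concatenation follows the formula above; UTF-8 is an assumption.
    digest = hashlib.sha256((phone_e164 + run_id).encode("utf-8")).digest()
    # First 4 bytes as an unsigned int; big-endian byte order is an assumption.
    value = int.from_bytes(digest[:4], byteorder="big", signed=False)
    return value % n_groups


def plan_group_sizes(phones: list[str], run_id: str, n_groups: int) -> Counter:
    """Count contacts per group, as a dry-run plan review might."""
    return Counter(assign_group(p, run_id, n_groups) for p in phones)
```

Note that a plain modulo spreads contacts roughly evenly across all groups, including the holdout. Reaching a fixed holdout of 500 would need an extra step that this document does not specify.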
Campaign invariants
The following are locked for the Phase 2 split test. Changing any invariant requires a new run_id and a full re-review:
| Invariant | Value | Reason |
|---|---|---|
| Campaign type | cash_buyer_rebroadcast | Highest volume, highest drip ratio (10.8x March 2026); most sensitive to per-contact pricing |
| State | CA only (first run) | Largest single state; most comparable pool; easiest compliance validation |
| Segment | cash_buyer_warm | investorbase_buyers.score >= 70; avoids cold contacts that inflate opt-out rates |
| Template | ONE per group, variant-controlled | No template variation within a group; one change at a time |
| Time window | Tuesday 11:00 PT, single-hour burst | Peak engagement window per RERI historical data; avoids Friday fatigue |
| Cadence | 1 message per contact per run | No drip within the test window; clean per-contact cost accounting |
| Reply window | 72 hours from send | Standard engagement window; consistent across groups |
Primary metric
Cost per qualified reply = vendor_cost_cents(group) / qualified_replies(group)
A qualified reply is defined as all three:
- Positive sentiment (not a complaint, not a hostile reply, not STOP)
- cash_buyer_interested intent (set by the acquisitions agent in omni_events.metadata)
- Received within 72 hours of the send
This metric captures the full economics: a vendor that generates more qualified replies could still win even if its per-segment cost is higher. Conversely, a vendor that is cheaper per segment but has a 30% lower qualified-reply rate may not be the better choice.
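A short sketch of the primary metric shows how a lower per-segment price can lose once qualified replies are taken into account. The function name and all numbers in the comments are made up for illustration and are not test results.

```python
from typing import Optional


def cost_per_qualified_reply(vendor_cost_cents: float,
                             qualified_replies: int) -> Optional[float]:
    """Cost per qualified reply in cents, or None when there are no replies."""
    if qualified_replies <= 0:
        return None  # undefined; avoid dividing by zero
    return vendor_cost_cents / qualified_replies


# Hypothetical vendor A: 2500 segments at 1.0 cent, 50 qualified replies.
vendor_a = cost_per_qualified_reply(2500, 50)
# Hypothetical vendor B: 2500 segments at 0.8 cents, 35 qualified replies
# (a 30% lower qualified-reply count).
vendor_b = cost_per_qualified_reply(2000, 35)
```

In this made-up case vendor B is 20% cheaper per segment yet costs more per qualified reply than vendor A.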
Cost inputs per group:
- SalesMsg: $0.022/segment (Henry’s quoted rate)
- Telnyx, Bandwidth: messaging_outbound_messages.cost_cents from DLR (populated by gate check G11)
- sent.dm: $0.015 per active contact (confirmed 2026-04-24)
- Bird: $0.015/segment (blended; exact from DLR)
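The cost inputs above come in three shapes: a quoted per-segment rate, a per-contact rate, and per-message costs read back from DLRs. A minimal sketch of turning each into a group total in cents follows; the helper names are hypothetical, and the example assumes one segment per message under the 1-message-per-contact invariant.

```python
from typing import Iterable


def per_segment_cost_cents(segments: int, rate_usd: float) -> float:
    """Group cost for vendors billed at a quoted rate per segment."""
    return segments * rate_usd * 100


def per_contact_cost_cents(active_contacts: int, rate_usd: float) -> float:
    """Group cost for vendors billed per active contact."""
    return active_contacts * rate_usd * 100


def dlr_cost_cents(cost_cents_rows: Iterable[float]) -> float:
    """Group cost summed from per-message cost_cents values."""
    return float(sum(cost_cents_rows))


# Illustrative group of 2500 contacts, one single-segment message each.
salesmsg_total = per_segment_cost_cents(2500, 0.022)
sentdm_total = per_contact_cost_cents(2500, 0.015)
```

Because the test sends only one message per contact, per-contact pricing does not benefit from its drip-ratio advantage here; the per-group totals reflect the test conditions rather than production economics.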
Secondary metrics
All secondary metrics are measured per group over the 72h reply window:
| Metric | Formula | Source |
|---|---|---|
| Delivery rate | delivered / attempted | messaging_delivery_events |
| Reply rate | inbound_within_72h / delivered | messaging_inbound_messages |
| Qualified reply rate | qualified_replies / delivered | omni_events (tagged by acq agent) |
| Opt-out rate | stop_events / delivered | messaging_suppression_events |
| Failure rate | failed / attempted | messaging_outbound_messages.status |
| Webhook reliability | events within 60s / total events | webhook_audit_log.occurred_at vs messaging_delivery_events.occurred_at |
| Support response time | First response to ticket (days) | Manual tracking per vendor |
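The count-based rows of the table can be computed from six per-group counts. The sketch below uses hypothetical parameter names and leaves out webhook reliability and support response time, which need event timestamps and manual tracking respectively.

```python
from typing import Optional


def _rate(numerator: int, denominator: int) -> Optional[float]:
    """Safe ratio; None when the denominator is zero."""
    return numerator / denominator if denominator else None


def secondary_metrics(attempted: int, delivered: int, failed: int,
                      inbound_within_72h: int, qualified_replies: int,
                      stop_events: int) -> dict:
    """Per-group secondary metrics using the formulas in the table above."""
    return {
        "delivery_rate": _rate(delivered, attempted),
        "reply_rate": _rate(inbound_within_72h, delivered),
        "qualified_reply_rate": _rate(qualified_replies, delivered),
        "opt_out_rate": _rate(stop_events, delivered),
        "failure_rate": _rate(failed, attempted),
    }
```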
Statistical plan
Sample size: N=2500 contacts per group yields:
- MDE (minimum detectable effect) for delivery rate: 2 percentage points at 80% power
- MDE for reply rate: 1 percentage point at 80% power
- MDE for cost-per-qualified-reply: 20% relative difference at 80% power
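For the two rate metrics, the MDE figures can be sanity-checked with the standard normal-approximation formula for two equal-sized groups. The sketch below is an approximation, not the calculation used to produce the numbers above; the baseline rates are not given in this document, so any baseline passed in is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_proportions(p_base: float, n_per_group: int,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate absolute MDE for a two-sided two-proportion comparison.

    Uses the baseline variance for both groups, which is a common
    planning simplification.
    """
    if not 0 < p_base < 1 or n_per_group <= 0:
        raise ValueError("need 0 < p_base < 1 and n_per_group > 0")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    standard_error = sqrt(2 * p_base * (1 - p_base) / n_per_group)
    return (z_alpha + z_beta) * standard_error
```

With an assumed 95% baseline delivery rate, N=2500, and a per-test alpha of 0.0125, this formula gives an MDE of roughly 2 percentage points.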
Multiple comparisons: Bonferroni correction applied across all vendor-vs-control contrasts. With 4 treatments, alpha = 0.05 / 4 = 0.0125 per test.
Primary analysis: two-proportion z-test for delivery rate and reply rate; t-test on log(cost_per_qualified_reply). Report both raw p-values and Bonferroni-corrected p-values.
Intent-to-treat: contacts assigned to a group who did not receive the message (e.g. failed delivery) remain in the denominator for their group. This avoids selection bias from delivery failures.
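A minimal sketch of the rate comparisons described above, using a pooled two-proportion z-test and a Bonferroni adjustment. Function names are hypothetical. Following the intent-to-treat rule, the n arguments are assumed to be the number of contacts assigned to each group.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("group sizes must be positive")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both groups are all-success or all-failure
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    return z, erfc(abs(z) / sqrt(2))


def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Bonferroni-corrected p-values, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Comparing a corrected p-value against 0.05 is equivalent to comparing the raw p-value against 0.05 / 4 = 0.0125 when there are four contrasts.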
Scorecard advancement criteria
A vendor advances from Phase 2 to Phase 3 consideration only if ALL of:
- Overall weighted score >= 65 (see VENDOR-SCORECARD-v1.md for weights)
- All 11 gate checks remained PASS throughout the Phase 2 window (no regression)
- No rollback trigger fired for this vendor during Phase 2 (see ROLLBACK-TRIGGERS-v1.md)
- Delivery rate >= 90% (floor; below this is a vendor health issue, not a test result)
- Opt-out rate <= 2x the SalesMsg control group opt-out rate (ceiling; above this indicates a content or timing issue introduced by the vendor change)
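The advancement criteria can be expressed as a single check that lists every criterion a vendor fails. This is a sketch with hypothetical names; the weighted score itself comes from VENDOR-SCORECARD-v1.md and is taken as an input here.

```python
def advancement_failures(weighted_score: float,
                         gate_checks_all_pass: bool,
                         rollback_trigger_fired: bool,
                         delivery_rate: float,
                         opt_out_rate: float,
                         control_opt_out_rate: float) -> list[str]:
    """Return the criteria a vendor fails; an empty list means it advances."""
    failures = []
    if weighted_score < 65:
        failures.append("weighted score below 65")
    if not gate_checks_all_pass:
        failures.append("gate check regression")
    if rollback_trigger_fired:
        failures.append("rollback trigger fired")
    if delivery_rate < 0.90:
        failures.append("delivery rate below 90%")
    if opt_out_rate > 2 * control_opt_out_rate:
        failures.append("opt-out rate above 2x control")
    return failures
```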
Run sequence
- Population query (dry-run): --dry-run=plan outputs the group assignment table, N per group, and suppression exclusions. Review before proceeding.
- Sandbox pass: --dry-run=sandbox routes all sends through vendor sandbox endpoints. Confirms API integration end-to-end. Produces sandbox evidence packet.
- Pre-send gate check: compliance gate validates every contact in every send group: suppression, cooldown, quiet hours, dedup. Any failure blocks that contact’s send (not the whole run).
- Send loop (--live): sends proceed group by group. Control (SalesMsg) sends first, then treatments in order. Full run completes within 60 minutes.
- 72h reply collection window: rollback monitor checks every 5 minutes via messaging-rollback-monitor.timer.
- Scorecard generation: run tools/vendor-scorecard-generator.js --run-id=<uuid> after 72h window closes.
- Henry review: present scorecard + raw evidence to Henry. Approval D required before Phase 3 planning.
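The pre-send gate check in the sequence above blocks individual contacts rather than the whole run. A minimal sketch of that per-contact behavior follows, with a hypothetical function name, contact shape, and check names; the real compliance gate is defined elsewhere.

```python
from typing import Callable


def partition_sendable(contacts: list[dict],
                       checks: dict[str, Callable[[dict], bool]]
                       ) -> tuple[list[dict], dict[str, list[str]]]:
    """Split contacts into sendable ones and blocked ones with reasons.

    Each check returns True when the contact passes. A failing contact is
    blocked on its own; the remaining contacts are unaffected.
    """
    sendable: list[dict] = []
    blocked: dict[str, list[str]] = {}
    for contact in contacts:
        failed = [name for name, check in checks.items() if not check(contact)]
        if failed:
            blocked[contact["phone_e164"]] = failed
        else:
            sendable.append(contact)
    return sendable, blocked
```

Keeping the failed check names per contact gives the dry-run and audit reports a reason for every exclusion.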
Rollback during Phase 2
If any rollback trigger fires (see ROLLBACK-TRIGGERS-v1.md), the affected vendor’s group is halted immediately. SalesMsg control continues. Halted sends are logged in messaging_outbound_messages with status = 'halted_rollback'. The rollback trigger report is written to workspace/reports/rollback-triggers/<date>-trigger-N-<vendor>.md.
Rollback does NOT invalidate the entire run. Other vendor groups continue. The halted vendor’s partial data is still included in the scorecard with a “HALTED” flag.
Evidence artifacts to produce
At the end of Phase 2, the following must exist before Approval D is requested:
- workspace/reports/split-test/<run-id>-results.md: full scorecard + statistical results
- workspace/reports/split-test/<run-id>-group-assignments.csv: all contacts + group assignments
- workspace/reports/vendor-evidence/<vendor>-phase2-<date>.md: per-vendor evidence packet
- messaging_outbound_messages rows with split_test_run_id = <run-id> and cost_cents populated
- messaging_inbound_messages rows for all replies in the 72h window
- messaging_delivery_events rows for all DLRs
Limitations and known risks
- Single state (CA): Phase 2 results may not generalize to FL, TX, AZ, NV, GA sends. Phase 3 (out of scope) would validate other states.
- Template effect: if the control template underperforms due to SalesMsg formatting constraints, vendor cost differences may be overstated. Mitigated by using the same template body across groups where possible.
- Holdout confound: holdout contacts who independently reply or call are not captured as “qualified replies” attributable to the send. Holdout group is used only to measure baseline activity.
- Vendor outage during test: a vendor outage during Phase 2 triggers the rollback monitor but does not pause other groups. Partial group data is included with caveats.
- 10DLC campaign approval timing: Phase 2 cannot start until G2 (10DLC campaign) is PASS for every finalist vendor. Do not schedule Phase 2 until all approval timelines are confirmed.
Change log
| Date | Change | Author |
|---|---|---|
| 2026-04-24 | Initial document created from plan Section C | Claude Code |
Version: v1
Owner: Henry Hill
Last updated: 2026-04-24
Sourced from: messaging-vendor-phase-0-1-2026-04-23 plan Section C