SPLIT-TEST-PROTOCOL — Messaging Vendor Randomized A/B/C/D Test v1

Purpose: This document defines the protocol for the Phase 2 controlled split test that determines which vendor(s) replace or supplement SalesMsg for RERI’s high-volume SMS campaigns. It is the canonical authority for group assignment, invariants, statistical planning, and result interpretation. The bulk_sms_split_test skill implements this protocol.


Why we split-test

RERI sends 700K+ outbound SMS segments per month at $0.007-$0.008/segment blended, and sent.dm offers per-contact pricing that wins at high drip ratios (breakeven vs Telnyx: 3.75x drip per contact per month). A 65% cost reduction is available at current volume.

But the decision cannot be made on price alone. Delivery rate, reply rate, qualified reply rate, and vendor reliability all affect actual ROI. This split test produces the evidence to make the decision correctly.


Pre-conditions for running the split test

All of the following must be true before the split test script accepts --live mode:

  1. Phase 1 smoke test evidence packet exists for every finalist vendor (workspace/reports/vendor-evidence/<vendor>-phase1-<date>.md with 11 PASS markers)
  2. All 11 gate checks (see NO-SEND-GATE-v1.md) are PASS for every finalist vendor
  3. Henry has issued Approval C (written approval in SESSION-AUDIT.md or Discord #ops)
  4. The split test run ID is registered in messaging_campaigns before any sends
  5. Population query has been reviewed and produces the expected N per group (dry-run first)
  6. Rollback monitor timer is active before first send (messaging-rollback-monitor.timer enabled)

Group structure

Groups are finalists only. The exact vendor-to-group mapping is set at Approval C time, based on Phase 1 evidence. Expected finalist count: 2-4. Exact group count adjusts accordingly.

Base structure (4 finalists):

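| Group | Vendor | Role | Target N |
|---|---|---|---|
| 0 | SalesMsg | Control | 2500-5000 |
| 1 | Finalist 1 | Treatment A | 2500-5000 |
| 2 | Finalist 2 | Treatment B | 2500-5000 |
| 3 | Finalist 3 | Treatment C | 2500-5000 |
| 4 | Finalist 4 (conditional) | Treatment D | 2500-5000 |
| 5 | (no vendor) | Holdout | 500 |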

If 2 finalists advance: groups 0 (control), 1 (treatment A), 2 (treatment B), 3 (holdout). N ~3500/group.

If 3 finalists advance: groups 0-3 + holdout (group 4). N ~2800/group.

Holdout group: receives no send. Used to measure baseline reply/interest rate from contacts who were not messaged. Holdout N is fixed at 500 regardless of total group count.


Group assignment algorithm

Assignments are deterministic and pre-computed before any sends. A contact assigned to Group 1 remains in Group 1 for the entire run window. No re-assignment.

group = hash(phone_e164 + run_id) % N_groups
  • phone_e164: contact’s E.164 phone number (normalized)
  • run_id: UUID generated at run registration, stored in messaging_campaigns
  • N_groups: total number of groups including holdout (e.g. 5 for 3 finalists + control + holdout)
  • Hash function: SHA-256, take first 4 bytes as unsigned int

Why phone + run_id: using phone alone would assign the same contacts to the same groups across all runs, creating a learning effect. Including run_id randomizes assignments per run.
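
The assignment rule above can be sketched in Python as follows. This is a minimal illustration, not the bulk_sms_split_test implementation: the function name assign_group is hypothetical, and the UTF-8 encoding and big-endian byte order are assumptions, since the protocol only specifies SHA-256 and "first 4 bytes as unsigned int".

```python
import hashlib


def assign_group(phone_e164: str, run_id: str, n_groups: int) -> int:
    """Deterministically map a contact to a group index in [0, n_groups)."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    # Concatenate phone and run_id as in the protocol formula.
    # UTF-8 encoding is an assumption; the protocol does not name one.
    digest = hashlib.sha256((phone_e164 + run_id).encode("utf-8")).digest()
    # First 4 bytes as an unsigned int; big-endian is an assumption.
    value = int.from_bytes(digest[:4], byteorder="big", signed=False)
    return value % n_groups


# Example with a made-up phone number and run_id.
example_run_id = "00000000-0000-0000-0000-000000000001"
print(assign_group("+14155550100", example_run_id, 5))
```

The same inputs always produce the same group, which is what makes the assignment reproducible across dry-run and live modes. Note that a plain modulo spreads contacts roughly evenly across all groups, so a fixed holdout of 500 would need separate handling that this sketch does not attempt.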

Dry-run check: before any live sends, run --dry-run=plan to output the full group assignment table. Review for: expected N per group (within 10% of target), no duplicate assignments, no suppressed contacts in send groups.
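
The three review points in the dry-run check could be automated along these lines. This is a sketch under stated assumptions: review_plan and its parameters are hypothetical names, and the input is assumed to be a list of (phone, group) rows like those in the group assignment table; the actual --dry-run=plan output format is not defined here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def review_plan(
    rows: Iterable[Tuple[str, int]],
    target_n: Dict[int, int],
    suppressed: Set[str],
    holdout_group: int,
    tolerance: float = 0.10,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan passes."""
    rows = list(rows)
    problems: List[str] = []

    # 1. No contact may appear more than once.
    phone_counts = Counter(phone for phone, _ in rows)
    for phone, count in sorted(phone_counts.items()):
        if count > 1:
            problems.append(f"duplicate assignment: {phone} appears {count} times")

    # 2. Each group's N must be within the tolerance (10% by default) of target.
    group_sizes = Counter(group for _, group in rows)
    for group, target in sorted(target_n.items()):
        actual = group_sizes.get(group, 0)
        if abs(actual - target) > tolerance * target:
            problems.append(f"group {group}: N={actual}, target {target}")

    # 3. Suppressed contacts must not be in any group that receives a send.
    for phone, group in rows:
        if group != holdout_group and phone in suppressed:
            problems.append(f"suppressed contact in send group {group}: {phone}")

    return problems
```

Returning a list of problems rather than a single boolean keeps the review output readable when several checks fail at once.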


Campaign invariants

The following are locked for the Phase 2 split test. Changing any invariant requires a new run_id and a full re-review:

| Invariant | Value | Reason |
|---|---|---|
| Campaign type | cash_buyer_rebroadcast | Highest volume, highest drip ratio (10.8x March 2026); most sensitive to per-contact pricing |
| State | CA only (first run) | Largest single state; most comparable pool; easiest compliance validation |
| Segment | cash_buyer_warm | investorbase_buyers.score >= 70; avoids cold contacts that inflate opt-out rates |
| Template | ONE per group, variant-controlled | No template variation within a group; one change at a time |
| Time window | Tuesday 11:00 PT, single-hour burst | Peak engagement window per RERI historical data; avoids Friday fatigue |
| Cadence | 1 message per contact per run | No drip within the test window; clean per-contact cost accounting |
| Reply window | 72 hours from send | Standard engagement window; consistent across groups |

Primary metric

Cost per qualified reply = vendor_cost_cents(group) / qualified_replies(group)

A qualified reply is defined as all three:

  1. Positive sentiment (not a complaint, not a hostile reply, not STOP)
  2. cash_buyer_interested intent (set by the acquisitions agent in omni_events.metadata)
  3. Received within 72 hours of the send

This metric captures the full economics: a vendor that generates more qualified replies could still win even if its per-segment cost is higher, while a vendor that is cheaper per segment but has a 30% lower qualified-reply rate may not be the better choice.

Cost inputs per group:

  • SalesMsg: $0.022/segment (Henry’s quoted rate)
  • Telnyx, Bandwidth: messaging_outbound_messages.cost_cents from DLR (populated by gate check G11)
  • sent.dm: $0.015 per active contact (confirmed 2026-04-24)
  • Bird: $0.015/segment (blended; exact from DLR)
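
The primary metric and the three-part qualified-reply definition can be sketched as below. This is illustrative only: the function names, the "positive" sentiment label, and the argument shapes are assumptions, not the schema of omni_events or messaging_outbound_messages. The reply counts in the example are made up; the per-unit prices come from the cost inputs above, and one segment per message is assumed for SalesMsg.

```python
from datetime import datetime, timedelta
from typing import Optional

REPLY_WINDOW = timedelta(hours=72)


def is_qualified_reply(
    sentiment: str, intent: str, sent_at: datetime, received_at: datetime
) -> bool:
    """All three conditions must hold: sentiment, intent, and the 72h window."""
    within_window = timedelta(0) <= (received_at - sent_at) <= REPLY_WINDOW
    return (
        sentiment == "positive"  # hypothetical label for condition 1
        and intent == "cash_buyer_interested"
        and within_window
    )


def cost_per_qualified_reply(
    vendor_cost_cents: float, qualified_replies: int
) -> Optional[float]:
    """Cents per qualified reply; None when the group has no qualified replies."""
    if qualified_replies == 0:
        return None
    return vendor_cost_cents / qualified_replies


# Hypothetical group of 2500 contacts with 25 qualified replies each.
sent_dm_cost = 2500 * 1.5    # $0.015 per active contact, in cents
salesmsg_cost = 2500 * 2.2   # $0.022 per segment, assuming 1 segment per message
print(cost_per_qualified_reply(sent_dm_cost, 25))
print(cost_per_qualified_reply(salesmsg_cost, 25))
```

Returning None for a group with zero qualified replies avoids a division by zero and forces the scorecard to handle that case explicitly rather than reporting an infinite or zero cost.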

Secondary metrics

All secondary metrics are measured per group over the 72h reply window:

| Metric | Formula | Source |
|---|---|---|
| Delivery rate | delivered / attempted | messaging_delivery_events |
| Reply rate | inbound_within_72h / delivered | messaging_inbound_messages |
| Qualified reply rate | qualified_replies / delivered | omni_events (tagged by acq agent) |
| Opt-out rate | stop_events / delivered | messaging_suppression_events |
| Failure rate | failed / attempted | messaging_outbound_messages.status |
| Webhook reliability | events within 60s / total events | webhook_audit_log.occurred_at vs messaging_delivery_events.occurred_at |
| Support response time | First response to ticket (days) | Manual tracking per vendor |

Statistical plan

Sample size: N=2500 contacts per group yields:

  • MDE (minimum detectable effect) for delivery rate: 2 percentage points at 80% power
  • MDE for reply rate: 1 percentage point at 80% power
  • MDE for cost-per-qualified-reply: 20% relative difference at 80% power
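
MDE figures like those above depend on the baseline rate, which this protocol does not state. The sketch below shows one common way to estimate the MDE for a difference in two proportions, using the normal approximation with equal group sizes and the baseline variance in both arms. The function name and the 95% baseline in the example are illustrative assumptions, not values from RERI data.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_proportions(
    baseline: float, n_per_group: int, alpha: float = 0.05, power: float = 0.80
) -> float:
    """Approximate minimum detectable absolute difference for a two-sided test."""
    if not 0.0 < baseline < 1.0:
        raise ValueError("baseline must be strictly between 0 and 1")
    if n_per_group < 1:
        raise ValueError("n_per_group must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    # Standard error of the difference, assuming both arms share the baseline variance.
    se = sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return (z_alpha + z_beta) * se


# Illustrative: an assumed 95% baseline delivery rate, N=2500 per group.
print(mde_two_proportions(0.95, 2500))                 # unadjusted alpha
print(mde_two_proportions(0.95, 2500, alpha=0.0125))   # Bonferroni-adjusted alpha
```

Running the function with the observed SalesMsg baseline once it is known would confirm whether N=2500 is sufficient or whether the upper end of the 2500-5000 target range is needed.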

Multiple comparisons: Bonferroni correction applied across all vendor-vs-control contrasts. With 4 treatments, alpha = 0.05 / 4 = 0.0125 per test; with fewer finalists, the divisor is the actual number of contrasts.

Primary analysis: two-proportion z-test for delivery rate and reply rate; t-test on log(cost_per_qualified_reply). Report both raw p-values and Bonferroni-corrected p-values.
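
For the rate comparisons, a pooled two-proportion z-test with a Bonferroni-corrected p-value can be sketched with the standard library alone. The function names are hypothetical and the counts in the example are made up; this is not the output of tools/vendor-scorecard-generator.js, and it does not cover the t-test on log cost.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-sided raw p-value) for H0: p1 == p2, using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the z-test is undefined")
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def bonferroni(p_value: float, n_contrasts: int) -> float:
    """Bonferroni-corrected p-value, capped at 1.0."""
    return min(1.0, p_value * n_contrasts)


# Made-up counts: treatment delivered 2400/2500, control delivered 2350/2500.
z, raw_p = two_proportion_z_test(2400, 2500, 2350, 2500)
print(z, raw_p, bonferroni(raw_p, 4))
```

Comparing the corrected p-value against 0.05 is equivalent to comparing the raw p-value against 0.05 divided by the number of contrasts, so reporting both, as the plan requires, lets a reader check either way.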

Intent-to-treat: contacts assigned to a group who did not receive the message (e.g. failed delivery) remain in the denominator for their group. This avoids selection bias from delivery failures.


Scorecard advancement criteria

A vendor advances from Phase 2 to Phase 3 consideration only if ALL of:

  1. Overall weighted score >= 65 (see VENDOR-SCORECARD-v1.md for weights)
  2. All 11 gate checks remained PASS throughout the Phase 2 window (no regression)
  3. No rollback trigger fired for this vendor during Phase 2 (see ROLLBACK-TRIGGERS-v1.md)
  4. Delivery rate >= 90% (floor; below this is a vendor health issue, not a test result)
  5. Opt-out rate <= 2x the SalesMsg control group opt-out rate (ceiling; above this indicates a content or timing issue introduced by the vendor change)
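
The five criteria above can be expressed as a single check that reports which ones failed. This is a sketch: the VendorResult fields and function name are hypothetical, and criterion 5 is read here as an upper bound of 2x the control opt-out rate, which is an assumption about the intended comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VendorResult:
    weighted_score: float   # from the vendor scorecard
    gates_passed: int       # gate checks still PASS at end of Phase 2 (out of 11)
    rollback_fired: bool    # any rollback trigger fired for this vendor
    delivery_rate: float    # delivered / attempted
    opt_out_rate: float     # stop_events / delivered


def failed_criteria(v: VendorResult, control_opt_out_rate: float) -> List[int]:
    """Return the numbers of the criteria that failed; empty means the vendor advances."""
    failed: List[int] = []
    if v.weighted_score < 65:
        failed.append(1)
    if v.gates_passed < 11:
        failed.append(2)
    if v.rollback_fired:
        failed.append(3)
    if v.delivery_rate < 0.90:
        failed.append(4)
    # Assumption: criterion 5 is an upper bound of 2x the control opt-out rate.
    if v.opt_out_rate > 2 * control_opt_out_rate:
        failed.append(5)
    return failed


# Made-up example: strong score, but opt-outs are more than double control.
vendor = VendorResult(72.0, 11, False, 0.96, 0.031)
print(failed_criteria(vendor, control_opt_out_rate=0.012))
```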

Run sequence

  1. Population query (dry-run): --dry-run=plan — outputs group assignment table + N per group + suppression exclusions. Review before proceeding.
  2. Sandbox pass: --dry-run=sandbox — routes all sends through vendor sandbox endpoints. Confirms API integration end-to-end. Produces sandbox evidence packet.
  3. Pre-send gate check: compliance gate validates every contact in every send group: suppression, cooldown, quiet hours, dedup. Any failure blocks that contact’s send (not the whole run).
  4. Send loop (--live): sends proceed group by group. Control (SalesMsg) sends first, then treatments in order. Full run completes within 60 minutes.
  5. 72h reply collection window: rollback monitor checks every 5 minutes via messaging-rollback-monitor.timer.
  6. Scorecard generation: run tools/vendor-scorecard-generator.js --run-id=<uuid> after 72h window closes.
  7. Henry review: present scorecard + raw evidence to Henry. Approval D required before Phase 3 planning.

Rollback during Phase 2

If any rollback trigger fires (see ROLLBACK-TRIGGERS-v1.md), the affected vendor’s group is halted immediately. SalesMsg control continues. Halted sends are logged in messaging_outbound_messages with status = 'halted_rollback'. The rollback trigger report is written to workspace/reports/rollback-triggers/<date>-trigger-N-<vendor>.md.

Rollback does NOT invalidate the entire run. Other vendor groups continue. The halted vendor’s partial data is still included in the scorecard with a “HALTED” flag.


Evidence artifacts to produce

At the end of Phase 2, the following must exist before Approval D is requested:

  • workspace/reports/split-test/<run-id>-results.md — full scorecard + statistical results
  • workspace/reports/split-test/<run-id>-group-assignments.csv — all contacts + group assignments
  • workspace/reports/vendor-evidence/<vendor>-phase2-<date>.md — per-vendor evidence packet
  • messaging_outbound_messages rows with split_test_run_id = <run-id> and cost_cents populated
  • messaging_inbound_messages rows for all replies in the 72h window
  • messaging_delivery_events rows for all DLRs

Limitations and known risks

  1. Single state (CA): Phase 2 results may not generalize to FL, TX, AZ, NV, GA sends. Phase 3 (out of scope) would validate other states.
  2. Template effect: if the control template underperforms due to SalesMsg formatting constraints, vendor cost differences may be overstated. Mitigated by using the same template body across groups where possible.
  3. Holdout confound: holdout contacts who independently reply or call are not captured as “qualified replies” attributable to the send. Holdout group is used only to measure baseline activity.
  4. Vendor outage during test: a vendor outage during Phase 2 triggers the rollback monitor but does not pause other groups. Partial group data is included with caveats.
  5. 10DLC campaign approval timing: Phase 2 cannot start until G2 (10DLC campaign) is PASS for every finalist vendor. Do not schedule Phase 2 until all approval timelines are confirmed.

Change log

| Date | Change | Author |
|---|---|---|
| 2026-04-24 | Initial document created from plan Section C | Claude Code |

Version: v1
Owner: Henry Hill
Last updated: 2026-04-24
Sourced from: messaging-vendor-phase-0-1-2026-04-23 plan Section C