SPLIT-TEST-PROTOCOL — Messaging Vendor Randomized A/B/C/D Test v1

Purpose: This document defines the protocol for the Phase 2 controlled split test that determines which vendor(s) replace or supplement SalesMsg for RERI’s high-volume SMS campaigns. It is the canonical authority for group assignment, invariants, statistical planning, and result interpretation. The bulk_sms_split_test skill implements this protocol.

Why we split-test

RERI sends 700K+ outbound SMS segments per month at $0.022/segment via SalesMsg. Direct-carrier alternatives (Telnyx, Bandwidth) offer $0.007-$0.008/segment blended, and sent.dm offers per-contact pricing that wins at high drip ratios (breakeven vs Telnyx: 3.75x drip per contact per month). A 65% cost reduction is available at current volume.
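As a rough check on the pricing figures above, the sketch below recomputes monthly spend and the implied reduction at both ends of the direct-carrier range. The 700,000 volume is the stated lower bound, so actual spend would be higher; the function names are illustrative only.

```python
from decimal import Decimal

MONTHLY_SEGMENTS = 700_000          # stated lower bound ("700K+")
SALESMSG_RATE = Decimal("0.022")    # USD per segment, from the text
DIRECT_LOW = Decimal("0.007")       # USD per segment, blended (low end)
DIRECT_HIGH = Decimal("0.008")      # USD per segment, blended (high end)


def monthly_cost(rate: Decimal, segments: int = MONTHLY_SEGMENTS) -> Decimal:
    """Monthly spend in USD at a flat per-segment rate."""
    return rate * segments


def reduction(new_rate: Decimal, old_rate: Decimal = SALESMSG_RATE) -> Decimal:
    """Fractional cost reduction when moving from old_rate to new_rate."""
    return 1 - new_rate / old_rate


if __name__ == "__main__":
    print("SalesMsg monthly:", monthly_cost(SALESMSG_RATE))
    for rate in (DIRECT_LOW, DIRECT_HIGH):
        print(rate, monthly_cost(rate), round(reduction(rate), 3))
```

At these rates the reduction falls between roughly 64% and 68%, which brackets the 65% figure.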

But the decision cannot be made on price alone. Delivery rate, reply rate, qualified reply rate, and vendor reliability all affect actual ROI. This split test produces the evidence to make the decision correctly.

Pre-conditions for running the split test

All of the following must be true before the split test script accepts --live mode:

Phase 1 smoke test evidence packet exists for every finalist vendor (workspace/reports/vendor-evidence/<vendor>-phase1-<date>.md with 11 PASS markers)
All 11 gate checks (see NO-SEND-GATE-v1.md) are PASS for every finalist vendor
Henry has issued Approval C (written approval in SESSION-AUDIT.md or Discord #ops)
The split test run ID is registered in messaging_campaigns before any sends
Population query has been reviewed and produces the expected N per group (dry-run first)
Rollback monitor timer is active before first send (messaging-rollback-monitor.timer enabled)

Group structure

Groups are finalists only. The exact vendor-to-group mapping is set at Approval C time, based on Phase 1 evidence. Expected finalist count: 2-4. Exact group count adjusts accordingly.

Base structure (4 finalists):

Group	Vendor	Role	Target N
0	SalesMsg	Control	2500-5000
1	Finalist 1	Treatment A	2500-5000
2	Finalist 2	Treatment B	2500-5000
3	Finalist 3	Treatment C	2500-5000
4	Finalist 4 (conditional)	Treatment D	2500-5000
5	(no vendor)	Holdout	500

If 2 finalists advance: groups 0 (control), 1 (treatment A), 2 (treatment B), 3 (holdout). N ~3500/group.

If 3 finalists advance: groups 0-3 + holdout (group 4). N ~2800/group.

Holdout group: receives no send. Used to measure baseline reply/interest rate from contacts who were not messaged. Holdout N is fixed at 500 regardless of total group count.

Group assignment algorithm

Assignments are deterministic and pre-computed before any sends. A contact assigned to Group 1 remains in Group 1 for the entire run window. No re-assignment.

group = hash(phone_e164 + run_id) % N_groups

phone_e164: contact’s E.164 phone number (normalized)
run_id: UUID generated at run registration, stored in messaging_campaigns
N_groups: total number of groups including holdout (e.g. 5 for 3 finalists + control + holdout)
Hash function: SHA-256, take first 4 bytes as unsigned int

Why phone + run_id: hashing the phone number alone would place each contact in the same group on every run, so carry-over effects from earlier runs would confound later ones. Including run_id re-randomizes assignments per run.
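A minimal sketch of the assignment rule, for illustration. The protocol specifies SHA-256 and "first 4 bytes as unsigned int" but not the byte order, the string encoding, or whether a separator sits between the phone number and run_id; big-endian, UTF-8, and plain concatenation are assumptions here, and the function name is hypothetical.

```python
import hashlib


def assign_group(phone_e164: str, run_id: str, n_groups: int) -> int:
    """Return a deterministic group index in [0, n_groups)."""
    # Assumption: plain concatenation with no separator, UTF-8 encoded.
    key = (phone_e164 + run_id).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # First 4 bytes as an unsigned int; big-endian is an assumption.
    value = int.from_bytes(digest[:4], byteorder="big", signed=False)
    return value % n_groups


if __name__ == "__main__":
    run_id = "00000000-0000-4000-8000-000000000001"  # hypothetical UUID
    phone = "+14155550100"                            # fictional number
    group = assign_group(phone, run_id, 5)
    # The same inputs always map to the same group for the whole run window.
    assert group == assign_group(phone, run_id, 5)
    print(group)
```

A plain modulo yields roughly equal-sized groups, so how the holdout is held to 500 contacts is not covered by this sketch.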

Dry-run check: before any live sends, run --dry-run=plan to output the full group assignment table. Review for: expected N per group (within 10% of target), no duplicate assignments, no suppressed contacts in send groups.
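The three review points in the dry-run check could be automated along these lines. This is a sketch only: the row shape, the `targets` mapping, and the function name are assumptions, not the actual output format of --dry-run=plan.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def review_plan(
    assignments: Iterable[Tuple[str, int]],
    suppressed: Set[str],
    targets: Dict[int, int],
    holdout_group: int,
    tolerance: float = 0.10,
) -> List[str]:
    """Return a list of problems found in a dry-run assignment table.

    assignments: (phone_e164, group) rows (hypothetical shape).
    targets: expected N per group, keyed by group index.
    """
    rows = list(assignments)
    problems: List[str] = []

    # 1. No contact may appear more than once.
    phone_counts = Counter(phone for phone, _ in rows)
    dupes = sorted(p for p, c in phone_counts.items() if c > 1)
    if dupes:
        problems.append(f"duplicate assignments: {dupes}")

    # 2. Each group's size must be within tolerance of its target.
    sizes = Counter(group for _, group in rows)
    for group, target in sorted(targets.items()):
        actual = sizes.get(group, 0)
        if abs(actual - target) > tolerance * target:
            problems.append(f"group {group}: N={actual}, target={target}")

    # 3. Suppressed contacts must not appear in any send group.
    leaked = sorted({p for p, g in rows if g != holdout_group and p in suppressed})
    if leaked:
        problems.append(f"suppressed contacts in send groups: {leaked}")

    return problems
```

An empty list means the plan passed all three checks; anything else should block the move to a live run until reviewed.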

Campaign invariants

The following are locked for the Phase 2 split test. Changing any invariant requires a new run_id and a full re-review:

Invariant	Value	Reason
Campaign type	`cash_buyer_rebroadcast`	Highest volume, highest drip ratio (10.8x March 2026); most sensitive to per-contact pricing
State	CA only (first run)	Largest single state; most comparable pool; easiest compliance validation
Segment	`cash_buyer_warm`	`investorbase_buyers.score >= 70`; avoids cold contacts that inflate opt-out rates
Template	ONE per group, variant-controlled	No template variation within a group; one change at a time
Time window	Tuesday 11:00 PT, single-hour burst	Peak engagement window per RERI historical data; avoids Friday fatigue
Cadence	1 message per contact per run	No drip within the test window; clean per-contact cost accounting
Reply window	72 hours from send	Standard engagement window; consistent across groups

Primary metric

Cost per qualified reply = vendor_cost_cents(group) / qualified_replies(group)

A qualified reply is defined as all three:

Positive sentiment (not a complaint, not a hostile reply, not STOP)
cash_buyer_interested intent (set by the acquisitions agent in omni_events.metadata)
Received within 72 hours of the send

This metric captures the full economics: a vendor with a higher per-segment cost can still win if it generates enough additional qualified replies, and a vendor that is cheaper per segment but has a 30% lower qualified-reply rate may not be the better choice.
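The definition above can be sketched as a predicate plus a ratio. The `Reply` record and its sentiment labels are hypothetical stand-ins for whatever the acquisitions agent writes to omni_events.metadata; only the three conditions and the 72-hour window come from this protocol.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

REPLY_WINDOW = timedelta(hours=72)


@dataclass
class Reply:
    sent_at: datetime       # when the outbound message was sent
    received_at: datetime   # when the inbound reply arrived
    sentiment: str          # hypothetical labels: "positive", "negative", "stop"
    intent: Optional[str]   # e.g. "cash_buyer_interested"


def is_qualified(reply: Reply) -> bool:
    """True only when all three conditions hold."""
    elapsed = reply.received_at - reply.sent_at
    return (
        reply.sentiment == "positive"
        and reply.intent == "cash_buyer_interested"
        and timedelta(0) <= elapsed <= REPLY_WINDOW
    )


def cost_per_qualified_reply(
    vendor_cost_cents: float, replies: Iterable[Reply]
) -> Optional[float]:
    """Cents per qualified reply, or None when the group has none."""
    qualified = sum(1 for r in replies if is_qualified(r))
    if qualified == 0:
        return None  # undefined; report explicitly instead of dividing by zero
    return vendor_cost_cents / qualified
```

Returning None for a group with zero qualified replies is a design choice of this sketch; the protocol does not say how that case is reported.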

Cost inputs per group:

SalesMsg: $0.022/segment (Henry’s quoted rate)
Telnyx, Bandwidth: messaging_outbound_messages.cost_cents from DLR (populated by gate check G11)
sent.dm: $0.015 per active contact (confirmed 2026-04-24)
Bird: $0.008-$0.015/segment (blended; exact from DLR)
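Because the cost inputs mix three pricing models (flat per segment, per active contact, and per-message cost from DLRs), the numerator of the primary metric has to be computed differently per vendor. A sketch under those assumptions follows; the vendor keys and the function signature are hypothetical, and the rates are the ones listed above.

```python
from decimal import Decimal
from typing import Iterable

SALESMSG_CENTS_PER_SEGMENT = Decimal("2.2")  # $0.022/segment
SENTDM_CENTS_PER_CONTACT = Decimal("1.5")    # $0.015 per active contact


def group_cost_cents(
    vendor: str,
    segments: int = 0,
    active_contacts: int = 0,
    dlr_cost_cents: Iterable[Decimal] = (),
) -> Decimal:
    """Total vendor cost for one group, in cents."""
    if vendor == "salesmsg":
        # Flat quoted rate per segment.
        return SALESMSG_CENTS_PER_SEGMENT * segments
    if vendor == "sent.dm":
        # Per-contact pricing: segment count does not enter the cost.
        return SENTDM_CENTS_PER_CONTACT * active_contacts
    # Telnyx, Bandwidth, Bird: sum the per-message cost_cents reported in DLRs.
    return sum((Decimal(c) for c in dlr_cost_cents), Decimal(0))
```

Decimal is used here so that fractional-cent rates sum without floating-point drift.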

Secondary metrics

All secondary metrics are measured per group over the 72h reply window:

Metric	Formula	Source
Delivery rate	`delivered / attempted`	`messaging_delivery_events`
Reply rate	`inbound_within_72h / delivered`	`messaging_inbound_messages`
Qualified reply rate	`qualified_replies / delivered`	`omni_events` (tagged by acq agent)
Opt-out rate	`stop_events / delivered`	`messaging_suppression_events`
Failure rate	`failed / attempted`	`messaging_outbound_messages.status`
Webhook reliability	`events within 60s / total events`	`webhook_audit_log.occurred_at` vs `messaging_delivery_events.occurred_at`
Support response time	First response to ticket (days)	Manual tracking per vendor
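The count-based rows of the table reduce to simple ratios. The sketch below assumes the per-group counts have already been pulled from the listed source tables; the function and key names are illustrative. Webhook reliability and support response time are omitted because they need event timestamps and manual tracking rather than counts.

```python
from typing import Dict, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def secondary_metrics(
    attempted: int,
    delivered: int,
    failed: int,
    inbound_within_72h: int,
    qualified_replies: int,
    stop_events: int,
) -> Dict[str, Optional[float]]:
    """Per-group secondary metrics, using the denominators from the table."""
    return {
        "delivery_rate": _ratio(delivered, attempted),
        "reply_rate": _ratio(inbound_within_72h, delivered),
        "qualified_reply_rate": _ratio(qualified_replies, delivered),
        "opt_out_rate": _ratio(stop_events, delivered),
        "failure_rate": _ratio(failed, attempted),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(secondary_metrics(2500, 2400, 100, 120, 30, 24))
```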

Statistical plan

Sample size: N=2500 contacts per group yields:

MDE (minimum detectable effect) for delivery rate: 2 percentage points at 80% power
MDE for reply rate: 1 percentage point at 80% power
MDE for cost-per-qualified-reply: 20% relative difference at 80% power
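The MDE figures depend on baseline rates that this document does not state. The sketch below uses the standard normal-approximation formula for two equal-sized groups with a two-sided test, so the figures can be re-derived once baselines are known. The 95% baseline delivery rate in the example is an assumption for illustration, not a measured RERI value.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_proportions(
    baseline: float,
    n_per_group: int,
    alpha: float = 0.05,
    power: float = 0.80,
) -> float:
    """Approximate minimum detectable absolute difference in a proportion.

    Normal approximation, two-sided test, equal group sizes, and the same
    variance assumed in both groups (evaluated at the baseline rate).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    standard_error = sqrt(2 * baseline * (1 - baseline) / n_per_group)
    return (z_alpha + z_power) * standard_error


if __name__ == "__main__":
    # Assumed baseline delivery rate of 95%, N = 2500 per group.
    print(mde_two_proportions(0.95, 2500))                # unadjusted alpha
    print(mde_two_proportions(0.95, 2500, alpha=0.0125))  # Bonferroni alpha
```

Under that assumed baseline, the result is roughly 1.7 percentage points at alpha = 0.05 and roughly 2.1 at alpha = 0.0125, which is in line with the 2-point figure for delivery rate.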

Multiple comparisons: Bonferroni correction applied across all vendor-vs-control contrasts. With 4 treatments, alpha = 0.05 / 4 = 0.0125 per test.

Primary analysis: two-proportion z-test for delivery rate and reply rate; t-test on log(cost_per_qualified_reply). Report both raw p-values and Bonferroni-corrected p-values.
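For the proportion metrics, the analysis described above can be sketched with the standard library alone. This covers only the pooled two-proportion z-test and the Bonferroni adjustment; the t-test on log cost is not shown. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist
from typing import List, Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        # Both groups are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-corrected p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Made-up delivered counts: one treatment vs control, 2500 attempted each.
    z, p = two_proportion_z_test(2400, 2500, 2350, 2500)
    print(z, p, bonferroni([p, 0.2, 0.5, 0.9]))
```

Comparing a corrected p-value to 0.05 is equivalent to comparing the raw p-value to 0.05 divided by the number of contrasts, which is why both can be reported side by side.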

Intent-to-treat: contacts assigned to a group who did not receive the message (e.g. failed delivery) remain in the denominator for their group. This avoids selection bias from delivery failures.

Scorecard advancement criteria

A vendor advances from Phase 2 to Phase 3 consideration only if ALL of:

Overall weighted score >= 65 (see VENDOR-SCORECARD-v1.md for weights)
All 11 gate checks remained PASS throughout the Phase 2 window (no regression)
No rollback trigger fired for this vendor during Phase 2 (see ROLLBACK-TRIGGERS-v1.md)
Delivery rate >= 90% (floor; below this is a vendor health issue, not a test result)
Opt-out rate <= 2x the SalesMsg control group opt-out rate (ceiling; above this indicates a content or timing issue introduced by the vendor change)
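The five criteria combine with a logical AND, which a short check makes explicit. The sketch returns the list of failed criteria rather than a bare boolean so a scorecard can show why a vendor did not advance; the function and parameter names are hypothetical, and the thresholds are the ones listed above.

```python
from typing import List


def failed_criteria(
    weighted_score: float,
    all_gates_pass: bool,
    rollback_fired: bool,
    delivery_rate: float,
    opt_out_rate: float,
    control_opt_out_rate: float,
) -> List[str]:
    """Return the advancement criteria a vendor fails (empty list = advances)."""
    failures: List[str] = []
    if weighted_score < 65:
        failures.append("weighted score below 65")
    if not all_gates_pass:
        failures.append("gate check regression during Phase 2")
    if rollback_fired:
        failures.append("rollback trigger fired")
    if delivery_rate < 0.90:
        failures.append("delivery rate below 90%")
    if opt_out_rate > 2 * control_opt_out_rate:
        failures.append("opt-out rate above 2x control")
    return failures


def advances(**criteria) -> bool:
    """True only when every criterion is met."""
    return not failed_criteria(**criteria)
```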

Run sequence

Population query (dry-run): --dry-run=plan — outputs group assignment table + N per group + suppression exclusions. Review before proceeding.
Sandbox pass: --dry-run=sandbox — routes all sends through vendor sandbox endpoints. Confirms API integration end-to-end. Produces sandbox evidence packet.
Pre-send gate check: compliance gate validates every contact in every send group: suppression, cooldown, quiet hours, dedup. Any failure blocks that contact’s send (not the whole run).
Send loop (--live): sends proceed group by group. Control (SalesMsg) sends first, then treatments in order. Full run completes within 60 minutes.
72h reply collection window: rollback monitor checks every 5 minutes via messaging-rollback-monitor.timer.
Scorecard generation: run tools/vendor-scorecard-generator.js --run-id=<uuid> after 72h window closes.
Henry review: present scorecard + raw evidence to Henry. Approval D required before Phase 3 planning.
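The pre-send gate step blocks individual contacts rather than the whole run. A minimal sketch of that behavior follows, assuming each compliance check can be expressed as a predicate over a contact record; the record fields and check names in the example are hypothetical and do not describe the real compliance gate.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Contact = Dict[str, object]        # hypothetical contact record
Check = Callable[[Contact], bool]  # returns True when the contact passes


def apply_gate(
    contacts: Iterable[Contact], checks: Dict[str, Check]
) -> Tuple[List[Contact], List[Tuple[Contact, List[str]]]]:
    """Split contacts into sendable and blocked; a failure never aborts the run."""
    sendable: List[Contact] = []
    blocked: List[Tuple[Contact, List[str]]] = []
    for contact in contacts:
        failed = [name for name, check in checks.items() if not check(contact)]
        if failed:
            blocked.append((contact, failed))  # keep reasons for the audit trail
        else:
            sendable.append(contact)
    return sendable, blocked


if __name__ == "__main__":
    # Hypothetical fields standing in for suppression and cooldown state.
    checks = {
        "suppression": lambda c: not c["suppressed"],
        "cooldown": lambda c: not c["in_cooldown"],
    }
    contacts = [
        {"phone": "+14155550100", "suppressed": False, "in_cooldown": False},
        {"phone": "+14155550101", "suppressed": True, "in_cooldown": False},
    ]
    print(apply_gate(contacts, checks))
```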

Rollback during Phase 2

If any rollback trigger fires (see ROLLBACK-TRIGGERS-v1.md), the affected vendor’s group is halted immediately. SalesMsg control continues. Halted sends are logged in messaging_outbound_messages with status = 'halted_rollback'. The rollback trigger report is written to workspace/reports/rollback-triggers/<date>-trigger-N-<vendor>.md.

Rollback does NOT invalidate the entire run. Other vendor groups continue. The halted vendor’s partial data is still included in the scorecard with a “HALTED” flag.

Evidence artifacts to produce

At the end of Phase 2, the following must exist before Approval D is requested:

workspace/reports/split-test/<run-id>-results.md — full scorecard + statistical results
workspace/reports/split-test/<run-id>-group-assignments.csv — all contacts + group assignments
workspace/reports/vendor-evidence/<vendor>-phase2-<date>.md — per-vendor evidence packet
messaging_outbound_messages rows with split_test_run_id = <run-id> and cost_cents populated
messaging_inbound_messages rows for all replies in the 72h window
messaging_delivery_events rows for all DLRs

Limitations and known risks

Single state (CA): Phase 2 results may not generalize to FL, TX, AZ, NV, GA sends. Phase 3 (out of scope) would validate other states.
Template effect: if the control template underperforms due to SalesMsg formatting constraints, vendor cost differences may be overstated. Mitigated by using the same template body across groups where possible.
Holdout confound: holdout contacts who independently reply or call are not captured as “qualified replies” attributable to the send. Holdout group is used only to measure baseline activity.
Vendor outage during test: a vendor outage during Phase 2 triggers the rollback monitor but does not pause other groups. Partial group data is included with caveats.
10DLC campaign approval timing: Phase 2 cannot start until G2 (10DLC campaign) is PASS for every finalist vendor. Do not schedule Phase 2 until all approval timelines are confirmed.

Change log

Date	Change	Author
2026-04-24	Initial document created from plan Section C	Claude Code

Version: v1
Owner: Henry Hill
Last updated: 2026-04-24
Sourced from: messaging-vendor-phase-0-1-2026-04-23 plan Section C
