state-active-deals-pipeline

⚡ Execution-mode model routing (G-MODEL-ROUTING-AT-EXEC)

This skill is execution-mode. Per CLAUDE.md “MANDATORY: Execution-Mode Model Routing” and the Henry directive of 2026-05-05 (“always have a popup question box asking what model to use, with recommendations + why, not just Opus/Sonnet defaults”):

Step 1 of every invocation: Invoke the /model-selector skill BEFORE dispatching any subagent. It pops the AskUserQuestion box with candidate models (Gemma 26B / Gemma 31B / Haiku / Sonnet / Kimi / Opus) ranked by fit for the stage, with cost / latency / “why” per option. Henry picks; that model is then used for the dispatch.

Per stage recommendations the popup should default to (Henry overrides):

  • Stage 1 (IL pull via aws-mac) — Haiku (mechanical SSH+curl, no reasoning)
  • Stage 2 (audit) — Sonnet (need to interpret column meanings, sample data)
  • Stage 3 (sync) — Haiku (mechanical script run)
  • Stage 4 (Gemma enrich) — Gemma 26B for the per-deal extraction (already wired in enrich-il-deals-detail-api.js --use-gemma); Sonnet for the orchestration shell
  • Stage 5 (Google geocode) — Haiku (mechanical API loop)
  • Stage 6 (photo pull) — Haiku for orchestration; the IL HTML extraction is Mac-side
  • Stage 7 (vision analysis) — see VOICE-PB-2 plan; per-photo runs the 4-model split test, orchestration is Sonnet
  • Stage 8 (HS sync) — Haiku for batch create; Sonnet if any deal needs deduplication judgment

Opus role: orchestrator + verifier + Henry-gate. Never executes mechanical work directly. First-instance novel work (e.g. building this skill v1) is Opus; once a pattern is proven it’s a popup-selectable mechanical dispatch.
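The per-stage defaults above can be sketched as a simple lookup that feeds the popup's pre-selected option. This is a hypothetical sketch; the map name, shape, and `defaultModelFor` helper are illustrative, not part of the /model-selector skill itself:

```javascript
// Hypothetical per-stage default map feeding the model-selector popup.
// Models and "why" strings mirror the recommendation list above.
const STAGE_DEFAULTS = {
  1: { model: "haiku",     why: "mechanical SSH+curl, no reasoning" },
  2: { model: "sonnet",    why: "interpret column meanings, sample data" },
  3: { model: "haiku",     why: "mechanical script run" },
  4: { model: "gemma-26b", why: "per-deal extraction already wired via --use-gemma" },
  5: { model: "haiku",     why: "mechanical API loop" },
  6: { model: "haiku",     why: "orchestration only; IL HTML extraction is Mac-side" },
  7: { model: "sonnet",    why: "orchestration; per-photo runs the 4-model split test" },
  8: { model: "haiku",     why: "batch create; escalate to sonnet for dedup judgment" },
};

// Returns the recommended default for a stage; Henry can still override
// in the popup. Throws for stages with no mechanical default (e.g. 0).
function defaultModelFor(stage) {
  const entry = STAGE_DEFAULTS[stage];
  if (!entry) throw new Error(`no default model for stage ${stage}`);
  return entry;
}

module.exports = { STAGE_DEFAULTS, defaultModelFor };
```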

Why this skill exists

Every state’s “fresh active deals” run before this followed the same stages, but as ad-hoc commands. Henry repeatedly called out:

  • “This is the same process — should be a skill.” → V1 ships now.
  • “Stop being assumptive.” → Audit logic now distinguishes real street addresses (parsed from description, geocoded) from marketing-title strings (“Subto Rental Opportunity”, “100k Spread Daytona Flip”) that look populated but aren’t usable.
  • “Phone/email numbers are stale if Gemma hasn’t run with current prompts.” → Audit reports llm_extraction_method distribution explicitly so contact-coverage % isn’t misleading.
  • “InvestorLift Scraping — ALWAYS Via AWS Mac.” → All IL API calls (list AND detail) route through aws-mac. VPS is CloudFront-blocked.

Lessons embedded (from FL run 2026-05-05)

| Lesson | Where it shows up |
| --- | --- |
| address column being non-null ≠ usable address | audit.js checks address_from_google OR address_from_osm for real coverage; flags marketing titles |
| contact_phone being populated ≠ Gemma-extracted | audit.js reports llm_extraction_method distribution; (null) count = “stale extraction, redo” |
| has_photos boolean is unreliable across pipeline | audit.js checks drive_folder_url instead |
| IL list API has 1000-row default page on Supabase | Pull always uses count: 'exact' and chunks .in() queries by 100 |
| aws-mac may need Tailscale SSH re-auth periodically | Skill probes SSH first; surfaces auth URL if blocked |
| Home Mac Ultra (100.123.248.46) frequently down | Skill defaults to aws-mac (100.90.38.109) for all IL calls |
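The 1000-row lesson above comes down to one pattern: chunk the raw_id list before `.in()`. A minimal sketch, assuming a standard `@supabase/supabase-js` client and the column names used elsewhere in this skill:

```javascript
// Split an array into fixed-size batches (100 per the lesson above).
function chunk(items, size = 100) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Fetch acquisition_deals rows by raw_id in batches of 100 so no single
// select can hit Supabase's 1000-row default page. `supabase` is an
// assumed @supabase/supabase-js client instance.
async function fetchDealsByRawIds(supabase, rawIds) {
  const rows = [];
  for (const batch of chunk(rawIds, 100)) {
    const { data, error } = await supabase
      .from("acquisition_deals")
      .select("raw_id, address_from_google, llm_extraction_method", { count: "exact" })
      .in("raw_id", batch);
    if (error) throw error;
    rows.push(...data);
  }
  return rows;
}

module.exports = { chunk, fetchDealsByRawIds };
```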

Modes

# AUDIT ONLY — fresh API pull + honest gap report. No writes.
/state-active-deals-pipeline audit FL
/state-active-deals-pipeline audit CA
# Output: /tmp/{STATE}-active-audit.json with per-gap raw_id arrays
 
# RUN — gated execution of all stages (0-8)
/state-active-deals-pipeline run FL
/state-active-deals-pipeline run FL --gaps=insert,gemma,geocode,photos,vision,hs
# Each stage prompts for confirmation before live writes
 
# RERUN — same as RUN but with --force on all enrichment (Gemma re-extracts everything,
# photos re-pull, vision re-analyze). Use when prompts/methods changed.
/state-active-deals-pipeline rerun FL

The stages (0-8)

Stage 0 — Probe (always runs, no writes)

  • Verify aws-mac SSH (ssh aws-mac echo OK); if blocked, surface Tailscale auth URL
  • Verify IL cookie exists on aws-mac (/Users/ec2-user/openclaw/workspace/data/investorlift-cookies-raw.txt)
  • Verify cookie age (warn if >7 days; refresh via investorlift-refresh-cookies.js on Mac)

Stage 1 — Fresh IL list API pull (via aws-mac)

  • SSH aws-mac → curl IL list API with cookie → save to /tmp/investorlift-all-deals.json on Mac
  • scp back to VPS at /tmp/investorlift-all-deals.json
  • Filter to --state=XX subset, save at /tmp/il-{state}-fresh.json
  • Report: total deals fetched, state subset count, breakdown by county

Stage 2 — Honest audit (always runs after Stage 1)

  • Cross-reference fresh state raw_ids against acquisition_deals
  • For each metric, distinguish “column populated” from “really populated”:
    • Real address coverage = deals with address_from_google IS NOT NULL OR address_from_osm IS NOT NULL (NOT just address IS NOT NULL)
    • Gemma coverage = deals with llm_extraction_method IS NOT NULL (NOT contact_phone presence)
    • Photo coverage = deals with drive_folder_url IS NOT NULL (NOT has_photos boolean)
    • HS coverage = deals with hubspot_deal_id IS NOT NULL
  • Sample 3-5 records from each gap so Henry sees raw data, not just %
  • Output: console report + /tmp/{state}-active-audit.json
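The “column populated” vs “really populated” distinction above reduces to a handful of predicates over acquisition_deals rows. A minimal sketch (function name and return shape are illustrative; the column names are the ones the audit keys on):

```javascript
// Compute honest coverage percentages over an array of acquisition_deals
// rows, using the "really populated" checks from Stage 2 above.
function auditCoverage(deals) {
  const count = (pred) => deals.filter(pred).length;
  // Percentage to one decimal place; 0 for an empty input.
  const pct = (n) => (deals.length ? Math.round((n / deals.length) * 1000) / 10 : 0);
  return {
    total: deals.length,
    // Real address = geocoded, not just a marketing-title string in `address`.
    realAddressPct: pct(count((d) => d.address_from_google != null || d.address_from_osm != null)),
    // Gemma coverage = extraction method recorded, not contact_phone presence.
    gemmaPct: pct(count((d) => d.llm_extraction_method != null)),
    // Photo coverage = Drive folder written, not the has_photos boolean.
    photoPct: pct(count((d) => d.drive_folder_url != null)),
    hubspotPct: pct(count((d) => d.hubspot_deal_id != null)),
  };
}

module.exports = { auditCoverage };
```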

Stage 3 — Sync to Supabase (light writes)

  • node sync-il-api-to-supabase.js --state=XX
  • Inserts new, updates existing, marks expired (whole-table sweep — known intended behavior)
  • Verify post-sync coverage = 100% via re-query

Stage 4 — Gemma re-enrichment (all 494 if rerun mode, gap-only if run mode)

  • node enrich-il-deals-detail-api.js --state=XX --use-gemma [--force]
  • Sequential, IL rate-limited 1.5s/call → ~12 min per 100 deals
  • Refreshes: contact_phone, contact_name, contact_email, address_from_desc, financing_type, seller_motivation, closing_timeline, outreach_hook, deal_strategies, key_facts, red_flags, llm_confidence, llm_extraction_method, llm_extracted_at
  • Cost: ~$0.0001/deal → ~$0.05 for 494 deals
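The Stage 4 numbers above are easy to sanity-check. A back-of-envelope sketch (the ~7.2 s/deal figure is derived from the observed ~12 min per 100 deals, i.e. rate-limit delay plus Gemma latency, and is an assumption, not a measured constant):

```javascript
// Enrichment cost at ~$0.0001/deal, rounded to cents.
function enrichCost(nDeals, perDealUsd = 0.0001) {
  return +(nDeals * perDealUsd).toFixed(2);
}

// Wall-clock estimate: ~12 min per 100 deals observed implies ~7.2 s/deal
// (1.5 s IL rate-limit delay plus Gemma call latency).
function enrichMinutes(nDeals, secPerDeal = 7.2) {
  return Math.round((nDeals * secPerDeal) / 60);
}

module.exports = { enrichCost, enrichMinutes };
```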

Stage 5 — Google geocoding (only deals with real addresses post-Gemma)

  • For each deal where address_from_desc IS NOT NULL AND address_from_google IS NULL
  • Hit Google Geocoding API → write address_from_google + lat/lng updates
  • Cost: $0.50 per 100 deals ($5/1K)
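The Stage 5 filter-and-geocode loop can be sketched as below. This is a hedged sketch assuming Node 18+ (global `fetch`) and a `GOOGLE_MAPS_API_KEY` env var; the public Google Geocoding API endpoint and response shape are real, but the function names and update-row shape are illustrative:

```javascript
// Build a Google Geocoding API request URL for one address.
function buildGeocodeUrl(address, key) {
  const url = new URL("https://maps.googleapis.com/maps/api/geocode/json");
  url.searchParams.set("address", address);
  url.searchParams.set("key", key);
  return url.toString();
}

// Geocode only the gap deals: Gemma parsed an address but Google hasn't
// confirmed one yet, per the Stage 5 filter above. Returns update rows
// for the caller to write back to acquisition_deals.
async function geocodeGaps(deals, key = process.env.GOOGLE_MAPS_API_KEY) {
  const updates = [];
  for (const deal of deals) {
    if (deal.address_from_desc == null || deal.address_from_google != null) continue;
    const res = await fetch(buildGeocodeUrl(deal.address_from_desc, key));
    const body = await res.json();
    const hit = body.results && body.results[0];
    if (!hit) continue; // unresolvable address; leave the gap
    updates.push({
      raw_id: deal.raw_id,
      address_from_google: hit.formatted_address,
      lat: hit.geometry.location.lat,
      lng: hit.geometry.location.lng,
    });
  }
  return updates;
}

module.exports = { buildGeocodeUrl, geocodeGaps };
```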

Stage 6 — Photo pull via aws-mac

  • node il-folder-photos.js --bulk --state=XX (after MAC_HOST update from 100.123.248.46 to aws-mac)
  • SSH to aws-mac → fetch IL property HTML with cookie → extract S3 sendlift image URLs → download to /tmp/il-photos/{deal_id}/
  • Upload to GDrive (acq+dispo shared folder) → write drive_folder_url to acquisition_deals
  • ~10-30 photos per deal, ~3-4 hours for 494 deals (rate-limited)

Stage 7 — Vision analysis (NEW — uses VOICE-PB-3 batch script when ready)

  • Per photo: run the 4-model split test on a stratified sample of N photos, then the production winner on the rest
  • Models per VOICE-PB-2 plan: Gemini 2.0 Flash, Gemma 3 27B, Qwen2.5-VL 72B, Qwen3-VL 235B
  • Output: property_photos rows + acquisition_deals rollup columns
  • Cost: ~$0.40 per 1K photos on Gemini 2.0 Flash

Stage 8 — HubSpot deal+contact sync

  • For each deal where hubspot_deal_id IS NULL OR --force
  • Calls hubspot-deal-creator.js via /hubspot-deal-ingest skill
  • Creates/updates HS deal + finds-or-creates contact, associates them
  • Writes hubspot_deal_id + hubspot_contact_ids back to acquisition_deals

Composition with other skills

| Step | Calls into | Skill or script |
| --- | --- | --- |
| 1 | aws-mac IL pull | workspace/scripts/investorlift-daily-sync.sh (curl portion) |
| 3 | Sync | workspace/scripts/sync-il-api-to-supabase.js |
| 4 | Gemma enrich | workspace/scripts/enrich-il-deals-detail-api.js (uses lib/gemma-deal-extractor.js) |
| 6 | Photos | workspace/scripts/il-folder-photos.js (needs aws-mac MAC_HOST update) |
| 7 | Vision | NEW batch script (VOICE-PB-3, in progress) |
| 8 | HubSpot | /hubspot-deal-ingest skill → workspace/scripts/hubspot-deal-creator.js |

Failure modes + recovery

| Symptom | Cause | Recovery |
| --- | --- | --- |
| Tailscale SSH requires an additional check | aws-mac auth expired | Henry visits the printed https://login.tailscale.com/a/... URL once |
| IL API 403 from VPS | CloudFront IP block | Always route via aws-mac. Per CLAUDE.md hard rule. |
| IL cookie stale (>7 days) | Cookie expired | Run investorlift-refresh-cookies.js on aws-mac (Playwright auto-refresh) |
| relation "acquisition_deals" does not exist | Wrong Supabase project | Verify SUPABASE_URL = CCP project svueekfvfrvhylxygktb |
| Gemma enrich 0/N progress | Portkey proxy down OR pc-opencl-aaae2d config missing | systemctl --user status portkey-proxy + verify config |
| Photos pull SSH timeout | Mac unreachable | Confirm aws-mac (NOT 100.123.248.46); see Lessons embedded |
| HS deal create 401 | HUBSPOT_PAT rotated | Refresh PAT, update master.env |

Cost envelope (per state, assuming ~500 active deals)

| Stage | One-time | Recurring (per refresh) |
| --- | --- | --- |
| 1-3 (pull, audit, sync) | $0 | $0 |
| 4 (Gemma enrich, ~$0.0001/deal) | $0.05 | $0.05 (delta) |
| 5 (Google geocode, $5/1K) | $0.50 | $0.50 (delta) |
| 6 (photos via Mac, infra cost only) | ~$0 | ~$0 |
| 7 (vision, ~$0.40/1K on Gemini Flash) | $4 (10K photos) | $0.40 (delta) |
| 8 (HS, free) | $0 | $0 |
| Total | ~$5 | ~$1 |

Output artifacts

  • /tmp/investorlift-all-deals.json — full IL API response (all states)
  • /tmp/il-{state}-fresh.json — state-filtered subset with metadata
  • /tmp/{state}-active-audit.json — per-gap raw_id arrays + counts
  • ~/.openclaw/workspace/reports/state-pipeline-{state}-YYYY-MM-DD.md — final report

Memory cross-references

  • feedback_audit_before_architect.md — sample raw data before claiming
  • feedback_no_assumptions.md — verify column meanings, don’t infer
  • feedback_live_over_memory.md — live API > Supabase cache for “current state”
  • feedback_il_enrichment_runs_on_mac_ultra.md — IL calls always via Mac
  • reference_il_marketplace_pipeline.md — IL endpoint discovery
  • project_voice_comps_skill_2026-05-05.md — VOICE-PB-2 (vision analysis spec used in Stage 7)
  • CLAUDE.md hard rule: “InvestorLift Scraping — ALWAYS Via AWS Mac”

Trigger keywords

User says any of:

  • “fresh pull {state} active deals”
  • “redo {state} pipeline”
  • “audit {state} active deals”
  • “{state} active deals pipeline”
  • “refresh active deals end to end”
  • “rerun {state} from scratch”