OSIL Decision Tree

Which optimizer to reach for, given a problem class.

flowchart TD
    START[Want to improve agent?] --> Q1{What's the problem?}

    Q1 -->|Repeated failures<br/>same error class| REF[Reflexion Runner<br/>captures+retrieves reflections]
    Q1 -->|Suboptimal prompt<br/>known eval set exists| DSPY[DSPy + GEPA<br/>genetic-Pareto evolution]
    Q1 -->|Need new capability<br/>repeated workflow| SKI[Skill Induction Worker<br/>Voyager pattern]
    Q1 -->|Unknown bottleneck<br/>measurable quality metric| AUTO[Karpathy autoresearch<br/>overnight loops]
    Q1 -->|Memory recall is weak| MEM{Memory tier}
    Q1 -->|Cross-session learning gap| SK[peterskoett OpenClaw skill]
    Q1 -->|Code-execution failures| CRIT[CRITIC pattern<br/>execution feedback loop]
    Q1 -->|Quality on creative output| SR[Self-Refine pattern<br/>in-episode iteration]

    MEM -->|Personalization| HON[Honcho]
    MEM -->|Production scale| M0[Mem0]
    MEM -->|Long-horizon| LET[Letta]

    DSPY -->|Eval set missing| EVAL[Build eval set first<br/>Phase 0 prerequisite]
    DSPY -->|Plateau on prompt opt| BFT[BootstrapFinetune<br/>or M1 Ultra distillation]
    DSPY -->|Want gradient signal| TG[TextGrad complement]

Decision criteria

  • Reflexion → high failure rate, similar errors recur. Cost: hourly scan + reflection LLM call.
  • DSPy + GEPA → eval set exists, prompt optimization desired. Cost: eval rollouts; GEPA reports 35x fewer than RL alternatives.
  • Skill Induction → workflow patterns repeat with no codified skill. Cost: nightly scan.
  • Autoresearch → measurable quality metric, unknown optimization space. Cost: overnight LLM compute.
  • Memory eval → retrieval precision is the bottleneck. Cost: pilot only — augment, don’t replace.
  • peterskoett skill → cross-session learning is fragmented. Cost: zero (drop-in).
  • CRITIC → agent generates code with execution failures. Cost: sandbox + refinement cycles.
  • Self-Refine → quality matters more than speed. Cost: 2-3x LLM calls per output.
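The Reflexion entry above amounts to a small loop: on failure, store a verbal reflection keyed by error class, then prepend retrieved reflections to the next attempt. A minimal sketch, where `reflect` stands in for the reflection LLM call and all names are illustrative rather than OSIL API:

```python
from collections import defaultdict

class ReflectionStore:
    """In-memory store of reflections, keyed by error class."""
    def __init__(self):
        self._by_error = defaultdict(list)

    def add(self, error_class, reflection):
        self._by_error[error_class].append(reflection)

    def retrieve(self, error_class):
        return list(self._by_error[error_class])

def reflect(task, error):
    # Placeholder for the reflection LLM call.
    return f"On '{task}', avoid the cause of: {error}"

def run_with_reflexion(task, attempt_fn, store, max_attempts=3):
    # attempt_fn(task, hints) -> (ok, error_class, error_detail)
    hints = []
    for _ in range(max_attempts):
        ok, error_class, error = attempt_fn(task, hints)
        if ok:
            return True
        # Capture a reflection for this error class, then retry with it.
        store.add(error_class, reflect(task, error))
        hints = store.retrieve(error_class)
    return False
```

The hourly-scan cost in the table corresponds to whatever process feeds `store.add`; retrieval at attempt time is cheap.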
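The CRITIC row relies on execution feedback: run the generated snippet, and if it raises, hand the traceback back to a repair call. A sketch under the assumption that `fix_fn` is the LLM repair step; the `exec`-in-a-dict "sandbox" here is for illustration only, since a real sandbox should isolate the process:

```python
import traceback

def critic_loop(code, fix_fn, max_rounds=3):
    tb = None
    for _ in range(max_rounds):
        namespace = {}
        try:
            exec(code, namespace)          # execute the candidate code
            return code, None              # success: return working code
        except Exception:
            tb = traceback.format_exc()    # concrete execution feedback
            code = fix_fn(code, tb)        # repair call proposes a new version
    return code, tb                        # give up, report last traceback
```

The sandbox + refinement cost in the table is exactly these rounds: one execution plus one repair call per failure.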
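Self-Refine is the same shape without an executor: generate a draft, have the model critique it, and revise until the critic is satisfied or the budget runs out. A sketch where `generate`, `critique`, and `revise` are stand-in LLM calls (assumed, not OSIL API); `critique` returns `None` when satisfied:

```python
def self_refine(prompt, generate, critique, revise, max_rounds=3):
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:               # critic is satisfied
            return draft
        draft = revise(draft, feedback)    # fold feedback into the next draft
    return draft                           # budget exhausted, return best draft
```

Each extra round adds one critique call and one revise call on top of generation, which is where the 2-3x LLM-call overhead in the table comes from.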