Gemma 4 — Local/Cloud LLM Delegation

Use this skill to delegate tasks to Google Gemma 4 when you want a fast, cheap, open-source model instead of Claude. Gemma 4 routes through the VPS Portkey proxy (:18900) → OpenRouter → google/gemma-4-26b-a4b-it or google/gemma-4-31b-it.

Cost: $0.06/M input (26B MoE) or $0.13/M input (31B) via OpenRouter, roughly 50× cheaper than Claude Sonnet.

When to use Gemma

Trigger on any of these:

  • Henry says “ask gemma”, “use gemma for this”, “/gemma”, “delegate to gemma”, “route to gemma”
  • High-volume bulk tasks: summarizing many documents, classifying leads, compressing RAG context
  • Science/reasoning questions where GPQA-level accuracy matters (Gemma 4 31B: 84.3% vs Claude Sonnet: 74.1%)
  • Long-context document processing (256K window)
  • Privacy-sensitive tasks you want processed by an open-weight model (Apache 2.0)
  • Draft generation for internal use (not client-facing)

When NOT to use Gemma

  • Multi-step tool chains (chained function calls fail across all Gemma 4 sizes)
  • Client-facing SMS, email, or outputs where one hallucinated name = deal-breaker
  • Real-time multi-turn conversation requiring coherent memory (E4B scored 0% on multi-turn coherence in enterprise benchmarks)
  • Questions that depend on knowledge after the January 2025 training cutoff
  • Novel architecture decisions (use Claude for those)

Three tools

| Tool         | Model           | Cost (input) | Use                                                               |
|--------------|-----------------|--------------|-------------------------------------------------------------------|
| gemma_ask    | Gemma 4 26B MoE | $0.06/M      | Quick Q&A, classification, summarization, translation, bulk drafts |
| gemma_reason | Gemma 4 31B     | $0.13/M      | Science, math, multi-step analysis, GPQA-level reasoning           |
| gemma_code   | Gemma 4 26B MoE | $0.06/M      | Boilerplate, refactoring, SQL, JSON schema, regex                  |
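The tool split above can be expressed as a small dispatch map. This is a hypothetical helper, not part of the gemma-mcp server; the task-type keys are illustrative:

```javascript
// Hypothetical dispatcher: map a task type to the cheapest suitable tool.
// Unknown task types fall back to gemma_ask, the cheap 26B MoE tool.
const TOOL_FOR = {
  summarize: "gemma_ask",
  classify: "gemma_ask",
  translate: "gemma_ask",
  reason: "gemma_reason",
  math: "gemma_reason",
  boilerplate: "gemma_code",
  sql: "gemma_code",
};

function pickGemmaTool(taskType) {
  return TOOL_FOR[taskType] ?? "gemma_ask";
}
```

The default-to-cheapest choice mirrors the cost table: only reasoning-heavy work should pay the 31B rate.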

How to invoke from Claude

# Direct delegation (Henry's natural language)
"ask gemma: <question>"             → use gemma_ask
"use gemma to reason through: <X>"  → use gemma_reason
"have gemma write the code for: <X>" → use gemma_code

# With context
gemma_ask(query="summarize this", context="<long doc content>")
gemma_reason(problem="analyze this RE deal", context="<deal data>")
gemma_code(task="write a SQL query for...", language="sql")

Routing architecture

Claude (you) → gemma_ask/reason/code MCP tool
    → gemma-mcp server (node ~/.openclaw/tools/gemma-mcp/server.js)
        → Portkey proxy (127.0.0.1:18900)
            → Portkey config pc-opencl-aaae2d (26B) or pc-opencl-d97353 (31B)
                → OpenRouter PRIMARY (google/gemma-4-26b-a4b-it, $0.06/M)
                → Google AI Studio FALLBACK (when key is renewed)

Semantic caching is active (3600s TTL) — repeated identical questions cost $0.
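The hop from the MCP server to the proxy is an ordinary OpenAI-compatible chat completion. A minimal sketch of how gemma-mcp might build that request — the endpoint path and the `x-portkey-config` header name are assumptions based on the diagram above, so verify them against your Portkey gateway version:

```javascript
// Sketch: build the request gemma-mcp would send to the local Portkey proxy.
// Assumes the gateway exposes an OpenAI-compatible /v1/chat/completions route
// and selects the upstream via the x-portkey-config header (assumption).
function buildGemmaRequest(prompt, { config = "pc-opencl-aaae2d", model = "google/gemma-4-26b-a4b-it" } = {}) {
  return {
    url: "http://127.0.0.1:18900/v1/chat/completions",
    headers: {
      "Content-Type": "application/json",
      "x-portkey-config": config, // 26B config; use pc-opencl-d97353 for 31B
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  };
}

async function askGemma(prompt, opts) {
  const { url, headers, body } = buildGemmaRequest(prompt, opts);
  const res = await fetch(url, { method: "POST", headers, body });
  if (!res.ok) throw new Error(`portkey proxy error ${res.status}`);
  return (await res.json()).choices[0].message.content;
}
```

Because the cache is semantic with a 3600s TTL, re-sending a byte-identical prompt within the hour should hit the cache and cost nothing.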

Mac Ultra Ollama (future — when SSH is resolved)

When openclaw-mac-ultra-1 SSH access is fixed:

  • ollama pull gemma4:26b-moe — adds local free tier
  • Update GEMMA_PROXY_URL to route through Ollama first, then OpenRouter
  • Bulk tasks then run locally for free, with no API spend
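A sketch of the planned local-first routing, assuming Ollama's OpenAI-compatible endpoint on its default port 11434; the host name and model tag are taken from the notes above, and the fallback order is an assumption:

```javascript
// Sketch: try the Mac Ultra's local Ollama first, fall back to the paid
// Portkey proxy only if the local host is unreachable or errors.
const ROUTES = [
  { url: "http://openclaw-mac-ultra-1:11434/v1/chat/completions", model: "gemma4:26b-moe" },        // free, local
  { url: "http://127.0.0.1:18900/v1/chat/completions", model: "google/gemma-4-26b-a4b-it" },        // paid fallback
];

async function routeGemma(prompt) {
  let lastErr;
  for (const { url, model } of ROUTES) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      if (res.ok) return (await res.json()).choices[0].message.content;
      lastErr = new Error(`HTTP ${res.status} from ${url}`);
    } catch (err) {
      lastErr = err; // host down or SSH tunnel not up yet: try next route
    }
  }
  throw lastErr;
}
```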

Benchmarks (Gemma 4 vs Claude Sonnet 4.6)

| Benchmark    | Gemma 4 31B    | Claude Sonnet 4.6 | Winner |
|--------------|----------------|-------------------|--------|
| GPQA Diamond | 84.3%          | 74.1%             | Gemma  |
| MMLU Pro     | 85.2%          | ~80%              | Gemma  |
| HumanEval    | 81.8%          | ~92%              | Claude |
| Arena ELO    | 1452 (#3 open) | proprietary       | n/a    |

Quota

No hard quota — OpenRouter pay-per-use. At $0.06/M input:

  • 1M tokens = $0.06 (roughly 750K words of input)
  • Typical RAG compression call: ~8K tokens = $0.0005
  • Daily heavy use (1000 calls × 8K tokens): ~$0.50/day
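The arithmetic above reduces to one formula: tokens ÷ 1M × rate. A hypothetical helper (input-side pricing only; output tokens are not modeled):

```javascript
// Hypothetical cost estimator using the per-tool input rates quoted above.
const RATE_PER_M = { gemma_ask: 0.06, gemma_reason: 0.13, gemma_code: 0.06 };

function estimateCost(tool, tokensPerCall, calls) {
  const totalTokens = tokensPerCall * calls;
  return (totalTokens / 1_000_000) * RATE_PER_M[tool];
}

// Daily heavy use: 1000 RAG-compression calls at ~8K tokens each
console.log(estimateCost("gemma_ask", 8000, 1000).toFixed(2)); // prints "0.48"
```

So the "~$0.50/day" figure above is 8M input tokens at the 26B rate, rounded up.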