Gemma 4 — Local/Cloud LLM Delegation
Use this skill to delegate tasks to Google Gemma 4 when you want a fast, cheap, open-source model instead of Claude. Gemma 4 routes through the VPS Portkey proxy (:18900) → OpenRouter → google/gemma-4-26b-a4b-it or google/gemma-4-31b-it.
Cost: $0.06/M input (26B MoE) or $0.13/M input (31B) via OpenRouter, roughly 50× cheaper than Claude Sonnet.
When to use Gemma
Trigger on any of these:
- Henry says “ask gemma”, “use gemma for this”, “/gemma”, “delegate to gemma”, “route to gemma”
- High-volume bulk tasks: summarizing many documents, classifying leads, compressing RAG context
- Science/reasoning questions where GPQA-level accuracy matters (Gemma 4 31B: 84.3% vs Claude Sonnet: 74.1%)
- Long-context document processing (256K window)
- Privacy-sensitive tasks you want processed by an open-weight model (Apache 2.0)
- Draft generation for internal use (not client-facing)
When NOT to use Gemma
- Multi-step tool chains (chained function calls fail across all Gemma 4 sizes)
- Client-facing SMS, email, or outputs where one hallucinated name = deal-breaker
- Real-time multi-turn conversation requiring coherent memory (the E4B variant scored 0% on multi-turn coherence in enterprise benchmarks)
- Knowledge after January 2025 training cutoff
- Novel architecture decisions (use Claude for those)
Three tools
| Tool | Model | Cost | Use |
|---|---|---|---|
| gemma_ask | Gemma 4 26B MoE | $0.06/M in | Quick Q&A, classification, summarization, translation, bulk drafts |
| gemma_reason | Gemma 4 31B | $0.13/M in | Science, math, multi-step analysis, GPQA-level reasoning |
| gemma_code | Gemma 4 26B MoE | $0.06/M in | Boilerplate, refactoring, SQL, JSON schema, regex |
How to invoke from Claude
```
# Direct delegation (Henry's natural language)
"ask gemma: <question>"              → use gemma_ask
"use gemma to reason through: <X>"   → use gemma_reason
"have gemma write the code for: <X>" → use gemma_code

# With context
gemma_ask(query="summarize this", context="<long doc content>")
gemma_reason(problem="analyze this RE deal", context="<deal data>")
gemma_code(task="write a SQL query for...", language="sql")
```
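The trigger-phrase-to-tool mapping above can be sketched as a tiny dispatcher. This is an illustrative assumption, not the skill's actual parser: the regexes and the `pick_tool` helper are hypothetical, while the three tool names come from the table above.

```python
import re

# Hypothetical routing sketch: map a delegation request to one of the
# three gemma_* MCP tools. Patterns are illustrative, not the real parser.
ROUTES = [
    (re.compile(r"reason through|analyze|math|science", re.I), "gemma_reason"),
    (re.compile(r"write the code|refactor|sql|regex", re.I), "gemma_code"),
]

def pick_tool(request: str) -> str:
    """Return the gemma_* tool name for a natural-language delegation request."""
    for pattern, tool in ROUTES:
        if pattern.search(request):
            return tool
    # Default: quick Q&A, classification, summarization
    return "gemma_ask"
```

Order matters: reasoning triggers are checked before coding triggers, and anything unmatched falls through to the cheap `gemma_ask` default.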
Routing architecture
```
Claude (you) → gemma_ask/reason/code MCP tool
  → gemma-mcp server (node ~/.openclaw/tools/gemma-mcp/server.js)
  → Portkey proxy (127.0.0.1:18900)
  → Portkey config pc-opencl-aaae2d (26B) or pc-opencl-d97353 (31B)
  → OpenRouter PRIMARY (google/gemma-4-26b-a4b-it, $0.06/M)
  → Google AI Studio FALLBACK (when key is renewed)
```
Semantic caching is active (3600s TTL) — repeated identical questions cost $0.
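For debugging the chain outside the MCP tools, a request to the proxy can be built by hand. This sketch assumes the local Portkey gateway exposes the standard OpenAI-compatible chat-completions schema and selects a route via the `x-portkey-config` header; the `build_request` helper is hypothetical.

```python
import json
import urllib.request

PROXY_URL = "http://127.0.0.1:18900/v1/chat/completions"

def build_request(prompt: str,
                  config_id: str = "pc-opencl-aaae2d") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request routed through the Portkey proxy.

    config_id pc-opencl-aaae2d targets the 26B route; pc-opencl-d97353
    targets the 31B route (per the routing diagram above).
    """
    body = json.dumps({
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        PROXY_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-portkey-config": config_id,
        },
        method="POST",
    )

# Sending it requires the proxy to be up:
# with urllib.request.urlopen(build_request("Summarize: ...")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```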
Mac Ultra Ollama (future — when SSH is resolved)
When openclaw-mac-ultra-1 SSH access is fixed:
- `ollama pull gemma4:26b-moe` — adds local free tier
- Update `GEMMA_PROXY_URL` to route through Ollama first, then OpenRouter
- Estimated free inference for bulk tasks, no API spend
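The Ollama-first fallback could look roughly like this once SSH is fixed. A minimal sketch under assumptions: Ollama's default port 11434 and OpenAI-compatible `/v1` path, with `ollama_reachable` and `pick_base_url` as hypothetical helper names.

```python
import socket

def ollama_reachable(host: str, port: int = 11434, timeout: float = 0.5) -> bool:
    """Cheap TCP probe: True if something answers on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_base_url(ollama_up: bool,
                  ollama_url: str = "http://openclaw-mac-ultra-1:11434/v1",
                  proxy_url: str = "http://127.0.0.1:18900") -> str:
    """Prefer free local Ollama inference when reachable; otherwise fall
    back to the paid OpenRouter route via the Portkey proxy."""
    return ollama_url if ollama_up else proxy_url
```

The probe keeps fallback fast: a half-second timeout means a dead Mac Ultra costs almost nothing before traffic reverts to OpenRouter.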
Benchmarks (Gemma 4 vs Claude Sonnet 4.6)
| Benchmark | Gemma 4 31B | Claude Sonnet 4.6 | Winner |
|---|---|---|---|
| GPQA Diamond | 84.3% | 74.1% | Gemma |
| MMLU Pro | 85.2% | ~80% | Gemma |
| HumanEval | 81.8% | ~92% | Claude |
| Arena ELO | 1452 (#3 open) | proprietary | — |
Quota
No hard quota — OpenRouter pay-per-use. At $0.06/M input:
- 1M tokens = $0.06 (roughly 750K words of input)
- Typical RAG compression call: ~8K tokens = $0.0005
- Daily heavy use (1000 calls × 8K tokens): ~$0.50/day
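The arithmetic above can be checked with a tiny estimator. Prices come from the tools table; the `daily_cost` function name is illustrative, and only input tokens are counted since output pricing is not given here.

```python
# Input price per million tokens, from the tools table
PRICE_PER_M = {"gemma_ask": 0.06, "gemma_reason": 0.13, "gemma_code": 0.06}

def daily_cost(tool: str, calls: int, tokens_per_call: int) -> float:
    """Estimated daily input spend in dollars for a given tool."""
    return calls * tokens_per_call * PRICE_PER_M[tool] / 1_000_000

# 1000 calls × 8K tokens on gemma_ask ≈ $0.48/day, matching ~$0.50/day above
```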