Gemma 4 — Local/Cloud LLM Delegation

Use this skill to delegate tasks to Google Gemma 4 when you want a fast, cheap, open-source model instead of Claude. Gemma 4 routes through the VPS Portkey proxy (:18900) → OpenRouter → google/gemma-4-26b-a4b-it or google/gemma-4-31b-it.

Cost: $0.06/M input (26B MoE) or $0.13/M input (31B) via OpenRouter, roughly 50× cheaper than Claude Sonnet.

When to use Gemma

Trigger on any of these:

  • Henry says “ask gemma”, “use gemma for this”, “/gemma”, “delegate to gemma”, “route to gemma”
  • High-volume bulk tasks: summarizing many documents, classifying leads, compressing RAG context
  • Science/reasoning questions where GPQA-level accuracy matters (Gemma 4 31B: 84.3% vs Claude Sonnet: 74.1%)
  • Long-context document processing (256K window)
  • Privacy-sensitive tasks you want processed by an open-weight model (Apache 2.0)
  • Draft generation for internal use (not client-facing)

When NOT to use Gemma

  • Multi-step tool chains (chained function calls fail across all Gemma 4 sizes)
  • Client-facing SMS, email, or outputs where one hallucinated name = deal-breaker
  • Real-time multi-turn conversation requiring coherent memory (E4B scored 0% on multi-turn coherence in enterprise benchmarks)
  • Questions that depend on knowledge after the January 2025 training cutoff
  • Novel architecture decisions (use Claude for those)

Three tools

| Tool         | Model           | Cost (input) | Use                                                               |
|--------------|-----------------|--------------|-------------------------------------------------------------------|
| gemma_ask    | Gemma 4 26B MoE | $0.06/M      | Quick Q&A, classification, summarization, translation, bulk drafts |
| gemma_reason | Gemma 4 31B     | $0.13/M      | Science, math, multi-step analysis, GPQA-level reasoning           |
| gemma_code   | Gemma 4 26B MoE | $0.06/M      | Boilerplate, refactoring, SQL, JSON schema, regex                  |
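The tool split above can be expressed as a small dispatch map. This is a hypothetical helper, not part of the gemma-mcp server; the task-type keys are illustrative:

```javascript
// Hypothetical dispatcher: map a task type to the cheapest suitable tool.
// Unknown task types fall back to gemma_ask, the cheap 26B MoE tool.
const TOOL_FOR = {
  summarize: "gemma_ask",
  classify: "gemma_ask",
  translate: "gemma_ask",
  reason: "gemma_reason",
  math: "gemma_reason",
  boilerplate: "gemma_code",
  sql: "gemma_code",
};

function pickGemmaTool(taskType) {
  return TOOL_FOR[taskType] ?? "gemma_ask";
}
```

The default-to-cheapest choice mirrors the cost table: only reasoning-heavy work should pay the 31B rate.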

How to invoke from Claude

# Direct delegation (Henry's natural language)
"ask gemma: <question>"             → use gemma_ask
"use gemma to reason through: <X>"  → use gemma_reason
"have gemma write the code for: <X>" → use gemma_code

# With context
gemma_ask(query="summarize this", context="<long doc content>")
gemma_reason(problem="analyze this RE deal", context="<deal data>")
gemma_code(task="write a SQL query for...", language="sql")

Routing architecture

Claude (you) → gemma_ask/reason/code MCP tool
    → gemma-mcp server (node ~/.openclaw/tools/gemma-mcp/server.js)
        → Portkey proxy (127.0.0.1:18900)
            → Portkey config pc-opencl-aaae2d (26B) or pc-opencl-d97353 (31B)
                → OpenRouter PRIMARY (google/gemma-4-26b-a4b-it, $0.06/M)
                → Google AI Studio FALLBACK (when key is renewed)

Semantic caching is active (3600s TTL) — repeated identical questions cost $0.
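The hop from the MCP server to the proxy is an ordinary OpenAI-compatible chat completion. A minimal sketch of how gemma-mcp might build that request — the endpoint path and the `x-portkey-config` header name are assumptions based on the diagram above, so verify them against your Portkey gateway version:

```javascript
// Sketch: build the request gemma-mcp would send to the local Portkey proxy.
// Assumes the gateway exposes an OpenAI-compatible /v1/chat/completions route
// and selects the upstream via the x-portkey-config header (assumption).
function buildGemmaRequest(prompt, { config = "pc-opencl-aaae2d", model = "google/gemma-4-26b-a4b-it" } = {}) {
  return {
    url: "http://127.0.0.1:18900/v1/chat/completions",
    headers: {
      "Content-Type": "application/json",
      "x-portkey-config": config, // 26B config; use pc-opencl-d97353 for 31B
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  };
}

async function askGemma(prompt, opts) {
  const { url, headers, body } = buildGemmaRequest(prompt, opts);
  const res = await fetch(url, { method: "POST", headers, body });
  if (!res.ok) throw new Error(`portkey proxy error ${res.status}`);
  return (await res.json()).choices[0].message.content;
}
```

Because the cache is semantic with a 3600s TTL, re-sending a byte-identical prompt within the hour should hit the cache and cost nothing.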

Mac Ultra Ollama (future — when SSH is resolved)

When openclaw-mac-ultra-1 SSH access is fixed:

  • ollama pull gemma4:26b-moe — adds local free tier
  • Update GEMMA_PROXY_URL to route through Ollama first, then OpenRouter
  • Bulk tasks then run locally for free, with no API spend
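A sketch of the planned local-first routing, assuming Ollama's OpenAI-compatible endpoint on its default port 11434; the host name and model tag are taken from the notes above, and the fallback order is an assumption:

```javascript
// Sketch: try the Mac Ultra's local Ollama first, fall back to the paid
// Portkey proxy only if the local host is unreachable or errors.
const ROUTES = [
  { url: "http://openclaw-mac-ultra-1:11434/v1/chat/completions", model: "gemma4:26b-moe" },        // free, local
  { url: "http://127.0.0.1:18900/v1/chat/completions", model: "google/gemma-4-26b-a4b-it" },        // paid fallback
];

async function routeGemma(prompt) {
  let lastErr;
  for (const { url, model } of ROUTES) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
      });
      if (res.ok) return (await res.json()).choices[0].message.content;
      lastErr = new Error(`HTTP ${res.status} from ${url}`);
    } catch (err) {
      lastErr = err; // host down or SSH tunnel not up yet: try next route
    }
  }
  throw lastErr;
}
```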

Benchmarks (Gemma 4 vs Claude Sonnet 4.6)

| Benchmark    | Gemma 4 31B    | Claude Sonnet 4.6 | Winner |
|--------------|----------------|-------------------|--------|
| GPQA Diamond | 84.3%          | 74.1%             | Gemma  |
| MMLU Pro     | 85.2%          | ~80%              | Gemma  |
| HumanEval    | 81.8%          | ~92%              | Claude |
| Arena ELO    | 1452 (#3 open) | proprietary       | n/a    |

Quota

No hard quota — OpenRouter pay-per-use. At $0.06/M input:

  • 1M tokens = $0.06 (roughly 750K words of input)
  • Typical RAG compression call: ~8K tokens = $0.0005
  • Daily heavy use (1000 calls × 8K tokens): ~$0.50/day
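The arithmetic above reduces to one formula: tokens ÷ 1M × rate. A hypothetical helper (input-side pricing only; output tokens are not modeled):

```javascript
// Hypothetical cost estimator using the per-tool input rates quoted above.
const RATE_PER_M = { gemma_ask: 0.06, gemma_reason: 0.13, gemma_code: 0.06 };

function estimateCost(tool, tokensPerCall, calls) {
  const totalTokens = tokensPerCall * calls;
  return (totalTokens / 1_000_000) * RATE_PER_M[tool];
}

// Daily heavy use: 1000 RAG-compression calls at ~8K tokens each
console.log(estimateCost("gemma_ask", 8000, 1000).toFixed(2)); // prints "0.48"
```

So the "~$0.50/day" figure above is 8M input tokens at the 26B rate, rounded up.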