Hetzner Hub
Hetzner provides the primary VPS substrate for all of OpenClaw. The single production VPS (srv1347501.hstgr.cloud) hosts the gateway, 36+ agents, all webhook handlers, cron jobs, and memory DBs. This hub documents the compute tier, right-sizing history, and the Reboot #5 incident. Read before making any infrastructure changes, service deployments, or scaling decisions. KB directory does not exist — this hub is authored from CLAUDE.md, ARCHITECTURE.md, and memory entries per the Wave 1 spec.
Quick reference
| Field | Value |
|---|---|
| Vendor | Hetzner (via Hostinger shared naming in DNS — VPS is Hetzner CCX-class) |
| URL | https://console.hetzner.cloud |
| Dashboard | https://console.hetzner.cloud |
| KB doc | SOURCE MISSING — no workspace/knowledge-base/hetzner/ directory. Sources: CLAUDE.md + ARCHITECTURE.md |
| Auth method | API Token (Bearer) + web login |
| Auth credential | op://Aurora/hetzner/login (web), op://Aurora/hetzner/api-token (API) |
| Cred-proxy port | n/a |
| Webhook port | n/a |
| Webhook handler | n/a |
| Webhook dedup table | n/a |
| Tunnel path | n/a |
| Outbound API base | https://api.hetzner.cloud/v1 |
| Rate limits | 3,600 req/hr per token (Hetzner Cloud API) |
| Rate-limit action | 429 → exponential backoff (3 retries), Discord ops alert |
| Cost | CCX43 ~$168.84/mo (CCX33 target ~$102/mo) |
| Backup/recovery | Hetzner snapshots (manual); no automated snapshot confirmed |
| Current instance | CCX43 (8 vCPU / 32 GB RAM, ~$168.84/mo) |
| Right-sized target | CCX33 (4 vCPU / 32 GB RAM — working set 6.5 GB, CCX33 right-sized) |
| VPS hostname | srv1347501.hstgr.cloud |
| Tailscale hostname | srv1347501.tailb025a7.ts.net |
| Region | Hetzner Cloud — single region (location pending drift check) |
| EU candidate | Hetzner Frankfurt CCX11 — €5.10/mo (BetterBets EU entity / Estonia OÜ node) |
| Discord alert channel | ops |
| Drift cadence | Daily — tool-calls-health-check.timer + G-FAILED-SERVICE-MTTR cron |
| Status | production |
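The 429 policy in the table (exponential backoff, 3 retries, then a Discord ops alert) can be sketched as a small helper. This is a minimal sketch, not the production client: the `do_request` callable stands in for whatever HTTP client hits `https://api.hetzner.cloud/v1`, the delay values are assumptions, and the alert path is reduced to a raised error.

```python
import time

API_BASE = "https://api.hetzner.cloud/v1"  # outbound API base from the quick-reference table

def call_with_backoff(do_request, path, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call do_request(url) and back off exponentially on HTTP 429.

    do_request is any callable returning an object with a .status_code;
    it is injected so the retry logic can be exercised without network access.
    """
    url = f"{API_BASE}{path}"
    for attempt in range(retries + 1):
        resp = do_request(url)
        if resp.status_code != 429:
            return resp
        if attempt < retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s between retries
    # Out of retries: surface the failure (the hub routes this to a Discord ops alert)
    raise RuntimeError(f"rate-limited after {retries} retries: {url}")
```

With `requests`, a real caller would look like `do_request=lambda url: requests.get(url, headers={"Authorization": f"Bearer {token}"})`.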
VPS substrate
All 36+ OpenClaw agents, the gateway (:18789), Portkey proxy (:18900), webhook handlers (:18790–:18803), and 44 SQLite memory DBs run on this single VPS. The architecture is intentionally single-node for cost efficiency; the long-term migration path is Mac mini (see Reboot #5 incident below).
Resource profile
| Resource | Current (CCX43) | Working set | Right-sized (CCX33) |
|---|---|---|---|
| vCPU | 8 | — | 4 |
| RAM | 32 GB | 6.5 GB active | 32 GB |
| Monthly cost | ~$168.84 | — | ~$102.00 |
Finding: the 6.5 GB working set fits comfortably in a CCX33 (32 GB RAM). The CCX43 was provisioned during a growth phase and is now over-sized. Recommendation: downsize to CCX33 and stay on the VPS, with Mac mini as the long-term migration path. See project_vps_reboot_5_internal_cascade_F1F4_2026-05-01 for the full incident analysis.
Key paths on VPS
| Path | Purpose |
|---|---|
| /home/opsadmin/.openclaw/ | All agent configs, scripts, workspace, memory DBs |
| /home/opsadmin/.openclaw/memory/*.sqlite | 44 SQLite agent memory DBs (verified 2026-05-01) |
| /home/opsadmin/.openclaw/workspace/ | Scripts, webhooks, knowledge base, plans |
| /tmp/openclaw/ | Runtime logs, fallback JSONL queues |
| ~/.ssh/ | SSH keys including openclaw-mac.pem for EC2 Mac |
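The "44 SQLite memory DBs" figure above is just a file count under the memory directory, so the verification reduces to a glob. A minimal sketch, assuming the path from the table (the directory argument is injectable so the check can run against any tree):

```python
from pathlib import Path

MEMORY_DIR = Path.home() / ".openclaw" / "memory"  # path from the table above

def count_memory_dbs(memory_dir=MEMORY_DIR):
    """Count per-agent SQLite memory DBs; this hub expects 44 on the VPS."""
    return len(list(Path(memory_dir).glob("*.sqlite")))
```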
Reboot #5 — Internal oomd cascade (F1-F4 applied 2026-05-01)
Incident date: 2026-05-01
Root cause: Internal OOM daemon (oomd) cascade — NOT external hardware failure. The VPS kernel OOM killer triggered a cascade termination of OpenClaw processes when memory pressure exceeded oomd thresholds.
Contributing factors:
- CCX43 was over-provisioned but processes were not memory-limited (no cgroup limits)
- Multiple agents running parallel LLM calls created spike memory pressure
- oomd threshold tuning not applied before the incident
F1-F4 fixes applied:
| Fix | Description |
|---|---|
| F1 | Set per-service MemoryHigh= + MemoryMax= cgroup limits in systemd units |
| F2 | Tuned oomd thresholds to be less aggressive during LLM call spikes |
| F3 | Added Restart=on-failure watchdog to critical services (gateway, portkey-proxy) |
| F4 | Added /tmp/openclaw/tool-calls-fallback.jsonl drain path for CHOKEPOINT-1 when Postgres unreachable post-cascade |
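F1 and F3 together amount to a systemd drop-in per service. A minimal sketch, assuming a user unit named openclaw-gateway.service; the unit name and the limit values are assumptions for illustration, not confirmed from the source:

```ini
# ~/.config/systemd/user/openclaw-gateway.service.d/override.conf
[Service]
# F1: cgroup memory limits — throttle at MemoryHigh before oomd escalates,
# hard-kill only at MemoryMax
MemoryHigh=2G
MemoryMax=3G
# F3: watchdog-style restart on crash
Restart=on-failure
RestartSec=5
```

Apply with `systemctl --user daemon-reload && systemctl --user restart openclaw-gateway`.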
Post-incident recommendation: Stay on VPS (CCX33 right-sized) → eventual migration to Mac mini. The working set (6.5 GB) is well within CCX33 capacity.
Memory reference: project_vps_reboot_5_internal_cascade_F1F4_2026-05-01 · feedback_substrate_right_size_to_working_set
Service inventory
The full service inventory is not yet canonical — CLAUDE.md §Webhook Service Port Map documents 9 of ~41 live services (audit 2026-05-01). Phase 1.5 ships workspace/port-registry.md as the authoritative source.
Query live state:

systemctl --user list-units --state=active
sudo systemctl list-units --state=active | grep openclaw
pm2 list
ss -tlnp

G-FAILED-SERVICE-MTTR: any service in a failed state for >24h must be (a) fixed, (b) explicitly disabled, or (c) archived per feedback_archive_not_delete. A daily cron posts Discord alerts to #ops on failures.
G-SERVICE-PRE-START-DOC: new units must be added to CLAUDE.md port map AND workspace/ARCHITECTURE.md BEFORE first start.
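The G-FAILED-SERVICE-MTTR rule can be sketched as a parser over `systemctl` output plus a threshold check. This is a hypothetical sketch: the parsing assumes the `systemctl list-units --state=failed --no-legend` column layout, the 24h threshold comes from the rule above, and the Discord alert itself is left out.

```python
FAILED_MTTR_HOURS = 24  # G-FAILED-SERVICE-MTTR threshold

def parse_failed_units(listing):
    """Extract unit names from `systemctl list-units --state=failed --no-legend` output.

    Each non-empty line looks like:
      openclaw-foo.service loaded failed failed Some description
    (a leading bullet character may appear; it is stripped first).
    """
    units = []
    for line in listing.splitlines():
        fields = line.replace("\u25cf", " ").split()
        if fields and fields[0].endswith((".service", ".timer")):
            units.append(fields[0])
    return units

def overdue_units(failed_hours_by_unit, threshold=FAILED_MTTR_HOURS):
    """Units failed longer than the MTTR threshold: must be fixed, disabled, or archived."""
    return sorted(u for u, h in failed_hours_by_unit.items() if h > threshold)
```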
EU node candidate — Hetzner Frankfurt
As part of the BetterBets EU entity / Estonia OÜ plan (project_betterbets_eu_entity_estonia_oue), a Hetzner Frankfurt node is being evaluated:
| Property | Value |
|---|---|
| Location | Hetzner Frankfurt (Germany, EU) |
| Instance | CCX11 (2 vCPU / 4 GB RAM) |
| Cost | €5.10/mo |
| Purpose | EU regulatory compliance node for Binance.com + prediction markets |
| Status | PLANNED — 5 blockers open (B1 CPA review, B2 Binance UBO policy, B3 PR Act 60, B4 Polymarket legal, B5 Binance×PR) |
Cannot trade today. The EU node is a dependency for EU-regulated trading operations but is not yet provisioned. Refer to betterbets-eu-entity-estonia-oue for current blocker status.
Components
- `workspace/ARCHITECTURE.md` — system architecture reference (has drift note: 25 days stale as of 2026-05-02)
- `workspace/scripts/security-audit-funnel.js` — weekly audit (runs on this VPS)
- `systemctl --user list-units` — authoritative live service inventory
- `~/.openclaw/tools/openclaw-vault-sync.sh` — vault rsync to GitHub every 15 min
- `~/.openclaw/tools/openclaw-vault-pull.sh` — vault pull from GitHub every 5 min
- `workspace/port-registry.md` — PLANNED (Phase 1.5); not yet built
How it’s used
- Trigger: all OpenClaw workloads run on this VPS — it IS the substrate
- Flow: Internet → Cloudflare Tunnel (edge, see cloudflare) → VPS → handler ports → agent dispatch via gateway → LLM via Portkey → Discord response
- Agents involved: all 36+ agents — the VPS is their execution environment
- Failure mode: VPS crash → total outage. F1-F4 reduce cascade risk. Recovery: Hetzner console reboot → services restart via the systemd `Restart=on-failure` watchdog.
- Success criteria: `systemctl --user status openclaw-gateway portkey-proxy` both `active (running)`; gateway logs flowing at `/tmp/openclaw/openclaw-$(date +%Y-%m-%d).log`
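The success criteria above can be expressed as a tiny check with the service states injected, so the logic is testable off-box. The unit names and log-path convention come from this hub; the checker itself is a hypothetical sketch, not an existing script:

```python
from datetime import date

def gateway_log_path(day=None):
    """Dated gateway log path per this hub's convention."""
    day = day or date.today()
    return f"/tmp/openclaw/openclaw-{day:%Y-%m-%d}.log"

def healthy(states):
    """True when both critical services report active, per the success criteria."""
    return all(states.get(u) == "active" for u in ("openclaw-gateway", "portkey-proxy"))
```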
Cross-links
Agents that touch this
- _summary — primary builder, runs on this VPS
- _summary — infrastructure oversight, service monitoring
- _summary — central orchestration, 80%+ of cron jobs
Plans that govern this
- openclaw-fragmentation-fix-2026-05-01 — Phase 1.5 ships port-registry.md
- project_vps_reboot_5_internal_cascade_F1F4_2026-05-01 — Reboot #5 incident + F1-F4
Feedback rules
- feedback_substrate_right_size_to_working_set — size instances to working set, not peak
- feedback_service_pre_start_doc — G-SERVICE-PRE-START-DOC: document before start
- feedback_failed_service_mttr — G-FAILED-SERVICE-MTTR: 24h max tolerated downtime
- feedback_archive_not_delete — archive retired services, don’t delete
- reference_hetzner_credentials_1p — Hetzner web login + API token in 1Password vault Aurora
KB / source docs
- CLAUDE.md §Webhook Service Port Map — partial service table (9/41 services documented)
- CLAUDE.md §Vault Sync Timers — vault-sync + vault-pull systemd timers
- `workspace/ARCHITECTURE.md` — has 25-day drift note; live state via `systemctl` + `ss` + `pm2`
- SOURCE MISSING: `workspace/knowledge-base/hetzner/` — no KB dir. Authored from CLAUDE.md + ARCHITECTURE.md per Wave 1 spec.
System maps
- infrastructure — full infra topology
- vm-overview — VPS service map
Related: Infra/compute cluster
This hub is the anchor for the Infra/compute cluster:
- cloudflare — public edge, tunnel, WAF, DNS
- aws — EC2 Mac Ultra (arm64), S3 buckets, IL scraping dependency
- github — vault backup (`traewayrer/openclaw-vault`), CI/CD
Cost context: CCX43 (~$168.84/mo) + AWS Mac Ultra + misc = primary infra cost center. CCX33 right-sizing saves ~$66/mo. Full cost tracking: cost-tracking.
Related: Credential layer cluster
Hetzner credentials live in 1Password vault Aurora.
- 1password — credential vault anchor
- Cred refs: `op://Aurora/hetzner/api-token` + `op://Aurora/hetzner/login`
- Memory: reference_hetzner_credentials_1p
Open issues / TODOs
- CCX43 → CCX33 downsize pending — saves ~$66/mo; requires coordinated downtime window
- Phase 1.5 `workspace/port-registry.md` not yet built — 23 of 41 services undocumented (G-SERVICE-PRE-START-DOC technical debt)
- Hetzner KB dir missing — create `workspace/knowledge-base/hetzner/`, populate API.md, and add to the CLAUDE.md platforms list per G-KB-SYNC-WITH-CLAUDEMD
- EU Frankfurt node — PLANNED, blocked on 5 BetterBets EU blockers
- Mission Control Docker containers (`:3000`, `:8000`, `:5432`, `:6380`) exposed on `0.0.0.0` — should be localhost-only per the ARCHITECTURE.md security note
Recent activity
- 2026-05-03: hub created (W1-S7 sub-agent) — sourced from CLAUDE.md + ARCHITECTURE.md (KB dir missing)
- 2026-05-01: Reboot #5 internal oomd cascade; F1-F4 fixes applied; CCX43 right-size analysis complete
- 2026-05-01: Phase D fragmentation-fix audit; 37 live services found vs 9 documented