Overview
A ManagedServer is the unit of fleet membership. Each remote node is registered, has a status (ONLINE / OFFLINE / UNKNOWN), and may optionally run a self-healing orchestrator that diagnoses and recovers from failures automatically.
The platform provides two ways to add a node:
- Connect Existing: you already have a VPS. You bring SSH credentials (and optionally an API URL/token and a gateway secret).
- Provision New: you only have SSH credentials. Grid's install.sh runs over SSH, lays down the platform, and auto-fills api_url / api_token.
Either path produces a ManagedServer row. From that point on, the server is part of the fleet and can be a target for transfers, deployments, and self-healing.
Use multi-server when you need to: spread workloads across multiple VPSes, repatriate a service that lives on a remote node, mix a control plane with edge nodes, or keep one source of truth while running compute closer to your users.
Architecture
A Grid fleet is a leader-elected cluster of ManagedServer records, all reading from a MeshNetwork of WireGuardPeer entries. The local primary is the control plane; remote nodes are either full-stack followers or lightweight agents.
Roles at a glance
| Role | What it does |
|---|---|
| Master / Controller | Control plane. Runs PostgreSQL, Redis, RabbitMQ, Caddy, frontend, and the management API. Holds the source of truth and the leader-election term. Exactly one per cluster. |
| Follower (Full Node) | Remote ManagedServer that runs the entire platform stack locally — its own Traefik, RabbitMQ, and (optionally) PostgreSQL — but no frontend or Caddy. |
| Lite Agent | Compute-only worker that does not run a local database. Connects to the master's PostgreSQL, RabbitMQ, and Redis over the WireGuard mesh. |
Side-by-side comparison
| Property | Primary | Follower | Lite Agent |
|---|---|---|---|
| Runs PostgreSQL | Yes | Yes (own) | No (uses master) |
| Runs Caddy | Yes | No (Traefik) | No (Traefik) |
| Accepts workloads | No (control plane) | Yes | Yes |
| WireGuard mesh | Local peer | Member | Member (mandatory) |
| Cluster role | LEADER | FOLLOWER | FOLLOWER |
| Connection strategy | Direct | Token + HMAC V2 fallback | Local-DB reads + mesh-VPN upstream |
Node Modes
Primary (Master)
The master is the source of truth and the orchestrator. It is installed by running install.sh with no --mode flag, which produces the default platform stack. The installer writes a NODE_TYPE=master marker into .env and a corresponding ManagedServer row with is_primary=True, allow_user_workloads=False.
The master hosts the WireGuard default mesh as the local peer, issues API tokens and gateway secrets, owns the leader-election term, and holds the encryption keys used by all nodes (Fernet-encrypted credentials, BACKUP_ENCRYPTION_KEY).
Follower (Full-Stack Node)
A follower is a ManagedServer with is_primary=False, is_lite_agent=False, allow_user_workloads=True. It runs its own Docker Compose stack using docker-compose.prod.yml (no frontend, no Caddy) and serves containers via Traefik on port 80.
Use followers when the remote VPS has enough resources to run its own database and broker, when each region should be self-contained for performance or data-residency reasons, or when you are running a multi-tenant fleet and want to isolate tenants onto dedicated hosts.
Lite Agent
A Lite Agent is a ManagedServer with is_lite_agent=True. It runs docker-compose.agent-lite.yml: a subset of the platform that includes the backend, worker, and a local Redis/RabbitMQ, but not PostgreSQL. The agent's database connection points at the master over the WireGuard mesh (MASTER_MESH_IP), and its reads (services, deployments) hit the shared master database directly rather than through a proxy.
Use Lite Agents when the remote VPS is small (1-2 vCPU, 1-2 GB RAM) and you do not want to run PostgreSQL on it, when the agent is in a private subnet and can reach the master over WireGuard but not the public internet, or when you want to add a node quickly without provisioning a database.
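The three modes above are distinguished by flags on the ManagedServer row. The following is a minimal sketch of that mapping, not the platform's own code: the ServerFlags class and both helper functions are hypothetical names, and rejecting a row that is both primary and Lite Agent is an assumption the text does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServerFlags:
    """Hypothetical container for the ManagedServer flags named in this section."""
    is_primary: bool = False
    is_lite_agent: bool = False
    allow_user_workloads: bool = True


def node_mode(flags: ServerFlags) -> str:
    """Return 'primary', 'lite_agent', or 'follower' for a set of flags."""
    if flags.is_primary and flags.is_lite_agent:
        # Assumption: the two flags are mutually exclusive.
        raise ValueError("is_primary and is_lite_agent cannot both be set")
    if flags.is_primary:
        return "primary"
    if flags.is_lite_agent:
        return "lite_agent"
    return "follower"


def compose_file(flags: ServerFlags) -> Optional[str]:
    """Return the compose file named in the text, or None where none is named."""
    mode = node_mode(flags)
    if mode == "follower":
        return "docker-compose.prod.yml"
    if mode == "lite_agent":
        return "docker-compose.agent-lite.yml"
    # The text does not name a compose file for the primary's default stack.
    return None
```

For example, node_mode(ServerFlags(is_lite_agent=True)) yields "lite_agent", and the matching compose file is docker-compose.agent-lite.yml.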
Connecting a Server
Connect an existing VPS (UI)
- Open Servers in the sidebar and click Connect Existing.
- Enter a friendly name, the public IP or domain, and (optionally) the private IP for the WireGuard endpoint.
- Choose an auth strategy: API + token, API + gateway secret (HMAC), or SSH only.
- Set is_primary=False (the default) and allow_user_workloads=True to make the node a workload target.
- Submit. A background thread runs a health refresh: it probes candidate API URLs, detects the platform version, exchanges a token if needed, and updates status and WireGuard mesh membership.
Provision a new VPS (UI)
- Open Servers and click Provision New.
- Enter name, public IP, SSH port, SSH user, and either a password or a PEM-encoded private key.
- Optionally toggle is_lite_agent=True to install the agent-lite compose profile instead of the full stack.
- Submit. The installer runs over SSH, lays down the platform, and auto-fills api_url and api_token on the server record.
Connect an existing VPS (API)
curl -sS http://localhost:8000/api/v1/servers/ \
-H "Authorization: Token $SMSLY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Worker EU",
"host": "203.0.113.10",
"private_ip": "10.0.5.10",
"api_url": "http://203.0.113.10:8090",
"api_token": "smsly_…",
"ssh_user": "root",
"ssh_password": "REDACTED",
"is_primary": false,
"allow_user_workloads": true
}'
Provision a new VPS (API)
curl -sS -X POST http://localhost:8000/api/v1/servers/provision/ \
-H "Authorization: Token $SMSLY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Worker US",
"host": "198.51.100.20",
"ssh_user": "root",
"ssh_auth_method": "key",
"ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\n…\n-----END OPENSSH PRIVATE KEY-----",
"is_primary": false,
"allow_user_workloads": true,
"is_lite_agent": true
}'
Prerequisites: Root SSH access, TCP/22 reachable from the master, a supported Linux distribution (Ubuntu 20.04 / 22.04 / 24.04 LTS), and at least 2 vCPU / 4 GB RAM for a follower (1 vCPU / 1 GB RAM for a Lite Agent).
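The resource and distribution prerequisites can be checked before calling the provision endpoint. This is a hedged pre-flight sketch: prerequisite_problems and its parameters are hypothetical, and it only encodes the numbers listed above. It does not test SSH reachability.

```python
SUPPORTED_UBUNTU = {"20.04", "22.04", "24.04"}

# (minimum vCPU, minimum RAM in GB) per node mode, taken from the prerequisites.
MINIMUMS = {"follower": (2, 4.0), "lite_agent": (1, 1.0)}


def prerequisite_problems(mode: str, vcpu: int, ram_gb: float, ubuntu_release: str) -> list:
    """Return a list of human-readable problems; an empty list means 'looks fine'."""
    if mode not in MINIMUMS:
        raise ValueError(f"unknown mode: {mode!r}")
    min_vcpu, min_ram = MINIMUMS[mode]
    problems = []
    if vcpu < min_vcpu:
        problems.append(f"{mode} needs at least {min_vcpu} vCPU, got {vcpu}")
    if ram_gb < min_ram:
        problems.append(f"{mode} needs at least {min_ram} GB RAM, got {ram_gb}")
    if ubuntu_release not in SUPPORTED_UBUNTU:
        problems.append(f"Ubuntu {ubuntu_release} is not in the supported list")
    return problems
```

A 1 vCPU / 1 GB host on Ubuntu 24.04 passes as a Lite Agent but fails as a follower.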
Full API reference
See docs/multi-server.md in the repository for every endpoint, request body, response field, and error code — including the proxy/, heal/, diagnostics/, and run_command/ endpoints.
API Reference
All endpoints are mounted under /api/v1/servers/. Authentication is session- or token-based for user endpoints, and HMAC V2-signed for the internal node-to-node sync endpoints. Filter by ?status=ONLINE|OFFLINE|UNKNOWN on the list endpoint.
| Method & Path | Purpose |
|---|---|
| GET /servers/ | List servers. Filter with ?status=…. |
| POST /servers/ | Connect an existing server (Connect Existing). |
| GET /servers/{id}/ | Retrieve a server. |
| PATCH /servers/{id}/ | Partial update (rotate credentials, toggle workloads). |
| DELETE /servers/{id}/ | Remove from the fleet. |
| POST /servers/provision/ | Provision a brand-new node over SSH. |
| GET /servers/{id}/provision-logs/ | Stream live provisioning logs. |
| POST /servers/{id}/retry-provision/ | Re-run the idempotent installer. |
| POST /servers/{id}/update-server/ | Run the installer for an in-place update. |
| POST /servers/{id}/health_check/ | Probe a single server's API. |
| POST /servers/check_all/ | Health probe every server. |
| POST /servers/{id}/proxy/ | Forward a generic API request to a remote. |
| GET /servers/{id}/services/ | List services on a managed server. |
| GET /servers/{id}/deployments/ | List recent deployments on a managed server. |
| GET /servers/{id}/domains/ | Aggregate custom domains on a managed server. |
| POST /servers/{id}/heal/ | Trigger self-healing (action: restart, diagnose, full). |
| GET /servers/{id}/diagnostics/ | Read-only diagnostics snapshot. |
| POST /servers/{id}/run_command/ | Run an allow-listed diagnostic command over SSH. |
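As an illustration of the list endpoint and its status filter, the sketch below builds (but does not send) a request using only the Python standard library. The function name list_servers_request is hypothetical; the base URL and token header follow the curl examples in this document.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

VALID_STATUSES = {"ONLINE", "OFFLINE", "UNKNOWN"}


def list_servers_request(base_url: str, token: str, status: Optional[str] = None) -> Request:
    """Build a GET request for /api/v1/servers/, optionally filtered by status."""
    url = base_url.rstrip("/") + "/api/v1/servers/"
    if status is not None:
        if status not in VALID_STATUSES:
            raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
        url += "?" + urlencode({"status": status})
    return Request(url, headers={"Authorization": f"Token {token}"}, method="GET")
```

Passing the returned object to urllib.request.urlopen would perform the call; validating the filter client-side simply avoids a round trip for a misspelled status.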
Example: full node heal
curl -sS -X POST \
http://localhost:8000/api/v1/servers/7d3b1a8e-2c5f-4a6d-8e9b-0c1a2b3c4d5e/heal/ \
-H "Authorization: Token $SMSLY_TOKEN" \
-H "Content-Type: application/json" \
-d '{ "action": "full" }'
Self-Healing
The self-healing orchestrator classifies failures into FailureType enums and chooses a RecoveryAction. For node-level heals, the user-facing action is mapped to the orchestrator's recovery surface.
| User-facing action | Internal action | What it does |
|---|---|---|
| restart_container | RESTART_CONTAINER | docker restart <container> and re-checks after 20s. |
| restart_stack | RESTART_STACK | docker compose up -d in /opt/smsly-hosting. |
| restart_docker_daemon | RESTART_DOCKER_DAEMON | systemctl restart docker and verifies with docker info. |
| diagnose | — | Read-only diagnostics. No recovery. |
| full | RESTART_STACK | Node-level: restart_stack. Deployment-level: walks the suggested-action chain and escalates to AI after 5 attempts. |
Two limits apply: HEAL_COOLDOWN_SECONDS=120 (no new heal for the same scope within two minutes) and MAX_HEAL_ATTEMPTS=5 (after five attempts the orchestrator returns ESCALATE_TO_AI).
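The node-level action mapping and the two limits can be sketched together as a single decision function. This is an illustration under stated assumptions, not the orchestrator's code: heal_decision is a hypothetical name, the "DIAGNOSE_ONLY" and "COOLDOWN" return labels are invented for the example, and checking the attempt limit before the cooldown is an assumed ordering.

```python
from typing import Optional

HEAL_COOLDOWN_SECONDS = 120
MAX_HEAL_ATTEMPTS = 5

# User-facing action -> internal action for node-level heals (from the table).
NODE_ACTION_MAP = {
    "restart_container": "RESTART_CONTAINER",
    "restart_stack": "RESTART_STACK",
    "restart_docker_daemon": "RESTART_DOCKER_DAEMON",
    "diagnose": None,  # read-only, no recovery action
    "full": "RESTART_STACK",
}


def heal_decision(action: str, attempts_so_far: int,
                  seconds_since_last_heal: Optional[float]) -> str:
    """Return the internal action to run, or a label explaining why none runs."""
    if action not in NODE_ACTION_MAP:
        raise ValueError(f"unknown heal action: {action!r}")
    internal = NODE_ACTION_MAP[action]
    if internal is None:
        return "DIAGNOSE_ONLY"  # hypothetical label
    if attempts_so_far >= MAX_HEAL_ATTEMPTS:
        return "ESCALATE_TO_AI"
    if (seconds_since_last_heal is not None
            and seconds_since_last_heal < HEAL_COOLDOWN_SECONDS):
        return "COOLDOWN"  # hypothetical label
    return internal
```

With these assumptions, a "full" heal on a node with no prior attempts resolves to RESTART_STACK, while any recovery request after five attempts resolves to ESCALATE_TO_AI.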
Security
The inter-node surface is hardened at five layers.
HMAC V2 signing
Every node-to-node call carries three headers:
- X-SMSLY-Remote-Sync: 1, which declares the request as a node-to-node sync.
- X-Request-Timestamp, in UNIX seconds; it must be within 300 seconds of the receiver's clock.
- X-Gateway-Signature-V2, an HMAC-SHA256 over METHOD|path|ts|sha256(body) using either the per-node gateway_secret or, as a last-resort fallback, the platform-wide GATEWAY_SECRET. Comparison uses constant-time hmac.compare_digest.
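A minimal sketch of signing and verifying under this scheme follows. Several details are assumptions the text does not specify: that sha256(body) and the signature are hex-encoded, that the method is upper-cased, and that the timestamp is rendered as a decimal integer. The function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300


def sign_v2(secret: bytes, method: str, path: str, ts: int, body: bytes) -> str:
    """HMAC-SHA256 over METHOD|path|ts|sha256(body); hex encodings are assumed."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{method.upper()}|{path}|{ts}|{body_hash}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def build_headers(secret: bytes, method: str, path: str, body: bytes,
                  now: Optional[float] = None) -> dict:
    """Produce the three node-to-node headers for one request."""
    ts = int(time.time() if now is None else now)
    return {
        "X-SMSLY-Remote-Sync": "1",
        "X-Request-Timestamp": str(ts),
        "X-Gateway-Signature-V2": sign_v2(secret, method, path, ts, body),
    }


def verify_v2(secret: bytes, method: str, path: str, body: bytes,
              headers: dict, now: Optional[float] = None) -> bool:
    """Check the sync marker, the clock skew, and the signature."""
    if headers.get("X-SMSLY-Remote-Sync") != "1":
        return False
    try:
        ts = int(headers.get("X-Request-Timestamp", ""))
    except ValueError:
        return False
    current = int(time.time() if now is None else now)
    if abs(current - ts) > MAX_SKEW_SECONDS:
        return False
    expected = sign_v2(secret, method, path, ts, body)
    provided = headers.get("X-Gateway-Signature-V2", "")
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    return hmac.compare_digest(expected.encode(), provided.encode("utf-8", "replace"))
```

A receiver following the fallback rule would call verify_v2 first with the per-node gateway_secret and, only if that fails, with the platform-wide GATEWAY_SECRET.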
Token auth
For nodes where an API token has already been exchanged, the dashboard uses Authorization: Token <smsly_…>. Tokens are matched against the SHA-256 hash stored on the APIToken row and are revocable.
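Matching a presented token against a stored SHA-256 hash can be sketched as follows. The helper names and the revoked flag are hypothetical, and hex encoding of the stored hash is an assumption; the real APIToken model may differ.

```python
import hashlib
import hmac
from typing import Optional


def parse_authorization(header: str) -> Optional[str]:
    """Extract the token from an 'Authorization: Token <value>' header value."""
    scheme, _, value = header.partition(" ")
    value = value.strip()
    if scheme != "Token" or not value:
        return None
    return value


def hash_token(token: str) -> str:
    """SHA-256 of the raw token, hex-encoded (encoding is an assumption)."""
    return hashlib.sha256(token.encode()).hexdigest()


def token_matches(presented: str, stored_hash: str, revoked: bool = False) -> bool:
    """Compare in constant time; a revoked token never matches."""
    if revoked:
        return False
    return hmac.compare_digest(hash_token(presented).encode(), stored_hash.encode())
```

Storing only the hash means a leaked database row does not reveal a usable token.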
Command allow-list
POST /servers/{id}/run_command/ enforces a strict prefix allow-list. Allowed prefixes: "docker ", "cd /opt/smsly-hosting && docker ", "df ", "free ", "ping ", "systemctl status docker", and a redacted read of the local .env. Anything else returns 403.
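A prefix allow-list of this kind can be sketched in a few lines. The trailing spaces in the prefixes are inferred from the list above, the redacted .env read is omitted because its exact command is not given, and is_allowed is a hypothetical name.

```python
# Prefixes from the list above; trailing spaces are an inference from the text.
ALLOWED_PREFIXES = (
    "docker ",
    "cd /opt/smsly-hosting && docker ",
    "df ",
    "free ",
    "ping ",
    "systemctl status docker",
)


def is_allowed(command: str) -> bool:
    """Return True if the command starts with one of the allowed prefixes."""
    # str.startswith accepts a tuple and matches any of its members.
    return command.startswith(ALLOWED_PREFIXES)


def status_for(command: str) -> int:
    """Map the check to the HTTP status described in the text."""
    return 200 if is_allowed(command) else 403
```

Note that a pure prefix test like this one would accept "docker ps; rm -rf /", so a production check presumably needs more than this sketch shows.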
Encrypted credentials
api_token, gateway_secret, ssh_password, and ssh_key are all stored in EncryptedCharField / EncryptedTextField (Fernet) on the ManagedServer model. They are never returned by the API. The has_ssh_credentials boolean is the only credential-derived field in the public serializer.
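The effect on API output can be illustrated with a plain-dictionary sketch. This is not the project's serializer: public_view is a hypothetical helper, and deriving has_ssh_credentials from the presence of a password or key is an assumption.

```python
SECRET_FIELDS = ("api_token", "gateway_secret", "ssh_password", "ssh_key")


def public_view(record: dict) -> dict:
    """Drop secret fields and expose only a boolean derived from them."""
    visible = {k: v for k, v in record.items() if k not in SECRET_FIELDS}
    # Assumption: the flag is true when either SSH credential is set.
    visible["has_ssh_credentials"] = bool(
        record.get("ssh_password") or record.get("ssh_key")
    )
    return visible
```

For a record containing name, host, and ssh_key, the result keeps name and host, adds has_ssh_credentials=True, and contains no key material.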
Audit trail
Every meaningful state change is recorded through log_event(...) with a stable action code and a metadata payload. The AuditLog table is hash-chained and protected by BEFORE UPDATE OR DELETE triggers so audit records cannot be silently tampered with.
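Hash chaining in general works by folding each record's predecessor hash into its own hash, so that editing any earlier record invalidates everything after it. The sketch below shows the idea in memory; the field names, the JSON canonicalization, the genesis value, and the action codes in the usage note are all illustrative assumptions rather than the AuditLog schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record


def entry_hash(prev_hash: str, action: str, metadata: dict) -> str:
    """Hash a record together with the hash of the record before it."""
    payload = json.dumps(
        {"prev": prev_hash, "action": action, "metadata": metadata},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def append(chain: list, action: str, metadata: dict) -> None:
    """Append a new record linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "action": action,
        "metadata": metadata,
        "prev": prev,
        "hash": entry_hash(prev, action, metadata),
    })


def verify(chain: list) -> bool:
    """Recompute every hash in order; any edit breaks the chain."""
    prev = GENESIS
    for entry in chain:
        expected = entry_hash(prev, entry["action"], entry["metadata"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

After appending two records such as "server.connected" and "server.healed" (hypothetical action codes), verify returns True; changing the metadata of the first record makes it return False.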
Troubleshooting
"Server 'X' is currently OFFLINE. Transfers are only allowed to ONLINE nodes."
The connected server is registered but the health probe has not received a non-5xx response recently. Run POST /servers/{id}/health_check/ and watch for which candidate URL succeeds. The most common causes are a wrong public IP, a firewall blocking port 8090, or the WireGuard mesh not yet converged.
Mesh deploy fails with "WireGuard kernel module is not loaded on the host VPS"
The remote kernel does not have the wireguard module. SSH into the host, run sudo modprobe wireguard, and re-queue the mesh deploy. On hosts without DKMS, the module is provided by the kernel itself on most Ubuntu LTS images; on custom kernels, install wireguard-dkms and reboot.
Token auto-exchange fails with 401 / 403
The remote rejected the bootstrap. Verify that gateway_secret on the source matches GATEWAY_SECRET on the target. If the remote uses credential exchange, ensure ALLOW_REMOTE_PASSWORD_EXCHANGE=1 on the target and that the SSH password is the admin password.
"Provisioning FAILED — INSTALLATION FAILED"
The remote installer exited non-zero. Open provision_logs for the full stdout. Common causes: an unsupported Linux distribution, Docker failing to install, neither apt-get nor yum being present, or insufficient RAM. After fixing the underlying issue, re-run with retry-provision; the script is idempotent.
Self-heal never converges
After the fifth attempt the MAX_HEAL_ATTEMPTS=5 limit is reached and the orchestrator returns next_action=ESCALATE_TO_AI. If the platform intelligence is configured, the AI Senate analyzes the diagnostic context and proposes commands. Otherwise the heal log is the only artifact: open it from the heal endpoint and address the root cause manually.
A remote node is "ONLINE" but the proxy returns remote_unreachable
The health probe found a working base URL, but the proxy candidate-URL rotation tried a different URL and the remote is no longer answering. The proxy falls through the candidate list with multiple auth modes (token, then HMAC, then none) and surfaces remote_unreachable=true with the upstream error. Usually transient — re-run the call.
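The fall-through behavior can be sketched as a nested loop over candidate URLs and auth modes. This is a rough illustration only: whether the real proxy iterates URL-first or mode-first is not stated (URL-first is assumed here), send is a hypothetical callable standing in for the HTTP call, and the returned dictionary shape is invented apart from the remote_unreachable key.

```python
from typing import Callable, List, Tuple


def attempt_plan(candidate_urls: List[str], has_token: bool,
                 has_secret: bool) -> List[Tuple[str, str]]:
    """List (url, auth_mode) attempts: token, then HMAC, then none, per URL."""
    modes = []
    if has_token:
        modes.append("token")
    if has_secret:
        modes.append("hmac")
    modes.append("none")
    return [(url, mode) for url in candidate_urls for mode in modes]


def proxy(candidate_urls: List[str], has_token: bool, has_secret: bool,
          send: Callable[[str, str], object]) -> dict:
    """Try each attempt in order; report remote_unreachable if all fail."""
    last_error = None
    for url, mode in attempt_plan(candidate_urls, has_token, has_secret):
        try:
            return {"remote_unreachable": False, "response": send(url, mode)}
        except ConnectionError as exc:
            last_error = str(exc)  # keep the most recent upstream error
    return {"remote_unreachable": True, "error": last_error}
```

Because every attempt is independent, a retry of the whole call starts again from the first candidate, which is consistent with the advice to simply re-run a transient failure.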
Domain aggregation truncates at 50 pages
The full-follower implementation paginates through /api/v1/services/ with a hard cap of 50 pages. A node with more than 50 pages of services (more than 500 services at the default page size) will not have all of its domains listed. Use the per-service /services/ endpoint for exhaustive listings, or the master DB directly for Lite Agents.
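Whether a node is affected is simple arithmetic. The sketch below assumes a default page size of 10, which is inferred from the 50-page / 500-service figures above rather than stated outright; domains_truncated is a hypothetical helper.

```python
import math

MAX_PAGES = 50


def pages_needed(service_count: int, page_size: int = 10) -> int:
    """Number of pages required to list every service."""
    return math.ceil(service_count / page_size)


def domains_truncated(service_count: int, page_size: int = 10) -> bool:
    """True when the 50-page cap would cut off part of the listing."""
    return pages_needed(service_count, page_size) > MAX_PAGES
```

Under that assumption, 500 services fit exactly into 50 pages, and the 501st service is the first to fall outside the cap.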