Server Transfers

Move services between nodes in your Grid fleet. Drag-and-drop in the UI, or drive the pipeline from the API.

Overview

A server transfer moves a running workload from one Grid node to another with minimal downtime. The pipeline captures a snapshot of the source service, ships it to the target over SSH, restores it, and (when applicable) updates DNS so traffic follows the container to its new host.

Transfers run as background tasks. The API returns a new transfer record immediately; poll GET /api/v1/transfers/{id}/ for progress and live logs.

Common reasons to use transfers:

  • Rebalancing workloads across a multi-server fleet.
  • Moving a service off a primary/control-plane node to a dedicated worker.
  • Migrating from one Grid host to another (full server transfer).
  • Repatriating a service that was previously running on a remote node.

Before you start: Connect the target server under Servers → Connect Existing with its IP/domain and SSH credentials. Only workload-enabled servers (allow_user_workloads=True, is_primary=False) appear as transfer targets in the UI.

Transfer Types

Type | Scope | Use when
SERVICE | One service (and its addons, by association) | Moving a single workload between two nodes. Addons follow their parent service automatically.
FULL | Entire platform (database, all services, configuration) | Migrating a complete Grid instance. The target is reinstalled with install.sh and the platform database is restored.

Choose SERVICE for the common case. Use FULL only when relocating the entire platform — not individual workloads.

Prerequisites

  • Target server is registered and ONLINE. Connected under Servers with its public IP/domain and SSH credentials.
  • SSH credentials are available. Either stored on the connected target server, or supplied inline. Both SSH keys (PEM-encoded private key) and passwords are supported; password takes precedence if both are present.
  • Target is reachable on TCP/22. Bidirectional reachability is recommended so the target can confirm connectivity back to the source.
  • Target has a working Grid backend (for SERVICE transfers). The transfer engine starts it if it is down.
  • Domain is configured on the source (for automatic DNS cutover). Requires PlatformConfig.cloudflare_api_token and PlatformConfig.domain.
  • Encryption key is set on the source (BACKUP_ENCRYPTION_KEY) if any of its backups are encrypted.
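
The credential precedence described above (password wins when both a key and a password are present) can be sketched as follows. This is an illustration only: `choose_ssh_auth` is a hypothetical helper, not a function from the Grid codebase, and the error text is borrowed from the Troubleshooting section.

```python
from typing import Optional, Tuple


def choose_ssh_auth(ssh_key: Optional[str], ssh_password: Optional[str]) -> Tuple[str, str]:
    """Pick the SSH auth method; the password takes precedence if both are set."""
    if ssh_password:
        return ("password", ssh_password)
    if ssh_key:
        # Expected to be a PEM-encoded private key.
        return ("key", ssh_key)
    raise ValueError("No SSH credentials available for target server")
```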

How to Use

Drag-and-drop (Transfers page)

  1. Open Transfers in the sidebar. Connected workload-enabled servers appear as columns; the local primary node appears on the left.
  2. Optional: enter a New domain in the top bar. This sets target_public_domain for cross-platform migration (the service's public_domain is rewritten to <subdomain>.<target_domain> after the transfer completes).
  3. Drag a service or addon from one column and drop it onto the target column. Addons are moved by moving their parent service.
  4. The UI updates optimistically and POSTs the transfer request. The transfer then enters the pipeline and advances through its stages.
  5. Watch the Active Stream panel on the right. Each in-progress transfer shows a progress bar, current step, and the live status. The list polls every 5 seconds.
  6. When the status reaches COMPLETED, the service is live on the target. A Rollback button is available for 48 hours.
  7. To abort a transfer that has not yet completed, click Cancel. The transfer moves to CANCELLED and the source workload is left untouched.

API (scriptable)

The minimal flow:

  1. Resolve the target_server_id (UUID of the connected ManagedServer) and the source service_id (UUID of the Service record).
  2. POST the transfer request to /api/v1/transfers/ with transfer_type, service_id, source_server_id, and target_server_id.
  3. Poll for status. GET /api/v1/transfers/{id}/ returns status, progress_percent, current_step, and live logs.
  4. Decide follow-up. When status is COMPLETED, optionally POST /api/v1/transfers/{id}/rollback/ to revert. When status is FAILED mid-pipeline, the source workload remains in place.
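
The polling step can be sketched as a small loop. This is a minimal sketch, not the project's client: `fetch_transfer` and `wait_for_transfer` are hypothetical names, the 5-second interval simply mirrors the UI's polling cadence, and the response is assumed to be a JSON object with a `status` field. The fetch function is injected so the loop can be exercised without a live backend.

```python
import json
import time
import urllib.request
from typing import Callable

# Terminal statuses as listed in the Status Reference.
TERMINAL = {"COMPLETED", "FAILED", "ROLLED_BACK", "CANCELLED"}


def fetch_transfer(base_url: str, token: str, transfer_id: str) -> dict:
    """GET /api/v1/transfers/{id}/ and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/transfers/{transfer_id}/",
        headers={"Authorization": f"Token {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_transfer(
    fetch: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the transfer is terminal or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        record = fetch()
        if record["status"] in TERMINAL:
            return record
        if time.monotonic() >= deadline:
            raise TimeoutError(f"transfer still {record['status']} after {timeout}s")
        sleep(interval)
```

A caller would typically wrap the fetch in a lambda, for example `wait_for_transfer(lambda: fetch_transfer("http://localhost:8000", token, transfer_id))`, and then branch on the returned `status`.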

Create a transfer

curl -sS http://localhost:8000/api/v1/transfers/ \
  -H "Authorization: Token $SMSLY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "transfer_type": "SERVICE",
    "service_id": "9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21",
    "target_server_id": "7d3b1a8e-2c5f-4a6d-8e9b-0c1a2b3c4d5e"
  }'
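
The same request can be built with Python's standard library. A sketch only: `build_transfer_request` is a hypothetical helper, the base URL and UUIDs are the placeholder values from the curl example, and `source_server_id` is sent only when the caller passes one, since the curl example omits it.

```python
import json
import urllib.request
from typing import Optional


def build_transfer_request(
    base_url: str,
    token: str,
    service_id: str,
    target_server_id: str,
    source_server_id: Optional[str] = None,
    transfer_type: str = "SERVICE",
) -> urllib.request.Request:
    """Build (but do not send) the POST that creates a transfer."""
    payload = {
        "transfer_type": transfer_type,
        "service_id": service_id,
        "target_server_id": target_server_id,
    }
    if source_server_id is not None:
        payload["source_server_id"] = source_server_id
    return urllib.request.Request(
        f"{base_url}/api/v1/transfers/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. An HTTP 409 (an active transfer already exists for the same target) surfaces as `urllib.error.HTTPError`.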

Roll back a completed transfer

curl -sS -X POST \
  http://localhost:8000/api/v1/transfers/1f4a2c63-9b6e-4f01-b6a5-7c5d0a44a1a9/rollback/ \
  -H "Authorization: Token $SMSLY_TOKEN"

Cancel an in-progress transfer

curl -sS -X POST \
  http://localhost:8000/api/v1/transfers/1f4a2c63-9b6e-4f01-b6a5-7c5d0a44a1a9/cancel/ \
  -H "Authorization: Token $SMSLY_TOKEN"

Full API reference

See docs/transfers.md in the repository for every endpoint, request body, response field, and error code — including the internal register-incoming/ node-to-node sync endpoint.

Status Reference

A transfer transitions through the following pipeline. Each stage persists a progress_percent and a current_step so the UI can render a live progress bar without polling logs.

PREPARING  →  UPLOADING  →  RESTORING  →  DNS_CUTOVER  →  VERIFYING  →  COMPLETED
                                                                       │
                                                                       ├── ROLLED_BACK  (manual revert)
                                                                       ├── FAILED       (any stage can short-circuit here)
                                                                       └── CANCELLED    (user cancels an in-progress transfer)

Status | What happens | Terminal?
PREPARING | Source backup is created. On the target, Docker is verified and the Grid backend is started if needed. | No
UPLOADING | Backup is shipped to the target over SSH. For FULL, install.sh and .env are also uploaded. | No
RESTORING | Target unpacks the backup, hydrates the database row, loads the Docker image, restores volumes, and starts the container. | No
DNS_CUTOVER | Cloudflare A records are updated for FULL (apex + wildcard) or, for SERVICE on a Lite Agent target, a per-service A record is created. | No
VERIFYING | Health checks run on the target. WireGuard mesh is interconnected so source and target can communicate post-cutover. | No
COMPLETED | Transfer has finished. Service is reassigned to the target, the source container is stopped, and rollback_deadline is set to completed_at + 48h. | Yes
FAILED | A stage errored. The source workload remains on the source node. error_message is set to a redacted, human-readable summary. | Yes
ROLLED_BACK | A successful transfer was reverted. The service is reassigned back to the source and DNS is restored. | Yes
CANCELLED | A user cancelled an in-progress transfer. The source workload remains on the source node. | Yes

PREPARING, UPLOADING, RESTORING, DNS_CUTOVER, and VERIFYING are the active statuses. Only one active transfer can exist for a given (owner, target_ip, transfer_type[, service]) tuple — creating a second one returns HTTP 409.
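
For client scripts, the active/terminal split and the uniqueness tuple can be written down directly. This is a sketch with hypothetical names (`ACTIVE_STATUSES`, `TERMINAL_STATUSES`, `is_active`, `conflict_key`); the status strings come from the table above, and treating the service as part of the key only for SERVICE transfers is an assumption about what the optional `[, service]` element means.

```python
from typing import Optional

ACTIVE_STATUSES = frozenset(
    {"PREPARING", "UPLOADING", "RESTORING", "DNS_CUTOVER", "VERIFYING"}
)
TERMINAL_STATUSES = frozenset({"COMPLETED", "FAILED", "ROLLED_BACK", "CANCELLED"})


def is_active(status: str) -> bool:
    """True while the transfer is still moving through the pipeline."""
    return status in ACTIVE_STATUSES


def conflict_key(
    owner: str, target_ip: str, transfer_type: str, service_id: Optional[str] = None
) -> tuple:
    """Key under which at most one active transfer may exist.

    Assumption: the service only participates for SERVICE transfers.
    """
    key = (owner, target_ip, transfer_type)
    if transfer_type == "SERVICE":
        return key + (service_id,)
    return key
```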

Security

Transfers handle SSH credentials and the ability to execute commands on remote hosts. The pipeline is hardened at three layers.

SSRF Protection

Public transfer requests validate the resolved target IP. Loopback, link-local, multicast, reserved, and unspecified ranges are always rejected. Private ranges (RFC 1918) are accepted only when the target is a known ManagedServer — this prevents an unauthenticated caller from coercing the backend into opening SSH connections to internal infrastructure.
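
The range check can be approximated with Python's `ipaddress` module. This is a sketch of the rule as described, not the backend's actual validator: `is_target_ip_allowed` is a hypothetical name, the address is assumed to be already resolved, and `is_managed_server` stands in for the lookup against known ManagedServer rows.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
)


def is_target_ip_allowed(ip: str, is_managed_server: bool) -> bool:
    """Apply the SSRF rule to an already-resolved target address."""
    addr = ipaddress.ip_address(ip)
    # Always rejected, regardless of who the target is.
    if (
        addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    ):
        return False
    # Private ranges are accepted only for a known ManagedServer.
    if addr.version == 4 and any(addr in net for net in RFC1918_NETWORKS):
        return is_managed_server
    return True
```

The explicit RFC 1918 list is used instead of `addr.is_private` because `is_private` also covers loopback, link-local, and other special-purpose ranges.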

HMAC Node-to-Node Auth

The POST /api/v1/transfers/register-incoming/ endpoint never accepts session or token credentials. It requires:

  • X-SMSLY-Remote-Sync: 1 — declares the request as a node-to-node sync.
  • X-Request-Timestamp — UNIX seconds, must be within 300 seconds of now.
  • X-Gateway-Signature-V2 — HMAC-SHA256 over METHOD|path|ts|sha256(body) using the source ManagedServer.gateway_secret (or the platform GATEWAY_SECRET as a fallback). Comparison uses constant-time hmac.compare_digest.

The source IP must resolve to a ManagedServer row that already exists in the target's database; otherwise the request is rejected with 401.
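
The signing scheme can be sketched with the standard `hmac` and `hashlib` modules. Several details here are assumptions rather than documented behaviour: the `|` separator is taken literally, the body hash and the final signature are hex-encoded, and the secret is UTF-8 text. `sign_request` and `verify_request` are hypothetical names.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # timestamp must be within 300 seconds of now


def sign_request(secret: str, method: str, path: str, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over METHOD|path|ts|sha256(body), hex-encoded (assumed)."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{method.upper()}|{path}|{timestamp}|{body_hash}".encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_request(
    secret: str,
    method: str,
    path: str,
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[int] = None,
) -> bool:
    """Reject stale timestamps, then compare signatures in constant time."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign_request(secret, method, path, timestamp, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Including the timestamp in the signed message is what makes the 300-second window enforceable: a replayed request cannot carry a fresh timestamp without invalidating its signature.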

Encrypted Credential Storage

SSH keys and passwords are stored on the transfer record using EncryptedTextField / EncryptedCharField (Fernet) — values are encrypted at rest in the database.

The transfer worker scrubs these fields as soon as the transfer reaches a terminal state:

  • target_ssh_key and target_ssh_password are cleared on COMPLETED, FAILED, and ROLLED_BACK.
  • source_ssh_key and source_ssh_password are cleared on FAILED.
  • When the Celery worker fails to enqueue the transfer, all four fields are cleared on the FAILED record.

Transfer logs are also redacted before persistence: PEM private key blocks, *_TOKEN/*_SECRET/*_PASSWORD/*_KEY assignments, and user:password@ segments in URLs are stripped.
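
A rough idea of what such redaction looks like, using regular expressions. The patterns and the placeholder strings below are illustrative assumptions; the project's actual rules may match more or less than this, and `redact` is a hypothetical name.

```python
import re

# PEM private key blocks, e.g. "-----BEGIN OPENSSH PRIVATE KEY----- ... -----END ...".
PEM_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)
# NAME_TOKEN=value, NAME_SECRET=value, NAME_PASSWORD=value, NAME_KEY=value.
SECRET_ASSIGNMENT = re.compile(r"\b(\w*_(?:TOKEN|SECRET|PASSWORD|KEY))\s*=\s*\S+")
# scheme://user:password@host
URL_USERINFO = re.compile(r"(\w+://)[^/\s:@]+:[^/\s@]+@")


def redact(text: str) -> str:
    """Replace secrets in a log string with placeholders before it is stored."""
    text = PEM_BLOCK.sub("[REDACTED PRIVATE KEY]", text)
    text = SECRET_ASSIGNMENT.sub(r"\1=[REDACTED]", text)
    text = URL_USERINFO.sub(r"\1[REDACTED]@", text)
    return text
```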

Troubleshooting

"Target server IP is in a forbidden range (SSRF protection)"

The resolved target IP falls in a range that is always rejected (loopback, link-local, multicast, reserved, or unspecified), or in a private RFC 1918 range without a matching ManagedServer. For a private LAN address, select a connected ManagedServer (target_server_id); otherwise supply a public IP.

"No SSH credentials available for target server"

Neither target_ssh_key nor target_ssh_password was supplied, and the ManagedServer for the target has no stored credentials. Open Servers → Edit on the target and re-save the SSH key or password, or pass credentials in the API request body.

"Target server 'X' is currently OFFLINE. Transfers are only allowed to ONLINE nodes."

The connected server is registered but not currently online. Bring the target back online, wait for the next mesh probe to mark it ONLINE, then re-queue the transfer.

"Source SSH credentials required for node-to-node transfer."

The source is a connected (non-local) ManagedServer with no stored SSH credentials. Either pass source_ssh_key / source_ssh_password in the request, or edit the source server and save its SSH credentials.

"Encrypted backup detected but BACKUP_ENCRYPTION_KEY is not set."

The source's backup is encrypted, but the controller's environment does not have the matching key. Set BACKUP_ENCRYPTION_KEY in the source .env to the same value used at backup time, restart the backend, and re-create the transfer.

Transfer hangs in RESTORING

The remote Django restore script is waiting on the database. Check the database connection from inside the target backend container:

docker exec -it smsly-hosting-backend-1 python manage.py shell \
  -c "from django.db import connection; connection.ensure_connection()"

If the connection fails, the target's PostgreSQL is unreachable. Restart the database with docker compose -f docker-compose.prod.yml restart db on the target and let the transfer retry.

"RESTORE_FAILED: …" in transfer logs

The remote restore script reported an unrecoverable error. The full traceback is in the transfer's logs field. The most common cause is a mismatched owner email on the target — the script falls back to a superuser but logs a warning. Verify the source's service owner has a corresponding account on the target.

Rollback button is missing

can_rollback is False because either (a) the transfer did not complete, (b) the 48-hour rollback window has passed, or (c) rollback was already used. After the deadline the source state is no longer guaranteed to be intact and a rollback could corrupt the source.
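
The three conditions reduce to a status check plus a deadline comparison. A minimal sketch, assuming `rollback_deadline = completed_at + 48h` as stated in the Status Reference; the function names are hypothetical and treating the deadline as inclusive is an assumption. Case (c) is covered by the status check, because a transfer that has been rolled back is in ROLLED_BACK rather than COMPLETED.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ROLLBACK_WINDOW = timedelta(hours=48)


def rollback_deadline(completed_at: datetime) -> datetime:
    """Deadline after which rollback is no longer offered."""
    return completed_at + ROLLBACK_WINDOW


def can_rollback(
    status: str, completed_at: Optional[datetime], now: Optional[datetime] = None
) -> bool:
    """True only for a COMPLETED transfer still inside the 48-hour window."""
    if status != "COMPLETED" or completed_at is None:
        return False
    current = now or datetime.now(timezone.utc)
    return current <= rollback_deadline(completed_at)
```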

Service is live on target but DNS still points to source

Cloudflare DNS is only updated automatically when PlatformConfig.cloudflare_api_token and PlatformConfig.domain are set on the source. If either is missing, update the A record manually (for SERVICE on a Lite Agent target, point the service subdomain at the target; for FULL, point the apex + wildcard at the target).