Deployments

Source to running container. Git, Docker, upload, template, or inline function. Every step observable, audit-logged, rollback-safe.

Overview

A deployment is one attempt to promote a new revision of a service. Each deployment has a single status that advances through a fixed set of states. Deployments are asynchronous: the API returns the new record immediately and a Celery worker drives it through the pipeline.

Common reasons to use deployments:

  • Ship a new commit to a running service.
  • Roll back a broken release to the last ACTIVE revision.
  • Wire a Git provider to deploy on every push.
  • Promote a tagged release to production.
  • Re-run the pipeline after a settings change, env var update, or build-config tweak.

Deployments always run in the context of a Service. A service has a deploy_type (GIT, DOCKER, UPLOAD, TEMPLATE, or FUNCTION) that determines how the pipeline is wired.

Deployment Types

deploy_type | Source of truth | When to use
----------- | --------------- | -----------
GIT | A Git repository (GitHub, GitLab, Bitbucket) reachable from the build agent. | The common case: your application lives in a Git repo.
DOCKER | A pre-built image reference (e.g. ghcr.io/org/app:abc1234). | You build images elsewhere (CI, local Docker) and want Grid to host them.
UPLOAD | A source tarball uploaded through the API. | One-off deploys, prototypes, environments without a Git provider.
TEMPLATE | A one-click template from the Grid catalog. | Spinning up Postgres + Redis + app stacks with a few clicks.
FUNCTION | Inline source code stored on the Service row. | See Functions for the serverless workflow.

Build Phases

A GIT deployment passes through seven observable phases. The phase name is the pipeline_stages entry, and the deployment's status reflects the dominant phase.

QUEUED  →  REVIEW  →  BUILDING  →  PUSH  →  DEPLOYING  →  HEALTH_CHECK  →  ACTIVE
                          │            │           │              │
                          └─ BUILD_FAILED   PUSH_FAILED DEPLOY_FAILED HEALTH_FAILED → FAILED

  1. Clone: a shallow git fetch --depth=1 of the commit hash, into build_<deployment_id>_*.
  2. Analyze: reads package.json, pyproject.toml, requirements.txt, Dockerfile, and nixpacks.toml. The output is stored in Deployment.review_summary. Fresh GIT deploys pause at REVIEW.
  3. Build: the chosen buildpack (NIXPACKS, DOCKER, or STATIC) produces a container image.
  4. Push: on multi-node fleets, the image is pushed to the local insecure registry at MASTER_MESH_IP:5000. On a single node, the image is loaded into the local Docker daemon instead.
  5. Deploy: the new container is started. The strategy (ROLLING, BLUE_GREEN, or CANARY) is set on the service.
  6. Health check: Traefik sends GET <health_check_path> every health_check_interval (default 30s).
  7. Active: the new container is now serving traffic. All other ACTIVE deployments for the same service are demoted to INACTIVE.

Status Reference

Every deployment carries a single status value. The table below covers all defined statuses.

Status | Phase | Terminal?
------ | ----- | ---------
QUEUED | initial | No
REVIEW | analyze | No
BUILDING | build | No
BUILD_FAILED | build | Yes
AWAITING_APPROVAL | review | No
BACKUP_RUNNING | pre-deploy | No
BACKUP_FAILED | pre-deploy | No
MIGRATION_PLANNING | pre-deploy | No
MIGRATION_RUNNING | pre-deploy | No
MIGRATION_FAILED | pre-deploy | No
DEPLOYING | deploy | No
HEALTH_CHECK | health | No
ACTIVE | success | Yes (lifecycle)
INACTIVE | post-success | Yes (lifecycle)
FAILED | any | Yes
CANCELLED | any | Yes
ROLLING_BACK | any | No
ROLLED_BACK | terminal | Yes

BUILDING, DEPLOYING, HEALTH_CHECK, BACKUP_RUNNING, MIGRATION_RUNNING, and ROLLING_BACK are the active statuses. A service can only have one active deployment at a time; creating a second one returns HTTP 409 with the existing deployment in the response body.
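
The one-active-deployment rule can be illustrated with a small sketch. This is not the platform's code: the deployment rows are plain dicts invented for the example, and only the status names, the 201/409 codes, and the existing_deployment field come from this page.

```python
# Statuses the text above lists as "active" (a deployment is in flight).
ACTIVE_STATUSES = frozenset({
    "BUILDING", "DEPLOYING", "HEALTH_CHECK",
    "BACKUP_RUNNING", "MIGRATION_RUNNING", "ROLLING_BACK",
})


def find_blocking_deployment(deployments):
    """Return the first in-flight deployment, or None if a new one may start.

    `deployments` is a hypothetical list of dicts with "id" and "status"
    keys, standing in for the deployment rows of a single service.
    """
    for deployment in deployments:
        if deployment["status"] in ACTIVE_STATUSES:
            return deployment
    return None


def create_deployment_response(deployments):
    """Mimic the documented outcome: 201 on success, 409 on conflict."""
    blocking = find_blocking_deployment(deployments)
    if blocking is not None:
        # The 409 body carries the existing deployment.
        return 409, {"existing_deployment": blocking}
    return 201, {"status": "QUEUED"}
```

Note that ACTIVE itself is a lifecycle status, not an in-flight one, so a service with a live ACTIVE deployment can still accept a new deployment.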

API Reference

All endpoints are mounted under /api/v1/. Authentication is session- or token-based for user endpoints and HMAC-signed for node-to-node traffic.

Trigger a deployment

curl -sS -X POST http://localhost:8000/api/v1/deployments/trigger/ \
  -H "Authorization: Token $SMSLY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21",
    "provider_id": "f1c2b0c1-1234-5678-9abc-def012345678",
    "commit_hash": "abc1234"
  }'

Returns HTTP 201 with the new deployment record and status=QUEUED.
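
For callers scripting this from Python, the same request could be assembled with the standard library. This is a sketch that mirrors the curl call above; the function name build_trigger_request is made up for the example, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_trigger_request(base_url, token, service_id, provider_id, commit_hash):
    """Build (but do not send) the POST request for the trigger endpoint."""
    payload = {
        "service_id": service_id,
        "provider_id": provider_id,
        "commit_hash": commit_hash,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/deployments/trigger/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() sends it. Because deployments are asynchronous, the response only confirms that the record was queued.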

Cancel a deployment

curl -sS -X POST \
  http://localhost:8000/api/v1/deployments/2d3e4f5a-6b7c-8d9e-0f1a-2b3c4d5e6f7a/cancel/ \
  -H "Authorization: Token $SMSLY_TOKEN"

Cancellation is allowed only while the deployment is in QUEUED, REVIEW, BUILDING, or AWAITING_APPROVAL.

Approve a paused deployment

curl -sS -X POST \
  http://localhost:8000/api/v1/deployments/2d3e4f5a-6b7c-8d9e-0f1a-2b3c4d5e6f7a/approve/ \
  -H "Authorization: Token $SMSLY_TOKEN"

Roll back a deployment

curl -sS -X POST \
  http://localhost:8000/api/v1/deployments/2d3e4f5a-6b7c-8d9e-0f1a-2b3c4d5e6f7a/rollback/ \
  -H "Authorization: Token $SMSLY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"confirm": "true"}'

The confirm: "true" gate prevents accidental rollbacks. The endpoint creates a new deployment row with is_rollback=True.

One-click rollback

curl -sS -X POST \
  http://localhost:8000/api/v1/services/9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21/instant-rollback/ \
  -H "Authorization: Token $SMSLY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "5xx spike after deploy"}'

The endpoint looks up the most recent ACTIVE deployment and rolls back to it, so the caller does not need to know the deployment ID.

Full API reference

See docs/deployments.md in the repository for every endpoint, request body, response field, and error code — including /api/v1/deployments/{id}/rollback/, instant-rollback/, and the multi-server deploy/ / multi-deploy/ actions.

Webhook Setup

Grid accepts webhooks from GitHub, GitLab, and Bitbucket. Each delivery creates a deployment for the matching service, and the webhook handler is idempotent: a WebhookDelivery row is keyed on the provider's delivery_id, so duplicate deliveries are dropped.
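
The idempotency described above can be sketched with an in-memory stand-in for the WebhookDelivery table. The class and function names are hypothetical, and "accepted" is an assumed label for a first delivery. A real implementation would more likely rely on a database uniqueness constraint than on a Python dict.

```python
class DeliveryLog:
    """In-memory stand-in for the WebhookDelivery table (hypothetical)."""

    def __init__(self):
        self._seen = {}

    def record(self, provider, delivery_id):
        """Return "accepted" the first time a delivery is seen, else "ignored"."""
        key = (provider, delivery_id)
        if key in self._seen:
            return "ignored"
        self._seen[key] = "accepted"
        return "accepted"


def handle_webhook(log, provider, delivery_id, trigger):
    """Call `trigger()` only for the first delivery with this ID."""
    status = log.record(provider, delivery_id)
    if status == "accepted":
        trigger()
    return status
```

Keying on the pair (provider, delivery_id) in this sketch keeps IDs from different providers from colliding.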

GitHub

  1. In your repo, go to Settings → Webhooks → Add webhook.
  2. Set Payload URL to https://<your-grid-host>/api/v1/webhooks/github/.
  3. Set Content type to application/json.
  4. Set Secret to the same value as GITHUB_WEBHOOK_SECRET in the Grid .env.
  5. Choose Let me select individual events and enable Push and Pull request.
  6. Save. Push to the configured branch to fire a deployment.
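
When a secret is set, GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the X-Hub-Signature-256 header, prefixed with sha256=. The sketch below shows how a receiver can check that header. It illustrates the GitHub scheme in general and is not taken from Grid's handler; the function name is made up for the example.

```python
import hashlib
import hmac


def verify_github_signature(secret, body, signature_header):
    """Check a GitHub X-Hub-Signature-256 header against the raw body bytes."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    received = signature_header[len("sha256="):]
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The check must run on the exact bytes received. Re-serializing the parsed JSON first can change whitespace or key order and invalidate the signature.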

GitLab

  1. Settings → Webhooks in the project.
  2. URL: https://<your-grid-host>/api/v1/webhooks/gitlab/.
  3. Trigger: Push events and Merge request events.
  4. Set the Secret token to GITLAB_WEBHOOK_SECRET.

Bitbucket

  1. Repository settings → Webhooks → Add webhook.
  2. URL: https://<your-grid-host>/api/v1/webhooks/bitbucket/.
  3. Triggers: Repo: push and Pull request: created / updated.

Buildpacks

A service's buildpack field selects the build strategy. The default is NIXPACKS.

Buildpack | Behavior
--------- | --------
NIXPACKS | Detects the language and emits a multi-stage Dockerfile. Supports Node, Python, Go, Ruby, Rust, Java, PHP, Elixir, Deno, Bun.
DOCKER | Uses the Dockerfile at the service's root_directory (default /).
STATIC | Serves the directory as a static site. Traefik routes / to a small nginx container.

Environment Variables

Service.env_vars is a list of (key, value, is_secret, is_locked, source) rows. Values are stored in EncryptedCharField columns and decrypted at deploy time.

Precedence

The final env on the new container is the union of these sources, in this order (later overrides earlier):

  1. Platform defaults: PORT, SMSLY_API_KEY, SMSLY_PUBLIC_DOMAIN.
  2. Addon auto-injection: source=ADDON.
  3. Shortcode resolution: source=SHORTCODE. Example: {{pg.MAIN.DATABASE_URL}}.
  4. System auto-injection: source=SYSTEM. Includes DEPLOYMENT_ID, COMMIT_HASH, BRANCH, SERVICE_NAME.
  5. User-defined: source=USER. Highest precedence.

If a user-defined row is marked is_locked=True, it cannot be overridden by any auto-injection step.
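
The precedence rules can be illustrated as a layered merge. This is a sketch under stated assumptions: rows are plain dicts, and the source label PLATFORM is invented here because this page does not name a source value for platform defaults.

```python
# Sources in the documented order; later entries override earlier ones.
# "PLATFORM" is an assumed label for platform defaults.
SOURCE_ORDER = ["PLATFORM", "ADDON", "SHORTCODE", "SYSTEM", "USER"]


def resolve_env(rows):
    """Merge env rows into the final mapping for a container.

    Each row is a dict with "key", "value", "source", and optional "is_locked".
    """
    # Keys the user has locked: auto-injection steps must not set them.
    locked = {
        row["key"] for row in rows
        if row["source"] == "USER" and row.get("is_locked")
    }
    env = {}
    for source in SOURCE_ORDER:
        for row in rows:
            if row["source"] != source:
                continue
            if source != "USER" and row["key"] in locked:
                continue
            env[row["key"]] = row["value"]
    return env
```

In this model a locked key never takes an injected value, even transiently, while an unlocked user row simply wins by being applied last.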

Health Checks and Auto-Restart

Each service has its own health check config:

  • health_check_path (default /health)
  • health_check_port (blank = auto-detect from PORT env)
  • health_check_interval (default 30s)
  • health_check_timeout (default 300s)
  • health_check_retries (default 90)
  • auto_restart (default True)
  • restart_policy (always, unless-stopped, on-failure, no)

Containers can also push their own health status via the Service Health Webhook:

curl -X POST https://<your-grid-host>/api/v1/services/<service-id>/health/webhook/ \
  -H "X-Health-Webhook-Token: <service.health_webhook_token>" \
  -H "Content-Type: application/json" \
  -d '{"status": "healthy", "details": {"db": "ok", "cache": "ok"}}'

Accepted status values: healthy, unhealthy, starting, needs_manual_intervention.

Autoscaler Interaction

The autoscaler can mutate Service.min_replicas while a deploy is in flight. To prevent the deploy's container plan from drifting, the platform snapshots min_replicas onto the deployment row at queue time as Deployment.queued_min_replicas. The deploy executor uses this snapshot to decide how many containers to bring up at deploy time, not the live min_replicas field.

This means:

  • If a user triggers a deploy and the autoscaler is concurrently scaling up, the new deploy starts with the smaller count and the autoscaler brings the extra replicas online a few seconds later.
  • If the autoscaler is concurrently scaling down, the new deploy starts with the larger count and the autoscaler schedules a scale-down after its cooldown elapses.
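
The snapshot behavior can be sketched with two small dataclasses. These are illustrative stand-ins, not the platform's models; only the field names min_replicas and queued_min_replicas come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Service:
    min_replicas: int


@dataclass
class Deployment:
    service: Service
    queued_min_replicas: int


def queue_deployment(service):
    """Copy the live replica floor onto the deployment at queue time."""
    return Deployment(service=service, queued_min_replicas=service.min_replicas)


def replicas_to_start(deployment):
    """The executor reads the snapshot, not the live service field."""
    return deployment.queued_min_replicas
```

If the autoscaler raises service.min_replicas after queue_deployment() has run, replicas_to_start() still returns the queued value, which matches the first bullet above.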

See Autoscaling for the full replica controller design.

Security

Deployment Throttles

The DeploymentViewSet is gated by two DRF throttles:

  • BurstRateThrottle: 3/minute per user. Prevents rapid-fire re-triggers.
  • DeploymentRateThrottle: 10/hour per user. Prevents resource exhaustion from excessive builds.

Both return HTTP 429 with a Retry-After header.
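
To show how a per-user limit and its Retry-After value relate, here is a minimal sliding-window sketch. It is a standalone illustration, not the DRF throttle classes themselves; the class name and the explicit `now` argument are choices made for the example.

```python
class SlidingWindowThrottle:
    """Allow at most `limit` requests per `window` seconds for each user."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._history = {}

    def check(self, user, now):
        """Return (allowed, retry_after_seconds) for a request at time `now`."""
        # Keep only the timestamps that still fall inside the window.
        recent = [t for t in self._history.get(user, []) if now - t < self.window]
        if len(recent) >= self.limit:
            self._history[user] = recent
            # The oldest request in the window decides when a slot frees up.
            return False, self.window - (now - recent[0])
        recent.append(now)
        self._history[user] = recent
        return True, 0
```

With limit=3 and window=60, a fourth request three seconds after the first is refused with a wait of 57 seconds, which is the kind of value a Retry-After header would carry.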

Audit Log

Every state change on a deployment writes an AuditLog row. The chain is hash-linked — see the AuditLog.calculate_hash() and AuditLog.save() overrides in models_audit.py. Logs are immutable.
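
A hash-linked log can be sketched as follows. This is a generic illustration of the technique and not the contents of models_audit.py: the field names, the JSON serialization, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json


def entry_hash(previous_hash, event, payload):
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps(
        {"previous_hash": previous_hash, "event": event, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain, event, payload):
    """Append an entry whose hash covers the previous entry's hash."""
    previous_hash = chain[-1]["hash"] if chain else ""
    chain.append({
        "event": event,
        "payload": payload,
        "previous_hash": previous_hash,
        "hash": entry_hash(previous_hash, event, payload),
    })


def verify_chain(chain):
    """Return True only if every entry matches its stored hash and link."""
    previous_hash = ""
    for entry in chain:
        if entry["previous_hash"] != previous_hash:
            return False
        if entry["hash"] != entry_hash(previous_hash, entry["event"], entry["payload"]):
            return False
        previous_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, editing or removing any earlier row changes every hash after it, so tampering is detectable by re-walking the chain.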

Common audit events emitted by the pipeline:

  • DEPLOYMENT_TRIGGER — user triggered a new deployment.
  • DEPLOYMENT_ROLLBACK — user requested a specific rollback.
  • DEPLOYMENT_ROLLBACK_INSTANT — user clicked instant-rollback.
  • DEPLOYMENT_APPROVE — user approved a paused deployment.
  • DEPLOYMENT_CANCEL — user cancelled a deployment.

SSRF Protection

The deploy pipeline clones repositories over https:// or git://. URLs are validated against _validate_registry_url() which:

  • Rejects loopback, link-local, multicast, reserved, and unspecified ranges.
  • Accepts private RFC 1918 ranges only when the host resolves to a registered CloudProvider.
  • Rejects non-HTTPS URLs unless the host is in the platform's localhost / Docker service list.
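
The range checks above map closely onto Python's ipaddress module. The sketch below is not _validate_registry_url() itself. It is a simplified illustration in which trusted_private_hosts stands in for the registered CloudProvider lookup, the localhost / Docker service exception is omitted, and the resolver is injectable so the logic can be exercised without DNS.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_address(address):
    """True for ranges that are rejected outright."""
    ip = ipaddress.ip_address(address)
    return (ip.is_loopback or ip.is_link_local or ip.is_multicast
            or ip.is_reserved or ip.is_unspecified)


def validate_url(url, trusted_private_hosts=frozenset(), resolve=socket.gethostbyname):
    """Return True if `url` is HTTPS and resolves to an acceptable address."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    try:
        # gethostbyname is IPv4-only; a fuller check would inspect every
        # address returned by socket.getaddrinfo.
        address = resolve(parsed.hostname)
    except OSError:
        return False
    if is_blocked_address(address):
        return False
    if ipaddress.ip_address(address).is_private:
        # Private ranges pass only for explicitly trusted hosts.
        return parsed.hostname in trusted_private_hosts
    return True
```

Checking the resolved address, not the hostname string, matters because a public-looking name can resolve to 169.254.169.254 or another internal address.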

Troubleshooting

"Deployment already in progress (status: BUILDING)"

There is an active deployment for this service. Either wait for it to finish or POST /api/v1/deployments/{id}/cancel/. Creating a second active deployment returns HTTP 409 with the existing deployment in existing_deployment.

"Cannot cancel deployment in HEALTH_CHECK status"

HEALTH_CHECK is past the cancel boundary. Wait for the deployment to reach ACTIVE or FAILED, then trigger a rollback if needed.

Build hangs in BUILDING

The buildpack has stalled — usually a network failure (npm registry down, apt-get update timing out) or a runaway npm install cycle. Inspect GET /api/v1/deployments/{id}/build-logs/ for the live log tail.

"BUILD_FAILED: exit 137"

OOM-killed during build. Reduce build memory pressure (move large assets out of the build, use .dockerignore) or raise the platform's per-task memory limit (see docker-compose.prod.yml).

"ENCRYPTION_KEY_MISMATCH" at restore time

BACKUP_ENCRYPTION_KEY was rotated without restarting the backend, or the encrypted backup was created on a different installation. Set BACKUP_ENCRYPTION_KEY to the value used at backup time, restart the backend, and re-run the deploy.

Health checks pass on the dashboard but the public domain returns 502

The platform considers the container healthy, but the Traefik route is stale. Force a route re-check: POST /api/v1/services/{id}/recheck-health/ and then POST /api/v1/system/route-recheck/.

Webhook deliveries do not trigger deployments

Inspect the WebhookDelivery table — duplicate deliveries are recorded with status=ignored. The most common cause is a webhook signed with a secret that does not match the service owner's CloudProvider config.

"vulnerability_report is empty after build"

The Trivy scan was skipped. This happens when the image is on a registry that Trivy cannot reach. Configure TRIVY_REGISTRY_USERNAME / TRIVY_REGISTRY_PASSWORD in the platform .env and re-trigger.