Autoscaling

Three engines, one shared state. CPU-based hysteresis, Prometheus + Loki + AI for capacity, and a K8s-style admin surface for manual control.

Overview

Grid ships three autoscaler implementations that work together. The classic CPU-based engine handles day-to-day scale up / scale down with predictable hysteresis. The AI-enhanced engine adds Prometheus + Loki metrics, anomaly detection, and a paginated batch driver. The K8s / Docker admin surface provides manual replica control for operators.

| Path | Module | Trigger | Scope |
| --- | --- | --- | --- |
| Classic CPU | services/autoscaler.py | Celery beat, every minute | Every service, CPU threshold |
| AI-enhanced | tasks_autoscale.py + scaling_ai.py | Celery beat, every 60s | Every service, Prom + Loki + AI |
| K8s / Docker admin | apps/autoscaler/views.py | Manual (HTTP) | One service, per-call |

The classic engine is the default and is what the platform runs out of the box. The AI-enhanced engine is opt-in via AUTOSCALER_AI_ENABLED=True and requires the prometheus_loki integration. The admin surface is always available but requires IsAdminUser.

All three share the same Service fields (min_replicas, max_replicas, autoscale_cpu_target, last_scale_at) and the same MAX_REPLICAS global guard. They coordinate via a single row-level lock (see Race Conditions).

Classic Engine

The classic engine is a CPU-based, two-threshold controller with asymmetric cooldowns. It runs on Celery beat once per minute.

How It Works

For each service with min_replicas > 0 (or autoscale_cpu_target > 0):

  1. Read the current CPU average over the last minute (sourced from docker stats on the local node, or from a ManagedServer proxy call on a remote node).
  2. Compare to autoscale_cpu_target (default 70).
  3. Scale up if cpu > target + 5% (hysteresis) AND the service is not in cooldown.
  4. Scale down if cpu < target - 20% (wider hysteresis on the way down) AND the service is not in cooldown.
  5. If a scale event occurred, update Service.last_scale_at; then exit.

The asymmetric cooldown is the key invariant: scale-up cooldown is 1 minute, scale-down cooldown is 5 minutes. This is hard-coded and not configurable per service.
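The rules above can be sketched as a small pure function. This is an illustration only, not the code in services/autoscaler.py: the function name decide, the one-replica step size, and the reading of "+5%" and "-20%" as percentage points on the CPU reading are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Cooldowns stated in the text: 1 minute for scale-up, 5 minutes for scale-down.
SCALE_UP_COOLDOWN = timedelta(minutes=1)
SCALE_DOWN_COOLDOWN = timedelta(minutes=5)

# Assumption: the margins are percentage points added to / subtracted from the target.
UP_MARGIN = 5.0
DOWN_MARGIN = 20.0


def decide(cpu: float, target: float, replicas: int, min_replicas: int,
           max_replicas: int, last_scale_at: Optional[datetime],
           now: datetime) -> int:
    """Return the replica count the two-threshold rules would pick (hypothetical helper)."""
    elapsed = None if last_scale_at is None else now - last_scale_at

    def cooled_down(cooldown: timedelta) -> bool:
        # A service that has never scaled is not in cooldown.
        return elapsed is None or elapsed >= cooldown

    if cpu > target + UP_MARGIN and cooled_down(SCALE_UP_COOLDOWN):
        return min(replicas + 1, max_replicas)   # assumption: one replica per step
    if cpu < target - DOWN_MARGIN and cooled_down(SCALE_DOWN_COOLDOWN):
        return max(replicas - 1, min_replicas)
    return replicas                               # inside the dead band or cooling down
```

The caller would write last_scale_at only when the returned count differs from the current one, which keeps the cooldown clock tied to real scale events.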

The last_scale_at Field (NOT updated_at)

The cooldown is computed from Service.last_scale_at, not from Service.updated_at. The updated_at field is touched by any model save (env var edit, manual replica change, settings update) — using it for cooldown would let a side effect reset the autoscaler's clock. The last_scale_at field is only written by the autoscaler itself, on a real scale event. The same field is also written by the AI-enhanced engine so the two engines cannot oscillate against each other on the same service.

AI-Enhanced Engine

The AI-enhanced engine is a superset of the classic one. It uses Prometheus for CPU / memory metrics, Loki for runtime log volume, and (when configured) the Senate Committee for capacity recommendations. It runs on a 60-second beat.

Prometheus + Loki Integration

Metrics are scraped from the platform's Prometheus instance. The engine queries:

  • sum(rate(container_cpu_usage_seconds_total{service=~"<name>"}[1m])) — CPU rate
  • sum(container_memory_usage_bytes{service=~"<name>"}) — memory footprint
  • sum(rate(loki_log_entries_total{service=~"<name>"}[1m])) — log volume rate

If Prometheus or Loki is not reachable, the engine logs a warning and falls back to the classic docker stats path. Availability is detected at runtime via the PROMETHEUS_LIVE and LOKI_LIVE flags on PlatformConfig.
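As a rough sketch, the three queries can be built from a service name and turned into instant-query URLs for the Prometheus HTTP API. The helper names (build_queries, query_url, metrics_source) are hypothetical, and the fallback rule assumes that either flag being false is enough to fall back, as the Troubleshooting section describes.

```python
from urllib.parse import urlencode


def build_queries(service_name: str) -> dict:
    """Fill the three PromQL templates from the text with a service name.

    Assumption: the name contains no regex metacharacters, since it is
    used inside a =~ (regex) matcher.
    """
    selector = f'{{service=~"{service_name}"}}'
    return {
        "cpu": f"sum(rate(container_cpu_usage_seconds_total{selector}[1m]))",
        "memory": f"sum(container_memory_usage_bytes{selector})",
        "logs": f"sum(rate(loki_log_entries_total{selector}[1m]))",
    }


def query_url(base_url: str, promql: str) -> str:
    # /api/v1/query is the Prometheus instant-query endpoint.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def metrics_source(prometheus_live: bool, loki_live: bool) -> str:
    # Assumption: both backends must be up, otherwise use the classic path.
    return "prometheus_loki" if (prometheus_live and loki_live) else "docker_stats"
```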

Paginated Batch via id__gt Cursor

The engine walks all services in batches of 100 using a keyset cursor on the primary key:

qs = Service.objects.filter(id__gt=cursor).order_by("id")[:100]

This avoids the OFFSET performance cliff on large fleets. The cursor is saved with cache.set("autoscale:cursor", last_id, 600), so a worker that crashes resumes from the same point as long as it restarts before the 600-second timeout expires. The walk is incremental: each 60-second tick advances the cursor by 100 services, so a fleet of 10 000 services takes 100 ticks (~100 minutes) to complete a full sweep. The cursor is reset to 0 at the end of a sweep.
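The walk can be illustrated without Django. In this sketch a plain list of integer ids stands in for the Service table and a dict stands in for the cache; next_batch is a hypothetical name, and the "any ids left?" check is one possible way to detect the end of a sweep.

```python
BATCH_SIZE = 100
CURSOR_KEY = "autoscale:cursor"


def next_batch(all_ids, cache):
    """Run one tick: return the next batch of ids and advance or reset the cursor."""
    cursor = cache.get(CURSOR_KEY, 0)
    # Same shape as filter(id__gt=cursor).order_by("id")[:100], on a plain list.
    batch = sorted(i for i in all_ids if i > cursor)[:BATCH_SIZE]
    # The sweep is over when nothing sorts after the last id we handed out.
    has_more = bool(batch) and any(i > batch[-1] for i in all_ids)
    cache[CURSOR_KEY] = batch[-1] if has_more else 0
    return batch


# Usage: 250 services take three ticks, then the cursor wraps to 0.
cache = {}
ids = list(range(1, 251))
sizes = [len(next_batch(ids, cache)) for _ in range(3)]
```

Because each tick filters on id greater than the cursor instead of skipping rows, the cost of a tick does not grow with how far into the sweep the walk is.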

AI Recommendations

When AUTOSCALER_AI_ENABLED=True and an LLM is configured, the engine consults the Senate Committee on scale-up decisions that exceed max_replicas * 0.8 (i.e. the engine is about to hit the ceiling). The model is asked: "given the last 24 hours of CPU, memory, and request volume, should we raise max_replicas or hold it?" The response is logged to AuditLog with actor='AI_SCALER' and is advisory only — the engine does not auto-raise max_replicas based on the model output. An operator must approve the change in the UI or via API.
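The gate on when the model is consulted might look like the following. Both function names and the advisory record layout are assumptions made for illustration; only the 0.8 factor, the actor value, and the advisory-only behavior come from the text.

```python
def should_consult_ai(desired_replicas: int, max_replicas: int,
                      ai_enabled: bool, llm_configured: bool) -> bool:
    """True when a scale-up would pass 80% of the service's ceiling."""
    return ai_enabled and llm_configured and desired_replicas > max_replicas * 0.8


def advisory_record(service_name: str, model_output: str) -> dict:
    # Hypothetical shape of the logged entry; the recommendation is never applied here.
    return {
        "actor": "AI_SCALER",
        "target": service_name,
        "metadata": {"model_output": model_output, "applied": False},
    }
```

With max_replicas=10, a move to 9 replicas triggers a consultation while a move to 8 does not, because the comparison is strict.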

K8s / Docker Admin

The admin surface exposes a manual replica controller. It requires IsAdminUser (staff status) and is gated by ADMIN_AUTOSCALER_ENABLED (env, default True).

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/scaling/analyze/ | POST | One-shot analysis (current state + recommendation). |
| /api/v1/scaling/spawn/ | POST | Force-spawn a replica. Bypasses cooldowns. |
| /api/v1/scaling/replicas/ | GET | List current replica state. |
| /api/v1/scaling/destroy_replica/ | POST | Force-destroy a specific replica. |
| /api/v1/scaling/alert_config/ | PUT | Update Service.alert_config. |

Alert Config

Service.alert_config is a JSONField added in Batch C. It holds the per-service alert thresholds and the channel list. The schema is:

{
  "cpu_threshold": 85,
  "memory_threshold": 90,
  "error_rate_threshold": 0.05,
  "channels": ["email", "slack"],
  "slack_webhook_url": "https://hooks.slack.com/...",
  "cooldown_minutes": 15
}

PUT /api/v1/scaling/alert_config/ accepts a partial body. Although slack_webhook_url is accepted in the request alongside the other fields, it is stored in an EncryptedCharField on a related row (not in the JSON) and is never echoed back in responses.
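A minimal sketch of the partial-update behavior, assuming the handler merges the supplied keys into the stored JSON and routes the webhook URL to separate storage. The helper names and the SECRET_FIELDS set are illustrative, not the view's real code.

```python
SECRET_FIELDS = {"slack_webhook_url"}   # stored encrypted elsewhere, never in the JSON
ROUTING_FIELDS = {"service_id"}         # identifies the service; not part of the config


def apply_partial_update(stored: dict, body: dict):
    """Return (new_config, secrets) without mutating the stored config."""
    secrets = {k: v for k, v in body.items() if k in SECRET_FIELDS}
    updates = {k: v for k, v in body.items()
               if k not in SECRET_FIELDS and k not in ROUTING_FIELDS}
    # Keys absent from the body keep their stored values.
    return {**stored, **updates}, secrets


def to_response(config: dict) -> dict:
    # Defensive redaction: secrets are dropped even if one slipped into the config.
    return {k: v for k, v in config.items() if k not in SECRET_FIELDS}
```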

When the engine observes a breach, it writes an AuditLog row and notifies the configured channels. The cooldown_minutes field prevents the same alert from firing more than once per window per channel.
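The per-channel cooldown can be sketched with an in-memory map of last-sent times. AlertThrottle is a hypothetical class; where the platform actually keeps this state is not described here.

```python
from datetime import datetime, timedelta, timezone


class AlertThrottle:
    """Suppress repeats of the same alert on the same channel within a window."""

    def __init__(self, cooldown_minutes: int):
        self.cooldown = timedelta(minutes=cooldown_minutes)
        self._last_sent = {}  # (alert_name, channel) -> time of last notification

    def channels_to_notify(self, alert_name: str, channels, now: datetime):
        due = []
        for channel in channels:
            last = self._last_sent.get((alert_name, channel))
            if last is None or now - last >= self.cooldown:
                self._last_sent[(alert_name, channel)] = now
                due.append(channel)
        return due
```

Keying on the (alert, channel) pair means a channel added later is notified immediately, even if another channel is still inside its window.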

Security

MAX_REPLICAS Guard

A global MAX_REPLICAS env var (default 32) caps the replica count on a single service. The classic engine, the AI-enhanced engine, and the admin surface all respect this cap. The check is enforced before the spawn — a request to set desired_replicas=64 is rejected with HTTP 400, not silently capped.
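The reject-rather-than-clamp behavior could be expressed as below. The exception class and function names are assumptions; a view would presumably translate the exception into the HTTP 400 response.

```python
import os


class ReplicaLimitExceeded(ValueError):
    """Raised when a request asks for more replicas than the global cap allows."""


def max_replicas_cap(environ=os.environ) -> int:
    # Default of 32 as stated in the text.
    return int(environ.get("MAX_REPLICAS", "32"))


def validate_desired_replicas(desired: int, cap: int) -> int:
    # Reject before any spawn happens; never quietly lower the request to the cap.
    if desired > cap:
        raise ReplicaLimitExceeded(
            f"desired_replicas={desired} exceeds MAX_REPLICAS={cap}"
        )
    return desired
```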

Race Conditions (Now Fixed)

A long-standing bug was that two concurrent scale events (e.g. a manual spawn/ and the AI-enhanced engine's tick) could both observe current_replicas=2, both decide to add one, and end up with replicas=4 instead of the intended 3.

The fix: every scale event acquires a SELECT … FOR UPDATE row lock on the Service row for the duration of the read-decide-write cycle. The lock is held inside a transaction.atomic() block. The classic engine and the AI-enhanced engine both use the same pattern; the admin surface uses it too. Concurrent calls serialize on the lock, so each caller reads current_replicas only after the previous holder has committed its write.
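The effect of the lock can be demonstrated with a self-contained simulation in which a threading.Lock stands in for the database row lock. In Django the equivalent would be Service.objects.select_for_update().get(pk=...) inside transaction.atomic(); the ServiceRow class and scale_to_at_least function here are hypothetical.

```python
import threading


class ServiceRow:
    def __init__(self, replicas: int):
        self.replicas = replicas
        self.lock = threading.Lock()  # stands in for SELECT ... FOR UPDATE


def scale_to_at_least(row: ServiceRow, wanted: int) -> bool:
    """Read, decide, and write under the lock; return True if this call scaled."""
    with row.lock:
        current = row.replicas        # read happens only once the lock is held
        if current >= wanted:
            return False              # a concurrent caller already did the work
        row.replicas = current + 1
        return True
```

Two callers that both want three replicas starting from two now end at three: the second caller re-reads the count after the first has written and finds nothing left to do.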

A residual race that cannot be fixed at the row level: a min_replicas change and a deploy starting at the same time. The deploy's queued_min_replicas snapshot (see Deployments) covers this case — the deploy uses the snapshot, not the live field.

Audit Log

Every scale event writes an AuditLog row with:

  • actor: the engine or admin user that triggered the event.
  • action: SCALE_UP, SCALE_DOWN, SPAWN, DESTROY_REPLICA, ALERT_FIRED.
  • target: the service name.
  • metadata: old / new replica count, the reason, and (for the AI engine) the model output that drove the decision.

The audit log is hash-linked — see models_audit.py. Manual spawn/ and destroy_replica/ calls log the calling admin's user ID.
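To illustrate what hash-linking buys, here is a generic hash chain. It is not the scheme in models_audit.py: the use of SHA-256, the canonical JSON encoding, the all-zero genesis value, and every name below are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first row


def entry_hash(prev_hash: str, entry: dict) -> str:
    # Canonical JSON so the same entry always hashes to the same value.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, entry: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"entry": entry, "prev_hash": prev, "hash": entry_hash(prev, entry)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != entry_hash(prev, row["entry"]):
            return False
        prev = row["hash"]
    return True
```

Because each row's hash covers the previous row's hash, changing one historical entry invalidates that row and every row after it.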

API Reference

All endpoints are mounted under /api/v1/scaling/. Endpoints marked (admin) below require IsAdminUser. Service-level reads require the caller to be the service owner.

Analyze a service

curl -sS -X POST http://localhost:8000/api/v1/scaling/analyze/ \
  -H "Authorization: Token $SMSLY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"service_id": "9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21"}'

Non-mutating. Returns the current state and a recommended desired_replicas.

Force-spawn a replica (admin)

curl -sS -X POST http://localhost:8000/api/v1/scaling/spawn/ \
  -H "Authorization: Token $SMSLY_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"service_id": "9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21", "count": 1}'

Bypasses cooldowns, but is still bounded by MAX_REPLICAS and Service.max_replicas.

List current replicas

curl -sS "http://localhost:8000/api/v1/scaling/replicas/?service_id=9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21" \
  -H "Authorization: Token $SMSLY_TOKEN"

Destroy a specific replica (admin)

curl -sS -X POST http://localhost:8000/api/v1/scaling/destroy_replica/ \
  -H "Authorization: Token $SMSLY_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"service_id": "9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21", "container_id": "abc123"}'

Refuses to destroy the last replica if min_replicas >= 1.
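The refusal rule reduces to a single predicate. The function name is hypothetical, and "last replica" is taken to mean that at most one replica is currently running.

```python
def can_destroy_replica(running_replicas: int, min_replicas: int) -> bool:
    """False only when the call would remove the last replica of a service
    that is configured to keep at least one."""
    is_last = running_replicas <= 1
    return not (is_last and min_replicas >= 1)
```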

Update alert config

curl -sS -X PUT http://localhost:8000/api/v1/scaling/alert_config/ \
  -H "Authorization: Token $SMSLY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "9c8b4b1a-7d1c-4a2b-9a55-2e8c3d4f9b21",
    "cpu_threshold": 75,
    "channels": ["email", "slack"],
    "slack_webhook_url": "https://hooks.slack.com/services/..."
  }'

The service owner (not just admins) can call this. slack_webhook_url is encrypted at rest and never echoed back.

Full API reference

See docs/autoscaling.md in the repository for the full alert_config schema, error codes, and the MAX_REPLICAS guard's behavior on edge cases.

Troubleshooting

"Service is at min_replicas but CPU is 100%"

Either the spike is recent and the engine has not yet acted on it, or the service is still in its scale-up cooldown. The classic engine evaluates once per minute with a 1-minute scale-up cooldown; if CPU stays at 100% for a full minute, the next tick scales the service to min_replicas + 1. To force an immediate scale-up, use the spawn/ endpoint.

"AI-enhanced engine is not running"

Check that AUTOSCALER_AI_ENABLED=True is set in .env. Then check PlatformConfig.prometheus_loki_live: both Prometheus and Loki must be reachable. The engine logs a warning and falls back to the classic path if either is down.

"Replica count is stuck at MAX_REPLICAS"

MAX_REPLICAS is a global cap. To raise it, edit .env and restart the backend. The new value is read at boot; there is no hot reload.

"Autoscaler is oscillating"

Check the cooldowns: 1 minute up, 5 minutes down. If your workload's load varies on the order of minutes, the asymmetric cooldown can still produce flapping. Lower autoscale_cpu_target so the service keeps more headroom and the scale-down threshold drops with it, or set min_replicas to the average demand and let the engine only handle spikes.

"alert_config was reset to defaults after a deploy"

Default values are written on every service create, and the engine backfills defaults for older services the first time the AI engine scales them. To override them permanently, save your values via PUT /api/v1/scaling/alert_config/.

"Manual destroy_replica fails with 'cannot destroy last replica'"

This error means Service.min_replicas >= 1 and only one replica is running. Set min_replicas=0 first, then destroy the replica.