Grid — Free Open-Source PaaS for Ecosystem Deployment

Overview

The intelligence subsystem is a small set of Celery periodic tasks and a remediation engine. Unlike the chat / Senate subsystem, it does not require an LLM to be configured — the engine falls back to rule-only mode when no provider is available.

The boundary between the two layers is:

AI (see AI & Intelligence) — interactive, user-driven, LLM-billable. Throttled by AIChatRateThrottle and AIAnalysisRateThrottle. Always requires an authenticated user.
Intelligence (this page) — automated, schedule-driven, runs in the background. Does not require user input and does not hit the per-user LLM cap.

The two layers share the apps.intelligence Django app and the same provider configuration, but they are not coupled: you can disable the AI chat endpoints and still have the periodic scans run.

Periodic Tasks

There are three Celery beat schedules that power the runtime layer. They are registered when the apps.intelligence app boots.

`detect_anomalies_task` — every 3 minutes

Runs LogAnalyzer.analyze_logs() over the last 20,000 characters of each service's latest deployment logs, plus a health-status fallback. Detected patterns (CRASH_LOOP, OOM_KILLED, DB_CONNECTION_TIMEOUT, etc.) with a confidence of at least 0.9 are passed to RemediationEngine.apply_fix() for auto-remediation.

The scan walks services in paginated batches of 100. Each service is processed in its own try / except, so a single broken service does not abort the entire scan. The summary is logged at INFO and returned to the Celery result backend:

{
  "checked": 247,
  "issues_detected": 3,
  "auto_fixed": 2,
  "errors": 1
}
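The batching and per-service error isolation described above can be sketched as follows. This is a minimal illustration, not the platform's task body: run_scan, batched, and the analyze / apply_fix callables are hypothetical stand-ins, and counting a failed service under "checked" is an assumption.

```python
import logging
from itertools import islice

logger = logging.getLogger("intelligence.scan")

BATCH_SIZE = 100  # batch size stated in the text


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def run_scan(services, analyze, apply_fix, min_confidence=0.9):
    """Scan every service; a failure in one service never aborts the run.

    `analyze(service)` returns a list of (issue_type, confidence) pairs and
    `apply_fix(issue_type, service)` returns True when a fix was applied.
    Both are hypothetical callables used only for this sketch.
    """
    summary = {"checked": 0, "issues_detected": 0, "auto_fixed": 0, "errors": 0}
    for batch in batched(services, BATCH_SIZE):
        for service in batch:
            summary["checked"] += 1  # assumption: failed services still count
            try:
                issues = analyze(service)
                summary["issues_detected"] += len(issues)
                for issue_type, confidence in issues:
                    if confidence >= min_confidence and apply_fix(issue_type, service):
                        summary["auto_fixed"] += 1
            except Exception:
                # Isolate the failure, record it, and move to the next service.
                logger.exception("scan failed for service %r", service)
                summary["errors"] += 1
    logger.info("anomaly scan summary: %s", summary)
    return summary
```

The returned dictionary has the same four keys as the summary shown above, so it could be handed to the Celery result backend unchanged.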

`proactive_health_scan_task` — every 5 minutes

Walks every service with health_status='unhealthy' and calls RemediationEngine.apply_fix('HEALTH_CHECK_FAIL', service_id). The remediation action is RESTART_OR_ROLLBACK — it first attempts a container restart, then rolls back to the previous ACTIVE deployment if the restart does not bring the service back to healthy.

This is intentionally conservative: it only operates on services that are already marked unhealthy. It does not speculatively restart healthy services.
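As a rough sketch of the conservative behaviour described here, the snippet below filters on health_status and applies a restart-then-rollback sequence. The function names and the restart, is_healthy, and rollback callables are hypothetical, and the health re-check is done immediately for brevity; the real engine may defer it to a later scan.

```python
def restart_or_rollback(service, restart, is_healthy, rollback):
    """Try a restart first; roll back only if the service stays unhealthy."""
    restart(service)
    if is_healthy(service):
        return "RESTARTED"
    rollback(service)  # e.g. back to the previous ACTIVE deployment
    return "ROLLED_BACK"


def proactive_health_scan(services, remediate):
    """Call `remediate` only for services already marked unhealthy."""
    touched = []
    for service in services:
        if service["health_status"] != "unhealthy":
            continue  # healthy services are never restarted speculatively
        remediate(service)
        touched.append(service["id"])
    return touched
```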

`daily_intelligence_report_task` — 06:00 UTC

Generates a daily summary of the last 24 hours. The report covers:

Total deployments.
Failed deployments.
Success rate.
Number of anomalies detected (from AuditLog rows with actor in ['AI_REMEDIATOR', 'AI_REVIEWER']).

The report is stored as an AuditLog row with actor='AI_REPORTER', action='DAILY_REPORT', and target='SYSTEM'. Reports are immutable and form a permanent, hash-chained daily ledger.
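The aggregation behind the report could look roughly like the sketch below. It works on plain dictionaries rather than Django models; the field names created_at and status, the "FAILED" status value, and the metadata layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

ANOMALY_ACTORS = {"AI_REMEDIATOR", "AI_REVIEWER"}  # actors named in the text


def build_daily_report(deployments, audit_rows, now=None):
    """Summarise the last 24 hours as an AuditLog-shaped dictionary."""
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(hours=24)

    recent = [d for d in deployments if d["created_at"] >= since]
    total = len(recent)
    failed = sum(1 for d in recent if d["status"] == "FAILED")
    # Avoid dividing by zero on a day without deployments.
    success_rate = round(100 * (total - failed) / total, 1) if total else None
    anomalies = sum(
        1
        for row in audit_rows
        if row["created_at"] >= since and row["actor"] in ANOMALY_ACTORS
    )
    return {
        "actor": "AI_REPORTER",
        "action": "DAILY_REPORT",
        "target": "SYSTEM",
        "metadata": {
            "total_deployments": total,
            "failed_deployments": failed,
            "success_rate": success_rate,
            "anomalies_detected": anomalies,
        },
    }
```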

AI Codemap

The LogAnalyzer class is the platform's primary log-pattern recognizer. It maintains a small library of regex / heuristic patterns and a confidence score per pattern. When the configured LLM is available, ambiguous patterns are sent to the model for confirmation; the response is folded into the confidence score.

Issue	Pattern	Confidence (rule)	LLM-confirmed
OOM_KILLED	exit 137 / `Out of memory`	0.95	0.99
CRASH_LOOP	restarted >3× in 5min	0.85	0.92
DB_CONNECTION_TIMEOUT	`psycopg2.OperationalError`	0.80	0.90
BUILD_FAILURE	`npm ERR!`, `pip: command not found`	0.90	0.95
HEALTH_CHECK_FAIL	`GET /health` returning 5xx	0.90	0.95
SSL_CERT_EXPIRED	`x509: certificate has expired`	0.95	0.99
DISK_FULL	`No space left on device`	0.99	0.99
PORT_CONFLICT	`bind: address already in use`	0.95	0.97
DNS_FAILURE	`no such host`	0.85	0.92
DEPENDENCY_MISSING	`ModuleNotFoundError`	0.90	0.95
TIMEOUT	`context deadline exceeded`	0.70	0.85

Patterns with confidence below 0.9 are surfaced to the dashboard but not auto-remediated — they wait for a human approve action via the AI assistant.
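To make the rule-only path concrete, here is a small sketch of a regex pattern library with per-pattern confidence and the 0.9 auto-remediation threshold. It covers only a subset of the table, the exact regular expressions are guesses derived from the Pattern column, and the result shape is an assumption rather than the real LogAnalyzer output.

```python
import re

AUTO_REMEDIATE_THRESHOLD = 0.9  # threshold stated in the text

# (issue type, compiled pattern, rule-only confidence), taken from the table.
PATTERNS = [
    ("OOM_KILLED", re.compile(r"exit 137|Out of memory"), 0.95),
    ("DB_CONNECTION_TIMEOUT", re.compile(r"psycopg2\.OperationalError"), 0.80),
    ("DISK_FULL", re.compile(r"No space left on device"), 0.99),
    ("PORT_CONFLICT", re.compile(r"bind: address already in use"), 0.95),
    ("TIMEOUT", re.compile(r"context deadline exceeded"), 0.70),
]


def analyze_logs(logs):
    """Return one entry per matching pattern, flagged for auto-remediation."""
    issues = []
    for issue_type, pattern, confidence in PATTERNS:
        if pattern.search(logs):
            issues.append(
                {
                    "type": issue_type,
                    "confidence": confidence,
                    # Below the threshold the issue is surfaced, not fixed.
                    "auto_remediate": confidence >= AUTO_REMEDIATE_THRESHOLD,
                }
            )
    return issues
```

With this layout, an LLM confirmation step would only need to replace the confidence value before the threshold comparison.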

Anomaly Detection

Anomalies are detected in two ways:

Pattern-based (see above) — the LogAnalyzer runs over the latest deployment's build_logs plus a health-status fallback (if a service is health_status='unhealthy' for any reason and no log pattern matches, the analyzer synthesizes a CRASH_LOOP issue with confidence 0.9).
AI-enhanced (when an LLM is configured) — ambiguous patterns are sent to the model with the last 20000 chars of logs. The model's response is parsed for type, confidence, and a free-text fix recommendation.

The detect_anomalies_task walks services in batches of 100. For each service, it:

Fetches the latest deployment's build_logs (or None).
Falls back to the service's health_status if there are no logs.
Runs LogAnalyzer.analyze_logs().
For each issue with confidence >= 0.9, calls RemediationEngine.apply_fix(issue_type, service_id).
Logs the result to AuditLog.
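The per-service steps above can be sketched as two small functions. The names issues_for_service and fixable are hypothetical, the analyzer is passed in as a callable, and the issue dictionaries use an assumed type / confidence shape.

```python
LOG_TAIL_CHARS = 20000  # only the tail of the logs is analyzed


def issues_for_service(build_logs, health_status, analyze_logs):
    """Analyze the log tail, falling back to health status when nothing matches."""
    tail = (build_logs or "")[-LOG_TAIL_CHARS:]
    issues = analyze_logs(tail) if tail else []
    if not issues and health_status == "unhealthy":
        # Health-status fallback: synthesize a CRASH_LOOP issue at 0.9.
        issues = [{"type": "CRASH_LOOP", "confidence": 0.9}]
    return issues


def fixable(issues, threshold=0.9):
    """Issue types confident enough to be handed to the remediation engine."""
    return [issue["type"] for issue in issues if issue["confidence"] >= threshold]
```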

Self-Healing

The RemediationEngine knows about a fixed set of remediation actions. Each is an (action, resource, message, [amount]) tuple, where amount is optional. The action determines which side effect is triggered.

Issue	Action	Side effect
OOM_KILLED	SCALE_UP (MEMORY +256MB)	Increments `memory_mb` by 256 (capped at 2048), then triggers a re-deploy.
DB_CONNECTION_TIMEOUT	SCALE_UP_POOL	Emits an `AuditLog` with the recommended fix; does not auto-apply.
CRASH_LOOP	ROLLBACK	Finds the most recent `ACTIVE` deployment and triggers an instant-rollback.
SSL_CERT_EXPIRED	NOTIFY_ADMIN	Emits an admin notification; Caddy-managed certs auto-renew.
DISK_FULL	CLEANUP	Runs `docker system prune -f`. Gated by `explicit_admin=True`.
PORT_CONFLICT	RESTART	Issues a Docker restart on the running container.
DNS_FAILURE	NOTIFY_ADMIN	Emits an admin notification.
DEPENDENCY_MISSING	REBUILD	Triggers a fresh deploy (with cache invalidation).
BUILD_FAILURE	NOTIFY_AND_DIAGNOSE	Generates an AI diagnosis and writes it to `Deployment.ai_diagnosis`.
TIMEOUT	SCALE_UP (REPLICAS +1)	Increments `min_replicas` by 1, then triggers a re-deploy.
HEALTH_CHECK_FAIL	RESTART_OR_ROLLBACK	First attempts a Docker restart; if the service is still unhealthy on the next scan, triggers a rollback.
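One way to picture the tuple-based mapping is the sketch below, which models a few rows of the table as named tuples. The resource names other than MEMORY and REPLICAS, the message strings, and the helper functions are illustrative assumptions; only the amounts and the 2048 MB cap come from the table.

```python
from typing import NamedTuple, Optional


class Remediation(NamedTuple):
    action: str
    resource: str
    message: str
    amount: Optional[int] = None  # only scaling actions carry an amount


MEMORY_CAP_MB = 2048  # cap stated in the table

REMEDIATIONS = {
    "OOM_KILLED": Remediation("SCALE_UP", "MEMORY", "Increase memory", 256),
    "TIMEOUT": Remediation("SCALE_UP", "REPLICAS", "Add a replica", 1),
    "CRASH_LOOP": Remediation("ROLLBACK", "DEPLOYMENT", "Roll back to last ACTIVE deployment"),
    "DISK_FULL": Remediation("CLEANUP", "DISK", "Prune unused Docker data"),
}


def plan_fix(issue_type):
    """Look up the remediation for an issue, or None if it is unknown."""
    return REMEDIATIONS.get(issue_type)


def scaled_memory(current_mb, fix):
    """Apply a MEMORY scale-up without exceeding the cap."""
    return min(current_mb + fix.amount, MEMORY_CAP_MB)
```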

The engine is cooldown-aware for auto-deploys. After a remediation triggers a re-deploy, no further re-deploy is triggered for the same service within the cooldown window (AUTO_DEPLOY_COOLDOWN_MINUTES = 10). This prevents oscillation when the same issue repeats in consecutive scans.
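The cooldown check itself is simple; a minimal sketch, assuming the engine keeps a timestamp of the last auto-triggered deploy per service (the last_auto_deploy_at argument and the function name are hypothetical):

```python
from datetime import datetime, timedelta, timezone

AUTO_DEPLOY_COOLDOWN_MINUTES = 10  # value stated in the text


def can_auto_deploy(last_auto_deploy_at, now=None):
    """True when no auto-deploy happened inside the cooldown window."""
    if last_auto_deploy_at is None:
        return True  # the service was never auto-deployed
    now = now or datetime.now(timezone.utc)
    cooldown = timedelta(minutes=AUTO_DEPLOY_COOLDOWN_MINUTES)
    return now - last_auto_deploy_at >= cooldown
```

With scans running every 3 and 5 minutes, a 10-minute window means several consecutive scans see the same issue without acting on it twice.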

The Service.last_scale_at field (see Autoscaling) is also respected: any scaling action updates last_scale_at, and the autoscaler's 1-minute cooldown is applied across all scale changes.

The `explicit_admin` Gate

Two of the side effects — CLEANUP (the docker system prune call) and certain ad-hoc notifications — are destructive or external. The engine refuses to run them unless explicit_admin=True is passed to apply_fix(). The proactive scan (proactive_health_scan_task) and the anomaly scan (detect_anomalies_task) never pass explicit_admin=True. Only the admin's manual "fix this now" action in the UI (or a direct API call) can trigger these actions.

This is enforced at the engine level: the CLEANUP action returns False immediately if explicit_admin is falsy, and logs a warning. The platform's docker system prune cache key (docker_prune:<server_id>) is set after a successful prune and prevents re-running the command within 24 hours.
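The gate and the 24-hour cache key can be sketched together. TTLCache below is a tiny in-memory stand-in for Django's cache, and cleanup and its prune callable are hypothetical names; only the key format, the cooldown constant name, and the explicit_admin behaviour come from the text.

```python
import logging
import time

logger = logging.getLogger("intelligence.remediation")

DOCKER_PRUNE_COOLDOWN_SECONDS = 24 * 60 * 60  # 24 hours


class TTLCache:
    """In-memory stand-in for a cache with per-key expiry (sketch only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, timeout):
        self._data[key] = (value, self._clock() + timeout)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expired entries behave as missing
            return default
        return value


def cleanup(server_id, cache, prune, explicit_admin=False):
    """Run `prune` only for explicit admin requests outside the cooldown."""
    if not explicit_admin:
        logger.warning("CLEANUP refused for %s: explicit_admin not set", server_id)
        return False
    key = f"docker_prune:{server_id}"
    if cache.get(key) is not None:
        return False  # a prune already ran within the last 24 hours
    if not prune():
        return False
    # The key is written only after a successful prune.
    cache.set(key, time.time(), DOCKER_PRUNE_COOLDOWN_SECONDS)
    return True
```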

Service-Locking

The engine uses SELECT … FOR UPDATE on the Service row before applying any fix. This is necessary because the anomaly scan and the proactive scan can run concurrently on the same service — without a row lock, two scans could observe the same issue and each trigger a re-deploy, doubling the remediation work. The lock is held for the duration of the apply_fix() call.

Security

Docker Pruning Is Gated

docker system prune -f removes all stopped containers, dangling images, unused networks, and unused build cache. The action is destructive: any in-flight BUILDING deployment that depends on a removed image will fail.

The engine only runs the prune when explicit_admin=True AND no prune has been issued for the same server_id in the last 24 hours. The 24-hour cooldown is enforced via cache.set("docker_prune:<server_id>", now, DOCKER_PRUNE_COOLDOWN_SECONDS).

The prune is run via subprocess.run(['docker', 'system', 'prune', '-f'], timeout=30, check=True). The 30-second timeout prevents a stuck prune from blocking the Celery worker. The check=True ensures a non-zero exit is raised as an exception, which the engine catches and turns into a False return.
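A minimal wrapper around that call might look like the following. The function name run_prune and the capture_output flag are additions for this sketch; the command, the 30-second timeout, and check=True are as described above.

```python
import subprocess

PRUNE_COMMAND = ["docker", "system", "prune", "-f"]


def run_prune(command=PRUNE_COMMAND, timeout=30):
    """Run the prune command and report success as a boolean."""
    try:
        # check=True raises CalledProcessError on a non-zero exit code;
        # timeout raises TimeoutExpired if the command hangs.
        subprocess.run(command, timeout=timeout, check=True, capture_output=True)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # OSError also covers a missing docker binary (FileNotFoundError).
        return False
    return True
```

Taking the command as a parameter keeps the wrapper testable on machines without Docker installed.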

Audit Trail

Every remediation action writes an AuditLog row with the following fields:

actor = "AI_REMEDIATOR".
action is the action name (SCALE_UP, CLEANUP, REBUILD, RESTART, NOTIFY_ADMIN, …).
target is the service name (or "SYSTEM" for platform-wide actions like CLEANUP).
metadata includes the old / new values, the reason, and any side-effect details (e.g. PR URLs from Jules).

Because the chain is hash-linked (see AuditLog.save() in models_audit.py), the audit trail cannot be tampered with retroactively.
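For readers unfamiliar with hash-linked logs, the idea can be shown in a few lines. This sketch is not the code in models_audit.py: the SHA-256 choice, the JSON serialization, the genesis value, and the function names are all assumptions used to illustrate the general technique.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first row


def row_hash(row, prev_hash):
    """Hash a row together with the hash of the row before it."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_row(chain, row):
    """Append a row, linking it to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"row": row, "prev_hash": prev_hash, "hash": row_hash(row, prev_hash)})


def verify(chain):
    """Recompute every link; any edited or reordered row breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != row_hash(entry["row"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, changing an old row would require recomputing every later hash, which is what makes retroactive edits detectable.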

API Reference

The intelligence layer exposes a small set of API endpoints. The GET endpoints are read-only, and the POST endpoints run an on-demand estimate or analysis. There are no write endpoints for periodic tasks (they run on Celery beat); the only user actions are "view" and "trigger scan".

Endpoint	Method	Purpose
/api/v1/ai/report/	GET	Most recent `DAILY_REPORT` row.
/api/v1/ai/anomalies/	GET	Last 100 `AuditLog` rows with `actor in ['AI_REMEDIATOR', 'AI_REVIEWER']`.
/api/v1/ai/cost-estimate/	POST	Estimates the LLM cost of a proposed prompt (no actual call).
/api/v1/ai/analyze/	POST	Run a one-shot log analysis on a deployment.
/api/v1/jules/history/{service_id}/	GET	Returns the auto-fix history for a service.

These endpoints are throttled by AIAnalysisRateThrottle (10/minute per user). They are not admin-gated; any authenticated user can read the report and anomaly history.

Troubleshooting

"Daily intelligence report did not generate at 06:00 UTC"

The most likely cause is that Celery beat is not running. A non-UTC TIME_ZONE does not move the schedule: the task is registered with a fixed 06:00 UTC cron, so the report is still generated at 06:00 UTC and stored under the UTC date, which may differ from the local time and date you expected. Verify that the task is importable with:

docker exec smsly-hosting-backend-1 python manage.py shell \
  -c "from apps.intelligence.tasks import daily_intelligence_report_task; print(daily_intelligence_report_task)"

"Anomaly scan returned 0 issues_detected across 247 services"

This can be normal if the platform has been running quietly. If you expect issues, check the platform's log shipping — Loki / Promtail must be running, and the prometheus_loki integration must be configured for the LogAnalyzer to receive runtime log streams.

"RemediationEngine refused to run docker system prune"

explicit_admin was not passed. The scan path never passes it. Trigger the prune manually from the UI (Settings → Storage → Cleanup) or call RemediationEngine().apply_fix('DISK_FULL', service_id, explicit_admin=True) from a Django shell.

"Remediation triggered a re-deploy that is still running. The next scan skipped the service."

This is the auto-deploy cooldown. After a remediation triggers a re-deploy, the same service will not be re-deployed by the engine for 10 minutes. Wait for the cooldown, or for the active deploy to reach a terminal state.

"I disabled AI in Settings but the periodic scans still run"

That is expected. The intelligence scans do not require an LLM — they fall back to rule-only mode when no provider is configured. To disable the scans entirely, set INTELLIGENCE_DISABLED=True in .env and restart the backend and beat scheduler.
