Runbooks Index — Duck Galaxy

📋 Runbooks Index

Operator incident hub — grouped by incident type with severity triage guidance.


Open This First = Start here during an active incident · P0 = Critical / immediate · P1 = High / urgent · P2 = Medium / scheduled

🚨 Active incident? Start here. Open these three P0 runbooks in order during any platform emergency.

1. System Health → 2. Fleet Recovery → 3. Ops Runbook →

P0 Critical — platform down, data loss risk, all hands

P1 High — degraded service, urgent fix required

P2 Medium — planned ops, non-urgent investigation

★ Open This First = Primary entry point for this incident type
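The severity legend above can be read as a small decision rule. The sketch below is illustrative only: the `Severity` enum, the `classify` function, and its three boolean inputs are hypothetical names chosen here, not part of any Duck Galaxy tooling.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels as listed in the legend (labels copied from the page)."""
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"


def classify(platform_down: bool, data_loss_risk: bool, degraded: bool) -> Severity:
    """Map incident symptoms to a severity level.

    Assumption: the inputs are a simplification of the legend's wording
    (platform down / data loss risk -> P0, degraded service -> P1,
    everything else -> P2).
    """
    if platform_down or data_loss_risk:
        return Severity.P0
    if degraded:
        return Severity.P1
    return Severity.P2
```

For example, `classify(False, False, True)` yields `Severity.P1`, matching "degraded service, urgent fix required".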

🦆 Agent Fleet Incidents

Use when agents go silent, fleet health drops, or peck-protocol failures occur.

Ops Runbook

Open This First · P0

Primary operator guide for fleet emergencies. Covers beak-key rotation, agent re-registration, pulse escalation, and live fleet triage checklists. Start here when agents go silent or peck failures spike.

Open Ops Runbook →

Fleet Recovery

Open This First · P0

Step-by-step fleet restoration when agents.alive = 0. Guides re-registration, heartbeat validation, beak-key verification, and post-recovery health checks. Use when Mission Control shows zero alive agents.

Open Fleet Recovery →
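The trigger condition for Fleet Recovery (agents.alive = 0) could be checked programmatically along these lines. This is a minimal sketch: the nested dictionary shape of `metrics` and the function name `needs_fleet_recovery` are assumptions made for illustration; only the `agents.alive` field name comes from the runbook description.

```python
def needs_fleet_recovery(metrics: dict) -> bool:
    """Return True only when the metrics explicitly report zero alive agents.

    Assumption: metrics look like {"agents": {"alive": <int>}}. A missing
    value is treated as "unknown", not as zero, so it does not trigger
    recovery by itself.
    """
    alive = metrics.get("agents", {}).get("alive")
    return alive == 0
```

Treating a missing value as unknown avoids starting recovery because of a telemetry gap rather than a real fleet outage.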

🚀 Deployment & Lambda

Use for deployment drift, Lambda version mismatches, or rollback scenarios.

Deployment Log

Chronological record of all deployments, rollbacks, and infrastructure changes. Use to correlate incidents with recent deploys, verify which Lambda versions are live, and audit change history during post-mortems.

Open Deployment Log →

Lambda Versions

Live Lambda function version matrix across all environments. Use to detect version drift, verify deployment parity, and identify which function revision is active per alias. Cross-reference with Deployment Log for rollback targets.

Open Lambda Versions →
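Version drift, as described for the Lambda Versions matrix, amounts to the same function running different versions in different environments. The sketch below shows one way to detect it from an already-collected matrix. The function name `find_drift`, the matrix layout, and the sample function and environment names are hypothetical; the code makes no AWS calls.

```python
from __future__ import annotations


def find_drift(matrix: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return the functions whose version differs between environments.

    Assumption: matrix maps function name -> {environment: version}.
    A function is reported when its environments do not all share
    one version string.
    """
    return {
        function: versions
        for function, versions in matrix.items()
        if len(set(versions.values())) > 1
    }


# Hypothetical sample data, for illustration only.
sample = {
    "peck-handler": {"staging": "12", "prod": "11"},
    "pulse-monitor": {"staging": "7", "prod": "7"},
}
drifted = find_drift(sample)  # only "peck-handler" is reported
```

Any function reported here is a candidate for cross-referencing against the Deployment Log to find the intended rollback target.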

📡 System Health

Use to assess platform-wide health, SLA status, and alert thresholds before escalating.

System Health Dashboard

Open This First · P0

Real-time SLA metrics, uptime history, and performance trends. Open first during any suspected outage to establish a platform baseline. Shows latency, error rates, and 30-day uptime bars across all core services.

Open System Health →

Alerts Config

Alert thresholds, notification routing, and on-call configuration. Use to verify that alerts are firing correctly, adjust sensitivity after a noisy incident, or confirm which channels receive P0/P1 notifications.

Open Alerts Config →

🔥 Incident Management

Use to log, track, and review active or past incidents.

Incident Log

Active and historical incident tracker. Log new incidents here when escalating, track status updates in real time, and record resolution timelines. Required for any P0/P1 event to maintain the operator audit trail.

Open Incident Log →

Incident Postmortem

P2 · Documentation in progress

Structured postmortem template for P0/P1 incidents. Use after resolution to document root cause, contributing factors, timeline of events, and action items to prevent recurrence. Template in progress.


📜 Governance

Use for scheduled operator audits, compliance checks, and change management.

Governance Log

Audit trail for operator decisions, policy changes, and compliance checkpoints. Review before major releases or certification audits to verify change authority and operator accountability records.

Open Governance Log →

Ops Checklist

Pre-deployment and daily operator checklist. Run before pushing to production or during scheduled maintenance windows. Covers cert health, fleet pulse, API gateway status, and infrastructure readiness gates.

Open Ops Checklist →
