📋 Runbooks Index

Operator incident hub — grouped by incident type with severity triage guidance.

Open This First = Start here during an active incident · P0 = Critical / immediate · P1 = High / urgent · P2 = Medium / scheduled
🚨 Active incident? Start here. During any platform emergency, open the three P0 runbooks marked Open This First, in order.
P0 Critical — platform down, data loss risk, all hands
P1 High — degraded service, urgent fix required
P2 Medium — planned ops, non-urgent investigation
★ Open This First: primary entry point for this incident type
🦆 Agent Fleet Incidents

Use when agents go silent, fleet health drops, or peck-protocol failures occur.
Ops Runbook
Open This First · P0

Primary operator guide for fleet emergencies. Covers beak-key rotation, agent re-registration, pulse escalation, and live fleet triage checklists. Start here when agents go silent or peck failures spike.

Open Ops Runbook →
Fleet Recovery
Open This First · P0

Step-by-step fleet restoration when agents.alive = 0. Walks through re-registration, heartbeat validation, beak-key verification, and post-recovery health checks. Use when Mission Control shows zero alive agents.

Open Fleet Recovery →
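
As a rough illustration of the agents.alive = 0 condition, the sketch below counts agents whose last heartbeat falls inside a freshness window. The Heartbeat record, the 90-second window, and both function names are hypothetical and are not taken from the Fleet Recovery runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical freshness window; the real threshold is defined by the platform.
HEARTBEAT_WINDOW = timedelta(seconds=90)


@dataclass(frozen=True)
class Heartbeat:
    agent_id: str
    last_seen: datetime  # timezone-aware UTC timestamp


def count_alive(heartbeats, now, window=HEARTBEAT_WINDOW):
    """Return how many agents reported within `window` of `now`."""
    return sum(1 for hb in heartbeats if now - hb.last_seen <= window)


def needs_fleet_recovery(heartbeats, now):
    """True when no agent has a fresh heartbeat (agents.alive = 0)."""
    return count_alive(heartbeats, now) == 0
```

An empty heartbeat list also counts as zero alive agents, so a fleet that has never registered is treated the same as one that has gone silent.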

🚀 Deployment & Lambda

Use for deployment drift, Lambda version mismatches, or rollback scenarios.
Deployment Log
P1

Chronological record of all deployments, rollbacks, and infrastructure changes. Use to correlate incidents with recent deploys, verify which Lambda versions are live, and audit change history during post-mortems.

Open Deployment Log →
Lambda Versions
P1

Live Lambda function version matrix across all environments. Use to detect version drift, verify deployment parity, and identify which function revision is active per alias. Cross-reference with Deployment Log for rollback targets.

Open Lambda Versions →
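
Version drift can be pictured as any function whose environments disagree on the deployed version. The following is a minimal sketch under that assumption; the matrix shape, the function names, and the environment names are illustrative placeholders, not the platform's actual data.

```python
def find_version_drift(matrix):
    """Return {function: {env: version}} for functions whose environments disagree."""
    drift = {}
    for function, versions_by_env in matrix.items():
        # More than one distinct version across environments means drift.
        if len(set(versions_by_env.values())) > 1:
            drift[function] = dict(versions_by_env)
    return drift


# Hypothetical version matrix: function name -> environment -> published version.
example_matrix = {
    "register-agent": {"dev": "14", "staging": "14", "prod": "12"},
    "pulse-check": {"dev": "7", "staging": "7", "prod": "7"},
}

drifted = find_version_drift(example_matrix)
```

Here only the hypothetical register-agent function is reported, because prod lags dev and staging; the Deployment Log would then be the place to look up a rollback target.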

📡 System Health

Use to assess platform-wide health, SLA status, and alert thresholds before escalating.
System Health Dashboard
Open This First · P0

Real-time SLA metrics, uptime history, and performance trends. Open first during any suspected outage to establish a platform baseline. Shows latency, error rates, and 30-day uptime bars across all core services.

Open System Health →
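
For orientation, the arithmetic behind a 30-day uptime figure and an error rate can be sketched as follows. This assumes downtime is tracked in minutes and errors as a simple count over requests; how the dashboard actually computes its numbers is not specified here, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, window_days=30):
    """Percentage of the window during which the service was up."""
    total = window_days * MINUTES_PER_DAY
    if not 0 <= downtime_minutes <= total:
        raise ValueError("downtime must be between 0 and the window length")
    return 100.0 * (total - downtime_minutes) / total


def error_rate(error_count, request_count):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if request_count == 0:
        return 0.0
    return error_count / request_count
```

A 30-day window holds 43,200 minutes, so roughly 43 minutes of downtime corresponds to about 99.9% uptime.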
Alerts Config
P1

Alert thresholds, notification routing, and on-call configuration. Use to verify that alerts are firing correctly, adjust sensitivity after a noisy incident, or confirm which channels receive P0/P1 notifications.

Open Alerts Config →
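
Confirming which channels receive P0/P1 notifications amounts to a lookup against a routing table. The sketch below assumes a simple severity-to-channels mapping; the channel names and the table itself are hypothetical placeholders, not the platform's real configuration.

```python
# Hypothetical routing table; channel names are placeholders only.
ROUTES = {
    "P0": ("pager", "ops-chat"),
    "P1": ("ops-chat",),
    "P2": ("email-digest",),
}


def channels_for(severity, routes=ROUTES):
    """Return the channels that should receive an alert of this severity."""
    try:
        return routes[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def unrouted_severities(routes, required=("P0", "P1")):
    """List required severities that have no channel configured."""
    return [sev for sev in required if not routes.get(sev)]
```

Running unrouted_severities against the live table after a configuration change is one way to catch a P0 or P1 severity that silently lost its channel.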

🔥 Incident Management

Use to log, track, and retrospect on active or past incidents.
Incident Log
P1

Active and historical incident tracker. Log new incidents here when escalating, track status updates in real time, and record resolution timelines. Required for any P0/P1 event to maintain the operator audit trail.

Open Incident Log →
Incident Postmortem
P2 · Documentation in progress

Structured postmortem template for P0/P1 incidents. Use after resolution to document root cause, contributing factors, timeline of events, and action items to prevent recurrence. Template in progress.

Documentation in progress

📜 Governance

Use for scheduled operator audits, compliance checks, and change management.
Governance Log
P2

Audit trail for operator decisions, policy changes, and compliance checkpoints. Review before major releases or certification audits to verify change authority and operator accountability records.

Open Governance Log →
Ops Checklist
P2

Pre-deployment and daily operator checklist. Run before pushing to production or during scheduled maintenance windows. Covers cert health, fleet pulse, API gateway status, and infrastructure readiness gates.

Open Ops Checklist →
Runbooks Index · Space Duck Operator Hub