✅ Ops Checklist

Pre-deploy, rollback, and incident-triage checklists. State persists across sessions.

Checklist state auto-saved to localStorage.
Pre-Deploy
0 / 10
Rollback
0 / 8
Incident Triage
0 / 9
All Checklists

🚀 Pre-Deploy Checklist

Complete before promoting any Lambda version to prod or deploying frontend changes.
Deployment
Confirm lambda_alias_version matches deployed function version. Check alias drift card on Mission Control.
Mission Control → Alias Drift ↗
Fetch the status endpoint and verify response is valid JSON with lambda.version, database, and agents keys present.
System Health ↗
A dead fleet during deployment may mask downstream failures. Check Agent Fleet tile on Mission Control before promoting.
Mission Control → Agent Fleet ↗
Check Peck Failure Analysis on Mission Control. A high failure rate may indicate a callback issue that would worsen post-deploy.
Peck Audit ↗
Check Audit Activity strip on Mission Control. Look for abnormal counts of peck_notify, peck_requested, or duck.cert_issued events in the last 30 minutes.
Deployment can trigger transactional emails. If quota is near 160/200 (amber), defer to off-peak or notify T-JOSH.
Mission Control → SES Quota ↗
Check the Change Freeze banner on Mission Control. If a freeze is active, deployment is visually blocked until the freeze is cleared or explicitly overridden with T-JOSH approval.
Mission Control → Change Freeze ↗
Invoke the function with test payload and confirm no cold-start errors. Review CloudWatch logs if cold starts exceed 2,000ms.
Document: version number, what changed, timestamp, and operator. This feeds the deployment-log.html surface and incident handoff bundles.
Deployment Log ↗
All Lambda alias promotions require T-JOSH gate. Record the approval timestamp and approval method before executing the promotion.
Governance Log ↗

⏪ Rollback Checklist

Steps to safely roll back a Lambda alias to a previous known-good version.
P0-Ready
Note the current lambda.version and lambda_alias_version values. This is the version being rolled back from.
Mission Control → Alias Drift ↗
Check the deployment history on Mission Control or deployment-log.html. Note the prior Lambda version number that was stable.
Deployment Log ↗
Use the Incident Mode toggle on Mission Control to freeze the dashboard and flag the active incident with a UTC start time. This prevents refresh races during rollback.
Mission Control → Incident Mode ↗
Rollback is a governance action. Record the approval in GOVERNANCE-LOG.md with the target version, reason, and approver before running the AWS CLI command.
Governance Log ↗
Replace <prev-version> with the known-good version number identified in step 2. Do not use $LATEST for prod alias.
Refresh Mission Control and confirm the Alias Drift card shows the expected version with no drift. The /beak/system/status lambda_alias_version should match the target.
Mission Control ↗
After rollback, verify core API routes respond correctly. Check system-health.html for green status across all SLA metrics.
System Health ↗
Document: reverted version, target version, reason for rollback, approver, and timestamp. Exit incident mode on Mission Control after verification.
Incident Log ↗

🚨 Incident Triage Checklist

First-response steps for any platform incident. Work top-to-bottom.
P0 Response
Go to Mission Control and toggle Incident Mode. This freezes the dashboard, highlights drift and failure cards, and records a UTC start time to localStorage.
Mission Control ↗
Mission Control Anomaly Summary surfaces the top operator issues (alias drift, dead agents, SES/SNS sandbox, peck callback errors). Record what's flagged before taking action.
Mission Control → Anomaly Summary ↗
The health score (0–100) on Mission Control penalises dead agents, stale snapshots, peck failures, and sandbox blockers. Score <60 = red alert. Investigate the highest-penalty item first.
System Health ↗
Verify /beak, /beak/metrics, and /beak/system/status each return 200. A route returning 502 or 504 indicates a Lambda cold start or crash loop.
Mission Control → Route Health ↗
A total fleet outage means no spaceducks are pulsing. This prevents peck approvals and cert issuance. Follow INC-003 in the ops runbook.
Ops Runbook → INC-003 ↗
Different failure types require different responses. callback_error spikes indicate webhook delivery issues. denied spikes may indicate auth config drift. expired spikes may indicate slow response times.
Peck Audit ↗
Use the "Copy Diagnostics JSON" or "Copy Incident Snapshot" action on Mission Control to capture platform state (lambda version, pipeline counts, sandbox state) with a UTC timestamp.
Mission Control → Diagnostics ↗
Any destructive action (rollback, cert revocation, connection termination) requires T-JOSH gate. Do not execute without approval captured in GOVERNANCE-LOG.md.
Governance Log ↗
After resolution, document: severity (P0/P1/P2), incident start time, detection time, resolution time, affected components, root cause, and corrective actions. Exit incident mode on Mission Control.
Incident Log ↗ Incident Postmortem ↗