Database Disaster Recovery Checklist
Purpose: A comprehensive checklist for preparing for, responding to, and recovering from a database disaster, with clearly assigned roles and responsibilities at each step.
Key Roles & Responsibilities:
- Ops (Operations): Day-to-day monitoring, incident management, initial response, internal communication, runbook execution, system-level fixes.
- Infra (Infrastructure): Underlying hardware, networking, virtualization, storage, cloud provider interaction, foundational services.
- DBA (Database Administrator): Database configuration, tuning, backup/restore, replication, data integrity, database-level issue resolution.
- Biz (Business/Management): Defines RPO/RTO, business impact assessment, external/customer communication strategy, incident command.
- Dev (Development): Application-level DR testing, code changes for new database endpoints, application-specific data validation.
- Security: Data encryption, access control, compliance adherence during DR.
I. PRE-DISASTER / PREPARATION PHASE
(Goal: Minimize impact, ensure readiness)
| Checklist Item | Description / Details | Primary (P) | Secondary (S) | Status/Notes |
|---|---|---|---|---|
| A. Core Database DR Strategy | ||||
| 1. Define RPO (Recovery Point Objective) & RTO (Recovery Time Objective) for critical databases. | Quantify acceptable data loss (RPO) and downtime (RTO) for each database. | Biz | DBA, Ops | |
| 2. Design DR Architecture (Active-Passive, Active-Active, etc.) | Based on RPO/RTO, design the DR site infrastructure and database replication strategy (e.g., synchronous, asynchronous, log shipping, replication groups, multi-region failover). | DBA, Infra | Ops, Dev | |
| 3. Establish DR Site / Cloud Region | Provision and configure necessary infrastructure (compute, storage, network) at the DR site or secondary cloud region. Ensure network connectivity and security. | Infra | Ops | |
| 4. Document DR Plan | Comprehensive document detailing roles, responsibilities, step-by-step procedures, communication protocols, and escalation paths for a disaster scenario. Keep it accessible offline. | DBA, Ops | Infra, Biz, Dev, Security | |
| B. Backup & Restore | ||||
| 1. Implement Backup Strategy | Define backup types (full, incremental, differential), frequency, retention policies, and data encryption for backups. Ensure integrity checks. | DBA | Ops, Security | |
| 2. Secure Offsite/Immutable Backup Storage | Store backups in a separate, geographically diverse, and ideally immutable location (e.g., S3 bucket with versioning/object lock, tape library). | Infra | DBA | |
| 3. Regular Restore Testing | Crucial: Periodically (e.g., quarterly) perform full database restores to a non-production environment from backups. Validate data integrity and application connectivity. Document successful restores. | DBA, Ops | Dev | |
| C. High Availability & Replication | ||||
| 1. Configure Database Replication/HA | Set up and verify database replication (e.g., AlwaysOn Availability Groups, PostgreSQL Streaming Replication, MySQL Group Replication) or clustering solutions (e.g., Oracle RAC) between primary and DR sites. | DBA | Infra | |
| 2. Monitor Replication Status | Implement monitoring and alerting for replication lag, synchronization status, and errors. | Ops, DBA | ||
| 3. Conduct Failover Testing | Crucial: Regularly (e.g., semi-annually) simulate a primary site failure and perform a full failover to the DR site. Test application connectivity and performance post-failover. | DBA, Ops | Infra, Dev | |
| D. Monitoring & Alerting | ||||
| 1. Implement Comprehensive Monitoring | Monitor key database metrics (CPU, memory, disk I/O, network, connections, query performance), error logs, and replication health. | Ops, DBA | ||
| 2. Configure DR-Specific Alerts | Set up immediate alerts for critical events such as primary database unavailability, high replication lag, disk full conditions, or DR site connectivity issues. | Ops, DBA | ||
| E. Documentation & Runbooks | ||||
| 1. Create/Update DR Runbooks | Detailed, step-by-step guides for failover, failback, specific recovery procedures (e.g., point-in-time recovery), and common troubleshooting scenarios. | DBA, Ops | Infra | |
| 2. Develop Communication Plan | Define internal and external communication protocols, key stakeholders, pre-approved messaging templates, and channels (email, SMS, status page). | Biz, Ops | ||
| 3. Maintain Contact Lists | Up-to-date emergency contact lists for internal teams, critical vendors, and hosting/cloud providers. | Ops, Infra | Biz | |
| F. Team Training & Drills | ||||
| 1. Conduct Regular DR Drills / Tabletop Exercises | Practice the DR plan regularly with all relevant teams (Ops, Infra, DBA, Dev, Biz). Identify gaps and areas for improvement. | Ops, Biz | All Teams | |
| 2. Cross-Train Team Members | Ensure multiple team members are proficient in DR procedures and can perform critical tasks. | Ops, DBA | Infra, Dev | |
| G. External Communication Pre-Planning | ||||
| 1. Prepare Customer Communication Templates | Draft pre-approved templates for various stages of a disaster (incident detected, recovery in progress, service restored, post-mortem). | Biz | Ops | |
| 2. Establish Hosting Provider Communication Channels | Confirm emergency contact numbers, support portals, and escalation paths with your hosting or cloud provider. Understand their SLAs for disaster response. | Infra | Ops | |
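The restore-testing item (B.3) depends on knowing that the backups you plan to restore are intact. A minimal sketch of that integrity check, assuming a hypothetical `manifest.json` of SHA-256 checksums written alongside each backup set (real tools such as pgBackRest or Percona XtraBackup ship their own verification commands):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large backup files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backups(manifest_path: Path) -> dict[str, bool]:
    """Compare each backup file against the checksum recorded at backup time.

    The manifest format ({"filename": "hex digest", ...}) is an assumption
    made for this sketch, not a standard. A False entry means the file is
    missing or has drifted from what was written -- treat it as a failed
    backup and escalate before the next restore test.
    """
    manifest = json.loads(manifest_path.read_text())
    base = manifest_path.parent
    return {
        name: (base / name).exists() and sha256_of(base / name) == digest
        for name, digest in manifest.items()
    }
```

A verification pass like this catches silent corruption in storage, but it is not a substitute for the actual restore test in B.3: only a full restore into a non-production environment proves the backup is usable.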
II. DURING DISASTER / INCIDENT RESPONSE PHASE
(Goal: Swiftly detect, mitigate, and recover services)
| Checklist Item | Description / Details | Primary (P) | Secondary (S) | Status/Notes |
|---|---|---|---|---|
| A. Detection & Initial Assessment | ||||
| 1. Detect Database Incident | Incident detected via automated alerts from the monitoring systems, or via manual observation by Ops/DBA. | Ops, DBA | ||
| 2. Verify Incident & Assess Impact | Confirm the outage/issue. Determine the scope of impact (single database, multiple, entire system). Assess potential data loss (RPO deviation) and estimated downtime (RTO impact). | Ops, DBA | Biz | |
| 3. Declare Disaster Incident | Based on severity and impact, the Incident Commander (usually from Biz or senior Ops) formally declares a disaster and initiates the DR plan. | Biz | Ops | |
| B. Activation & Communication | ||||
| 1. Activate DR Plan & Assemble Team | Engage key personnel (Ops, Infra, DBA, Biz, Dev) as per the contact list. Set up a dedicated communication channel (e.g., conference bridge, chat room). | Ops, Biz | All Teams | |
| 2. Initial Internal Notification | Inform relevant internal stakeholders (execs, sales, support) about the incident and DR activation. | Ops | Biz | |
| 3. Initial Customer Notification (if applicable) | Using pre-approved templates, inform affected customers about the incident and the ongoing recovery efforts via status page, email, or other agreed channels. Shared Responsibility: Biz provides messaging, Ops handles distribution. | Biz | Ops | |
| 4. Communicate with Hosting/Cloud Provider | Contact the provider via emergency channels. Provide details of the incident. Inquire about their platform status and assistance for underlying infrastructure issues impacting your database. Shared Responsibility: Infra takes lead, Ops provides context. | Infra | Ops | |
| C. Database Recovery Execution | ||||
| 1. Analyze Primary DB Failure / Root Cause Identification | DBA and Ops diagnose the exact cause of the database failure (e.g., hardware failure, software bug, data corruption, network partition). Shared Responsibility: Ops for system-level, DBA for database-level, Infra for underlying components. | DBA, Ops, Infra | ||
| 2. Decide Failover / Restore Strategy | Based on RPO/RTO, data loss, and primary cause, decide whether to failover to the replica or perform a full restore from backup. | DBA, Ops | Biz | |
| 3. Execute Database Failover / Restore | Follow the documented runbook: Initiate failover to DR site / Begin database restore from latest valid backup to DR environment. | DBA, Ops | Infra | |
| 4. Verify Database Consistency & Availability | After failover/restore, DBA performs integrity checks, ensures all data files are consistent, and verifies the database is fully online and accessible. | DBA | Ops | |
| 5. Update Application Connectivity | Reconfigure applications to point to the new DR database endpoint (e.g., update DNS, connection strings). | Dev, Ops | ||
| 6. Validate Application Functionality | Dev team performs end-to-end testing of critical application flows with the DR database. | Dev | Ops | |
| 7. Fix Identified Issues (Workaround/Resolution) | During recovery, if underlying issues are found (e.g., bad config, broken script, corrupt data), apply temporary workarounds or permanent fixes to stabilize the system. Shared Responsibility: Ops/DBA for database/system level, Infra for infrastructure, Dev for application code. | Ops, DBA, Infra, Dev | ||
| 8. Data Synchronization (if applicable) | If a failback is planned or partial data loss occurred, initiate processes to synchronize data from the DR site back to the original primary (once restored) or capture/apply missing transactions. | DBA | ||
| 9. Provide Regular Status Updates | Maintain a steady stream of internal and external updates as per the communication plan. Shared Responsibility: Biz for external messaging, Ops for internal and distribution. | Biz, Ops | | |
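The failover-vs-restore decision in C.2 can be sketched as a small helper that encodes the RPO/RTO trade-off. The field names and the "escalate" outcome are assumptions for this sketch; a real decision also weighs the root cause identified in C.1:

```python
from dataclasses import dataclass


@dataclass
class RecoveryInputs:
    rpo_seconds: int               # max tolerable data loss (Biz-defined RPO)
    rto_seconds: int               # max tolerable downtime (Biz-defined RTO)
    replica_healthy: bool          # is the DR replica online and consistent?
    replication_lag_seconds: int   # how far the replica trails the primary
    restore_eta_seconds: int       # estimated duration of a full restore from backup
    backup_age_seconds: int        # data loss implied by restoring the latest backup


def choose_strategy(i: RecoveryInputs) -> str:
    """Pick a recovery path per the RPO/RTO logic of item C.2.

    Prefers failover when the replica is healthy and its lag stays within
    RPO; falls back to a restore when that meets both objectives; otherwise
    escalates to Biz, who must explicitly accept an RPO/RTO deviation.
    """
    if i.replica_healthy and i.replication_lag_seconds <= i.rpo_seconds:
        return "failover"
    if i.restore_eta_seconds <= i.rto_seconds and i.backup_age_seconds <= i.rpo_seconds:
        return "restore"
    return "escalate"
```

Codifying the decision this way keeps it consistent under pressure and makes the RPO/RTO numbers from item I.A.1 directly actionable in the runbook.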
III. POST-DISASTER / RECOVERY & REVIEW PHASE
(Goal: Restore full normalcy, learn, and improve)
| Checklist Item | Description / Details | Primary (P) | Secondary (S) | Status/Notes |
|---|---|---|---|---|
| A. Restoration & Stabilization | ||||
| 1. Restore Primary Site Infrastructure | Repair or replace failed hardware/software at the original primary site. Ensure it is fully operational and healthy. | Infra | Ops | |
| 2. Decide & Execute Failback (if applicable) | If the DR site is temporary, plan and execute the failback to the restored primary site. This often involves replicating data from DR to primary, then a controlled switchover. | DBA, Ops | Infra | |
| 3. Validate Data Integrity Post-Failback | Verify data consistency and integrity after failback to the primary site. | DBA, Ops | ||
| 4. Ensure Redundancy Restored | Confirm that all high availability and replication mechanisms are fully re-established and healthy on the primary site and between primary and DR. | DBA, Ops, Infra | ||
| B. Communication & Follow-up | ||||
| 1. Final Customer Communication | Send a final notification to customers confirming full service restoration and, if appropriate, a brief summary of lessons learned or preventative measures taken. Shared Responsibility: Biz provides messaging, Ops handles distribution. | Biz | Ops | |
| 2. Close Incident with Hosting/Cloud Provider | Confirm resolution with the provider and close any open support tickets. Document their involvement and response time. Shared Responsibility: Infra leads, Ops confirms. | Infra | Ops | |
| C. Post-Mortem & Improvement | ||||
| 1. Conduct Post-Mortem / Incident Review | Hold a blameless post-mortem meeting with all involved teams. Focus on what happened, why it happened, what went well, and what could be improved. | Biz, Ops | All Teams | |
| 2. Perform Root Cause Analysis (RCA) | Deep dive into the underlying causes of the disaster and any issues encountered during recovery. | Ops, DBA, Infra, Dev | ||
| 3. Document Lessons Learned | Summarize findings from the post-mortem, including identified weaknesses in the DR plan, tools, or team preparedness. | All Teams | ||
| 4. Create & Assign Action Items for Improvement | Based on lessons learned, create concrete, actionable tasks (e.g., update runbooks, improve monitoring, patch systems, conduct more training) with assigned owners and deadlines. | Ops | DBA, Infra, Dev, Security, Biz | |
| 5. Update DR Plan & Runbooks | Incorporate all improvements and new knowledge into the formal DR plan and associated runbooks. Ensure documentation reflects the current state of infrastructure and procedures. | Ops, DBA | Infra, Dev, Security | |
| 6. Schedule Next DR Drill | Based on the post-mortem, plan the next DR drill to test implemented improvements and reinforce team readiness. | Ops | Biz | |
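Action items from C.4 only drive improvement if they are tracked against their owners and deadlines. A minimal sketch of such tracking (the structure and example items are illustrative, not prescribed by this checklist):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str        # one of the roles above: Ops, DBA, Infra, Dev, Security, Biz
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open action items past their deadline, for review at the next DR drill."""
    return [i for i in items if not i.done and i.due < today]
```

Reviewing the overdue list at each drill (item C.6) closes the loop between post-mortem findings and the updated DR plan.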
This checklist is a living document and should be reviewed and updated regularly to reflect changes in your infrastructure, applications, and business requirements.