Business Continuity and Disaster Recovery
Cyber Resilience

Business Continuity
• Business Continuity Plan
― A document describing how an organization responds to events to ensure critical business functions continue without unacceptable delay or change
• Business Continuity Planning
― The ability to maintain the constant availability of critical systems, applications and information across the enterprise
― Creation of policies & procedures to minimize impact of events
― Helps organizations
▪ Identify impact of potential data processing operational disruptions and data loss
▪ Formulate and implement viable recovery plans to ensure the availability of data processing support for critical applications, data and services
▪ Develop, implement and administer a comprehensive BCP training, testing and maintenance program

Disaster Recovery
• Disaster Recovery Plan
― A documented plan that provides detailed procedures to facilitate recovery of capabilities at an alternate site. It is usually limited to major disruptions with long-term effects
• Disaster Recovery Planning
― Disaster recovery refers to the immediate and temporary restoration of critical computing and network operations after a natural or man-made disaster within defined timeframes. An organization documents how it will respond to a disaster and resume the critical business functions within a predetermined period of time; minimize the amount of loss; and repair (or replace) the primary facility to resume data processing support

BCMS Policy
• Statement of intent, purpose, objectives and external compliance requirements
• Scope
• Risk assessment criteria
• Classification of risks: acceptable vs tolerable vs intolerable
• Rules for mitigating and monitoring risks including timeframes
• Assignment of responsibility for developing, maintaining and testing the BCMS
• A process for providing assurance to the Board as to the adequacy of controls, plans and resources

Relationships of Emergency Action Plans
[Figure: emergency action plans arranged by area (Business, IT, Facilities) and by phase (Protect, Sustain, Recover / Resume), with a "Major Impact" marker. Plans shown: Continuity of Operations Plan, Occupant Emergency Plan, Cyber Incident Response Team, Crisis Communications Plan, IT Contingency Plan, Disaster Recovery Plan, Business Recovery Plan.]
Source: NIST SP 800-34, "Contingency Planning Guide for Information Technology Systems"

The BCP Process
• Project Initiation
• Business Impact Analysis
• Selection of Recovery Strategies
• Plan Design and Development
• Testing, Maintenance and Awareness

The BCP Process
Business Continuity Life Cycle – Business Continuity Institute Good Practice Guide 2018

Phase 1 – Project Initiation
• Establish need for a BCP (with a threat and risk assessment, TRA, if necessary)
• Obtain management support
• Identify resources to ensure the BCP matches overall business & technology plans
• Establish the team: both business and technical specialists
• Select business continuity planner/coordinator
• Establish project management work plan
― Project scope, objectives, methods for organizing and managing the BCP, related tasks and responsibilities, resources, timeline, budget estimates
• Determine need for automated data collection tools
• Schedule meetings
• Initial and ongoing reports and presentations to management

BCMS Planner/Coordinator
• Is the central person and project leader
• Liaison between project team and management
• Knows the organization
• Must be able to balance needs of individual business units
• Has easy access to executive management
• Understands the charter, mission statement and executive viewpoints
• Has credibility and influence with senior management

Phase 2 – Business Impact Analysis
• Management-level analysis
― focused on business process interruption rather than asset value
― conduct at the same time as TRA for efficiency, where possible
• Establishes Maximum Tolerable Downtime (MTD) for each time-critical business support resource
• Formally agreed with executive-level management
• Passed to business units, IT/network, and BCP team for use in planning
• Must quantify the financial loss from each day of downtime
― direct loss
― reputation / embarrassment / loss of public confidence
― additional expenses
― loss of competitive advantage
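A minimal sketch of the per-day loss calculation, assuming the four loss categories above can each be given a daily figure and that loss grows linearly with downtime. The class name, function name and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DailyLoss:
    """Estimated loss per day of downtime, in one currency (illustrative categories)."""
    direct: float
    reputation: float
    additional_expenses: float
    competitive: float

    def per_day(self) -> float:
        # Sum of the four categories listed in the BIA.
        return self.direct + self.reputation + self.additional_expenses + self.competitive


def outage_loss(loss: DailyLoss, days_down: float) -> float:
    """Linear estimate: total loss is proportional to days of downtime (an assumption)."""
    return loss.per_day() * days_down


# Hypothetical figures for an order-processing function.
orders = DailyLoss(direct=50_000, reputation=10_000, additional_expenses=5_000, competitive=2_000)
print(outage_loss(orders, days_down=3))
```

In practice losses often grow faster than linearly as an outage drags on, so a real BIA may need a per-day schedule rather than a single rate.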

Terminology
• Maximum Tolerable Downtime (MTD)
― Maximum period business can tolerate a critical business system being unavailable
― Also known as Maximum Allowable Outage (MAO)
• Recovery Time Objective (RTO)
― Time period required to fully reestablish business resource requirements, as documented in plans
• Recovery Point Objective (RPO)
― How much data enterprise is willing to recreate (or lose) following a disaster
▪ Essentially, the maximum time gap between live production system data and the offsite backup or data at the alternate site
▪ RPO of less than 24 hours generally means real-time offsite data replication
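The three terms constrain each other: the RTO has to fit inside the MTD, and the interval between offsite copies has to fit inside the RPO. A small sketch of that consistency check follows; the function name and the sample targets are hypothetical.

```python
from datetime import timedelta


def check_objectives(mtd, rto, rpo, offsite_copy_interval):
    """Return a list of problems; an empty list means the targets are consistent."""
    problems = []
    if rto > mtd:
        problems.append("RTO exceeds MTD: recovery would finish too late")
    if offsite_copy_interval > rpo:
        problems.append("offsite copy interval exceeds RPO: too much data at risk")
    return problems


# Hypothetical targets for one system: nightly offsite backup but a 4-hour RPO.
print(check_objectives(
    mtd=timedelta(hours=24),
    rto=timedelta(hours=8),
    rpo=timedelta(hours=4),
    offsite_copy_interval=timedelta(hours=24),
))
```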

Purpose of a BIA
• Help management understand potential impact
• Identify critical business functions and associated systems
• Identify internal and external dependencies for CBFs
• Identify staff concerns
• Analyze impact of outage
• Determine recovery windows for each business function

Steps Involved in a BIA
• Decide on information gathering techniques
― Surveys, interviews, group discussions, software tools, etc
• Select respondents
• Design survey instrument to gather economic and operational impact data
― Both qualitative and quantitative questions
• Analyze data
• Determine time-critical business functions
• Determine MTDs
• Prioritize the restoration of critical business functions based on these MTDs
• Document and report recommendations
• See SAI Global HB292 and NIST SP 800-34v1 for form templates
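The prioritization step can be sketched as a sort: the function with the shortest MTD is restored first. The function names and MTD values below are hypothetical.

```python
def restoration_order(mtd_hours):
    """Return business functions ordered by ascending MTD (most urgent first)."""
    return sorted(mtd_hours, key=mtd_hours.get)


# Hypothetical MTDs, in hours, as agreed in the BIA.
mtds = {"payroll": 72, "order entry": 4, "email": 24}
print(restoration_order(mtds))
```

A real plan would also account for dependencies, since a function cannot come back before the systems it relies on.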

BIA Questionnaire Topics
• Description of business functions
― Name, size, hours of operation, number of employees and customers, critical timeframes, heaviest volumes, regulatory requirements
• Contact name and date
• Business process
• Financial impacts
• Operational impacts
• Legal & compliance impacts
• Damage to reputation
• Technological dependence
• Interdependencies with other units
• Existing BCP and alternate processing options

Phase 3 – Development of Recovery Strategies
• Must meet MTDs agreed in previous phase
• Identification of resource requirements
― Personnel
― Equipment
― Funds
― etc.
• Identification of alternatives available during recovery
― Business recovery
― Facility and supply recovery
― User recovery
― Technical recovery
― Data recovery

Business Recovery
• Identify
― Critical business functions
― Critical IT system requirements
― Connectivity requirements
― Ability to work at home
― Office space requirements
― Key personnel requirements
― Mail redirection, voice & data connections
― Interdependence with other units
― Off-site storage
― Vendor services

Facility Recovery
• Identify
― Minimum space – work areas, conference rooms, etc.
― Space needs for less critical resources
― Security needs at recovery sites
― Fire protection needs
― Furnishings, office equipment
― Infrastructure – HVAC, power, water, emergency power and UPSs
― Office supplies – stationery, etc.
― Transport requirements

User Recovery
• Deals with personnel issues and manual procedures
― If automated procedures revert to manual, must keep records for eventual transfer to automated systems
• Identify
― Processes suitable for manual reversion
― How to deal with lost transactions
― Record storage requirements
▪ Off-site manuals, documentation, forms, etc.
― Special needs:
▪ Housing
▪ Meals, incidentals
▪ Communication with family/friends
― Notification procedures

Technical Recovery
• Data centre recovery
― After an incident, assess whether the primary data centre can be put back into operation within the MTD
― If so, restore the primary data centre; if not, activate the DRP
• Network & Datacomms
― Telephone services
▪ Lines, PBX (if you still have one), voicemail, fax, etc.
― WAN connections
― LAN components
▪ Computers, cabling, power, switches, routers, etc.
― Physical security systems
▪ CCTV, motion detectors, lighting, access controls
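The data centre decision above comes down to one comparison between the estimated repair time and the MTD. A minimal sketch, with a hypothetical function name and hypothetical figures:

```python
def recovery_decision(estimated_repair_hours, mtd_hours):
    """Choose between restoring the primary data centre and activating the DRP."""
    if estimated_repair_hours <= mtd_hours:
        return "restore primary"
    return "activate DRP"


# Hypothetical assessment: repairs need 36 hours but the MTD is 24 hours.
print(recovery_decision(estimated_repair_hours=36, mtd_hours=24))
```

The repair estimate is the uncertain input here, so a cautious team might add a safety margin before comparing.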

IT Disaster Recovery Strategies
• Five basic categories:
― Hot standby
― Warm recovery
― Cold recovery
― Resilience and Redundancy
― Mobile recovery

Hot Standby
• Mirror sites
― A mirror site is fully operational and processing transactions in parallel with the primary site
▪ Needs high-speed network connection
• Mirrored data
▪ Replicated data, replicated hardware, replicated software
▪ Provides full redundancy
▪ Rapid/automated failover
― Normally in-house
― Complex
― Synchronous data replication with transactions may require sites to be no more than 25 – 50 km apart
― Recovery time from a few seconds with full automation to a few hours with manual intervention
― RPO / data loss: 1 second or less
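One plausible reason for the distance limit on synchronous replication is propagation delay, since each synchronous write waits at least one round trip to the remote site. The sketch below estimates that delay, assuming light in optical fibre covers roughly 200 km per millisecond (about two thirds of its speed in a vacuum). The function name and the route_factor parameter are hypothetical; real cable routes are longer than the straight-line distance, and equipment adds further delay.

```python
FIBRE_KM_PER_MS = 200.0  # approximation: distance light travels in fibre per millisecond


def round_trip_ms(distance_km, route_factor=1.0):
    """Propagation-only round-trip time; route_factor scales for indirect cable routes."""
    return 2 * distance_km * route_factor / FIBRE_KM_PER_MS


for km in (25, 50, 500):
    print(km, "km:", round_trip_ms(km), "ms per synchronous write (propagation only)")
```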

Warm Recovery
• All required operating systems and applications loaded and patched up to date but not running
• Offline – data must be restored from backups
• Machines often used for dev/test in order to get return on otherwise unused hardware
• RTOs generally over 4 hours, commonly 8 – 12 hours, due to time required for data restores

Mainframe Hot Site
• Provided by the hardware vendor (which means IBM, these days)
• Is really warm recovery, despite the name
• Fully configured with all customer-required hardware and software
― Can be operational in a few hours
• Often multiple sites available
• Contract includes annual test time
• Expensive
• Contention for site in event of regional disasters

Cold Recovery
• Site has no equipment or resources, except HVAC, comms links, raised floors
• Available for longer periods of time
• Least expensive
• Appropriate when a vendor wouldn’t have the equipment you need anyway
• But requires most time to get into operation
• Operational testing is not possible

Mobile Sites
• A mobile site is a trailer or sea cargo shipping container full of kit
• Based on “Container Data Center” modules
• Park in your parking lot and go to work
• No relocation
• Operational testing may not be possible
• Will take some time to get into place

Cloud and Resilience
• Cloud offers distribution of applications across multiple zones in multiple regions
• Ensure enterprise architecture requirements include distribution across multiple regions to guard against large outages
― Amazon customers suffered 72 hours of no service when their US North-Eastern region went down
• Use SaaS (Software as a Service) for high resilience email, calendar, etc.
― But Internet connectivity then remains a problem
• Re-architect business applications on PaaS (Platform as a Service) or containers
― Provides automatic load balancing, etc.
• IaaS (Infrastructure as a Service – i.e. conventional virtual machines) requires a lot of customer work to achieve resilient design
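As a rough illustration of multi-region distribution, the sketch below picks the first healthy region from a preference list. The region names and the health map are hypothetical; a real deployment would rely on the cloud provider's own health checks and traffic routing rather than hand-rolled logic.

```python
def pick_region(preferred, health):
    """Return the first healthy region in preference order, or None if all are down."""
    for region in preferred:
        if health.get(region, False):  # unknown regions are treated as unhealthy
            return region
    return None


# Hypothetical state: the primary region is down, the secondary is up.
health = {"region-a": False, "region-b": True}
print(pick_region(["region-a", "region-b"], health))
```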

Quick Ship Agreements
• A contractual arrangement with an equipment supplier to guarantee rapid shipment of replacements in the event of disaster
• Also known as “crate & ship”

Media Resiliency
• Redundancy and resiliency
― Computer/network hardware
― Software
― Data (backup)
― Documentation
• High Availability Techniques
― Journaling, clustering and mirroring, RAID, electronic vaulting, redundancy of network & power, remote journaling
• Key escrow for encrypted data
― e.g. encrypted filesystems on employee laptops
• Disposal & reuse
― Over-writing not sufficient for reuse at a lower classification level

RAID Levels
Redundant Disks
• RAID 0 – drive striping, no redundancy
― 2 x 100 GB drives look like 1 x 200 GB drive
― If either drive fails, the entire logical drive is gone
• RAID 1 – drive mirroring
― 2 x 100 GB drives look like 1 x 100 GB drive
― If either drive fails, life goes on
― Slower writes, faster reads
• RAID 4 – data striping with parity on the end drive
― 5 x 100 GB drives look like 1 x 400 GB drive
― If any one drive fails, life goes on
― Writes to parity drive are a bottleneck
• RAID 5 – like RAID 4, but the parity is striped across the drives, too
― 5 x 100 GB drives look like 1 x 400 GB drive
― Parity striping keeps writes fast
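The capacity figures and the parity idea can be checked with a short sketch. It assumes equal-sized drives and uses XOR parity as in RAID 4 and RAID 5; the function names and the sample blocks are hypothetical.

```python
from functools import reduce
from operator import xor


def usable_gb(level, drives, size_gb):
    """Usable capacity for the RAID levels discussed here (equal-sized drives assumed)."""
    if level == 0:
        return drives * size_gb        # all capacity, no redundancy
    if level == 1:
        return size_gb                 # every drive holds the same data
    if level in (4, 5):
        return (drives - 1) * size_gb  # one drive's worth of space goes to parity
    raise ValueError("level not covered in this sketch")


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Four data blocks plus one parity block, as in a 5-drive RAID 4/5 stripe.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Lose the third block, then rebuild it from the survivors and the parity.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
print(rebuilt == data[2], usable_gb(5, 5, 100))
```

The same XOR rebuilds any single missing block, which is why these levels survive one drive failure but not two.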

Data Recovery
• Types of backup:
― Image backups (Norton Ghost, etc.)
― System backups
― Data backups: full, incremental, differential
• Special consideration
― Databases: live backups, journaling and log files
• Storage locations
― Must be secure
― Controlled environment
― Fire & water protection
― Far enough away, but not too far
• Personnel
― Clearances and background checks
― Tested procedures for storage, segregation, retrieval and disposal of media
Remember: nobody cares if you back up. They only care if you can’t restore.
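The difference between the data backup types shows up at restore time. The sketch below works out which backups a restore needs, assuming the usual definitions: a differential holds all changes since the last full backup, and an incremental holds changes since the previous backup of any type. The function name and the schedules are hypothetical.

```python
def restore_chain(backups):
    """backups: chronological list of (label, kind) tuples.

    Returns the backups needed, in order, to restore to the latest point.
    Raises ValueError if the list contains no full backup.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    tail = backups[last_full + 1:]

    # The latest differential replaces everything between it and the full backup.
    diff_positions = [i for i, (_, kind) in enumerate(tail) if kind == "differential"]
    if diff_positions:
        chain.append(tail[diff_positions[-1]])
        tail = tail[diff_positions[-1] + 1:]

    # Every remaining incremental must be applied in order.
    chain.extend(b for b in tail if b[1] == "incremental")
    return chain


incremental_week = [("Sun", "full"), ("Mon", "incremental"), ("Tue", "incremental")]
differential_week = [("Sun", "full"), ("Mon", "differential"), ("Tue", "differential")]
print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

Incrementals keep each backup small but lengthen the restore chain; differentials do the reverse.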

Offsite Backup Controls
• Secure storage & transport of media
• 24-hour availability
• All media should be labeled (both human- and machine-readable), tracked and controlled
• Staff should be trained in backup & recovery procedures, media storage, transport & control procedures
• Test backup and restore procedures by following the complete procedures as documented
• Plan for use of backups in BCP/DRP
• Stored copies of all software
― In a program library
• Source code escrow
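One way to make a restore test objective is to compare a checksum recorded at backup time with a checksum of the restored data. The sketch below uses SHA-256 from the Python standard library; the function names and the sample data are hypothetical, and real media would be hashed in chunks rather than held in memory.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest recorded when the backup is taken."""
    return hashlib.sha256(data).hexdigest()


def restore_verified(recorded_digest: str, restored: bytes) -> bool:
    """True only if the restored bytes match what was originally backed up."""
    return sha256_of(restored) == recorded_digest


original = b"ledger 2024-06-30"
recorded = sha256_of(original)
print(restore_verified(recorded, original), restore_verified(recorded, b"ledger 2024-06-3"))
```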

Typical Achievable RPO and RTO
Description | Typically achievable recovery point objective (RPO) | Typically achievable recovery time objective (RTO)
No DRP | N/A: all data is lost | N/A
Tape vaulting | Measured in days since last backup | Days
Electronic vaulting | Hours | Hours (remote hot site) to days
Active replication to remote site (without recovery automation) | Seconds to minutes | Hours to days (dependent on availability of recovery hardware)
Active storage replication to remote “in-house” site | Zero to minutes (dependent on replication technology and automation policy) | One or more hours (dependent on automation)
Active software replication to remote “active” site | Seconds to minutes | Seconds to minutes (dependent on automation)
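A sketch of how such a table might drive strategy selection: walk the options from least to most capable and stop at the first one that meets both targets. The hour values are illustrative worst-case numbers invented for this example, not figures from the table, and the assumption that the table order tracks rising cost is also ours.

```python
# (strategy, worst-case RPO in hours, worst-case RTO in hours): illustrative numbers only
OPTIONS = [
    ("Tape vaulting", 24.0, 72.0),
    ("Electronic vaulting", 8.0, 48.0),
    ("Active replication, no recovery automation", 0.25, 48.0),
    ("Active storage replication to in-house site", 0.1, 4.0),
    ("Active software replication to active site", 0.1, 0.1),
]


def first_option_meeting(rpo_hours, rto_hours):
    """Return the first listed strategy meeting both targets, or None if none does."""
    for name, rpo, rto in OPTIONS:
        if rpo <= rpo_hours and rto <= rto_hours:
            return name
    return None


print(first_option_meeting(rpo_hours=1, rto_hours=8))
```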

Other Issues
• Standby Services
― Licensing of operating systems and applications for alternate sites
• Software escrow
― Critical source code is entrusted to a third party so it is available if the vendor goes out of business
• Hardcopy Records
― Should be protected off-site (fireproof cabinets, etc.)
• Recovery Management (Crisis Management)
― Overall coordination of the response
― Deal with issues promptly and minimize damage
▪ Needs information flow

Phase 4 – Plan Development and Implementation
• Determine management concerns and priorities
• Determine planning scope
― Geographical issues, organizational issues, recovery functions
• Establish outage assumptions
• Define prevention strategies
― For risk management, physical security, insurance coverage, infosec and other mitigation
• Plan for relocation of the emergency command/operations centre at an alternate site
• Identify recovery strategies for critical applications and systems at alternate sites
• Identify recovery strategies for non-critical applications and systems at alternate sites

Disaster Response Procedure
• Contact recovery team members for initial damage assessment
― Use a call tree or cloud-hosted SaaS application (more on this later)
• Determine extent of damage to primary site
― Personnel safety is the highest priority
• If the outage time will exceed the MTD, then the DRP must be invoked
• Check the estimate of time required for business recovery (in the BCP)
• Notify management
• Declare a disaster and begin implementation of the continuity and recovery plans
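A call tree can be modelled as a mapping from each person to the people they are responsible for calling. The sketch below walks the tree level by level to produce a notification order; the roles and the tree itself are hypothetical.

```python
from collections import deque


def call_order(tree, root):
    """Breadth-first walk of a call tree: each person calls their own contacts."""
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        person = queue.popleft()
        order.append(person)
        for contact in tree.get(person, []):
            if contact not in seen:  # nobody is called twice
                seen.add(contact)
                queue.append(contact)
    return order


# Hypothetical call tree for a recovery team.
tree = {
    "coordinator": ["it lead", "facilities lead"],
    "it lead": ["dba", "network admin"],
}
print(call_order(tree, "coordinator"))
```

A real call tree also needs alternates for anyone who cannot be reached, which this sketch leaves out.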

Resumption of Critical Functions at Alternate Site
• Establish command/coordination centre
• Contact recovery site to confirm availability and needs
• Instruct off-site storage to transport backups to recovery site
• Allocate office space and resources to the recovery team
• Verify equipment at alternate site
• Activate operating systems for computers
• Configure and test voice communications
• Install software backups, restore data, test
• Notify users of schedule and alternate location
• Certify ready for operations, and open for business

Restoration Operations
• Returning to the primary site
• Actions
• Complete damage assessment
• Salvage equipment and documents if possible
• Clean up
• Review insurance, lodge claims
• Dispose of & replace equipment
• Coordinate repairs
― Structural, equipment, infrastructure (HVAC, plumbing, etc.)
• Reactivate fire protection and physical security
• Restore site to normal operating conditions
― least critical functions first
• Implement and test network systems
• Certify ready for operations, and open for business

External Communications
• Dealing with customers, shareholders, employee groups, local government, emergency services, media
• Unified response
― Primary contact should be public relations
• Break bad news directly
― Coverups and delays are counterproductive
• BCP should cover approvals process for statements
• Use technology
― Mailing lists, web site, etc.
• Select a site for press conferences
• Record events
― Useful for insurance, legal, as well as process improvement

Unforeseen Complications
• If listed here, would be foreseen
• But consider
― Liaison and coordination with police and emergency services
― Responsibilities to families
― Coordination with HR and legal
― Fraud opportunities, looting, vandalism, phishing
― Protection of the primary site
― Safety and legal problems
― Occupational health and safety
― Expenses exceeding the emergency manager’s authority
― Learn from events like 9/11, US hurricanes, etc.

Testing the Plans
• Types of Tests
― Structured walkthroughs
― Checklist test
▪ Each area checks the plan has the points they expect
― Simulation
▪ Scenario-based, with all players
― Parallel test
▪ Operational test – critical systems are deployed at the alternate site to verify correct operation
― Full interruption test
▪ Normal operations are completely halted
▪ All processing is conducted at the alternate site
▪ Military does this, also banking and other critical infrastructure

Maintenance of Plans
• The BCP documentation needs its own change management system
― Inputs to the CMS can come from the SDLC for other systems
• Make someone responsible for maintenance in their job description
― Centralize responsibility for updates
• Activities
― Change management inputs from other processes
― Plan maintenance reviews (at least annually)
― Section reviews
― Plan maintenance distribution
▪ New copies to distribution list, old versions recalled and destroyed
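The "at least annually" review rule is easy to automate as a reminder. A minimal sketch, assuming each plan records the date of its last review; the function name and the 365-day default are hypothetical choices.

```python
from datetime import date


def review_overdue(last_review, today, max_age_days=365):
    """True if more than max_age_days have passed since the last plan review."""
    return (today - last_review).days > max_age_days


print(review_overdue(date(2023, 1, 1), today=date(2024, 3, 1)))
```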

Summary
• Business continuity planning is often overlooked
• But BCP – and even DRP – is often the fallback in the event of extreme cyber scenarios
― E.g. A.P. Moller-Maersk, who rebuilt thousands of computers from the metal up in the aftermath of NotPetya
• Cloud computing can provide high resilience if systems are architected to make use of its unique features