
ECS781P
CLOUD COMPUTING
Cloud Application Concerns, SLA and QoS
Lecturer: Dr. Sukhpal Singh Gill and Dr. Ignacio Castro, School of Electronic Engineering and Computer Science


Contents
• Quality of Service
• Concerns of Cloud Applications
• Cloud mechanisms

The structure of modern cloud applications

States of Process

Essential Cloud Characteristics (from NIST)
• Pervasive network access
• Location independence
• High availability
• Resource pooling and partitioning
• Extensive use of virtualization
• Automated management for cloud clients
• Rapid elasticity

Essential Cloud Characteristics (from NIST)
• High availability
• 24 X 7
• Including Scheduled Downtime
• Automated management for cloud clients
• MAPE/AI
• Rapid elasticity
• The ability of (virtualized) IT resources to be provisioned and released in response to demand, either automatically or by the cloud consumer.

Cloud service level agreements
• Service Level Agreement (SLA): The part of the contract between a cloud service provider and the cloud service consumer that specifies:
1. the services that are provided
2. the service metrics that the provider promises to deliver.
• These guarantees are often “carried forward”:
• the consumer of the cloud service makes the same guarantees (or their translation into its own, higher-level service) to its own consumers (clients, businesses, partners, etc.)
• and so on…
• Example: https://cloud.google.com/compute/sla
• https://landing.google.com/sre/sre-book/chapters/service-level-objectives/

SLOs
• The SLA is described in terms of SLOs (Service-Level Objectives):
• These Service Level Objectives (SLOs) represent metrics for measuring the service quality that the consumer receives. The SLA describes the guarantees and limitations of the service in terms of these metrics in a human-readable way.
• The cloud service provider also uses these metrics to perform periodic measurements to ensure that it remains in compliance with the SLA.
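As an illustration, a latency SLO such as "99% of requests complete within 300 ms over the measurement window" can be checked directly against collected measurements. A minimal sketch, where the threshold, target and sample latencies are hypothetical:

```python
# Minimal sketch: checking a latency SLO against measured request latencies.
# The threshold, target and sample values are hypothetical examples.
latencies_ms = [120, 95, 310, 180, 250, 90, 400, 150, 110, 130]

SLO_THRESHOLD_MS = 300   # a "good" request finishes within 300 ms
SLO_TARGET = 0.99        # at least 99% of requests must be good

good = sum(1 for latency in latencies_ms if latency <= SLO_THRESHOLD_MS)
compliance = good / len(latencies_ms)

print(f"compliance: {compliance:.1%} (target {SLO_TARGET:.0%})")
print("SLO met" if compliance >= SLO_TARGET else "SLO violated")
```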

Service Quality metrics
• These service quality metrics should have the following characteristics:
• Quantifiable: based on quantitative measurements with well-defined units (e.g. good/bad/satisfactory are NOT acceptable);
• Repeatable: should give identical results when applied under identical conditions;
• Comparable: need to be standardized and uniform (e.g. same unit should be used across all measurements of the same entity);
• Easily Obtainable: so that it can be measured and understood by the consumer as well (e.g. it should not be a measurement of some hidden operation of the cloud, or require “proprietary” tools).


Contents
• Quality of Service
• Concerns of Cloud Applications
• Cloud mechanisms

Concerns of cloud application developers
• Reliability: The system should continue to work correctly in the face of adversity
• LinkedIn was down on 23 Feb 2021
• Scalability: As the system grows, there should be reasonable ways of dealing with that growth
• Maintainability: Over time, it should be productive to not only maintain the current behaviour of the system, but also adapt it to new use cases

Concerns of cloud application developers
https://www.theverge.com/2021/2/23/22297620/linkedin-down-outage-issues

Concerns of cloud application developers
• Maintainability
• Adaptive
• Perfective
• Corrective

Service reliability metrics
• Reliability: probability that a service (e.g. an IT resource) can perform its intended function under pre-defined conditions without experiencing failure.
• Resiliency: measures the ability of an IT resource to recover from operational disturbances.
• redundant implementation and resource replication over different physical locations.
• Availability Rate: percentage of service up-time
• availability rate = total up-time / total time
• usually expressed in percentage, e.g., minimum 99.5% up-time.

High Availability measurement: counting nines
Percentage Uptime | Percentage Downtime | Downtime per year | Downtime per week
98%      | 2%      | 7.3 days  | 3h22m
99%      | 1%      | 3.65 days | 1h41m
99.8%    | 0.2%    | 17h30m    | 20m10s
99.9%    | 0.1%    | 8h45m     | 10m5s
99.99%   | 0.01%   | 52.5m     | 1m
99.999%  | 0.001%  | 5.25m     | 6s
99.9999% | 0.0001% | 31.5s     | 0.6s
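The downtime figures follow directly from the availability rate. A minimal sketch reproducing them (durations come out slightly rounded, as in the table):

```python
# Minimal sketch: converting an uptime percentage into allowed downtime.
YEAR_HOURS = 365 * 24     # 8760 hours
WEEK_MINUTES = 7 * 24 * 60

def downtime(uptime_percent: float, period: float) -> float:
    """Downtime allowed in a period for a given uptime percentage."""
    return (1 - uptime_percent / 100) * period

for uptime in (98, 99, 99.8, 99.9, 99.99, 99.999, 99.9999):
    per_year_h = downtime(uptime, YEAR_HOURS)
    per_week_m = downtime(uptime, WEEK_MINUTES)
    print(f"{uptime}% uptime -> {per_year_h:.2f} h/year, {per_week_m:.2f} min/week")
```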

Service reliability metrics: MTBF

Defining Failure: What Is MTTR, MTTF, and MTBF?
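For a repairable component these metrics are commonly related as MTBF = MTTF + MTTR: the mean time between failures is the mean time to failure (how long the component typically runs before failing) plus the mean time to repair/restore it.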



On the cumulative effect of faults
• A typical hard disk: MTTF: 10-50 years
• On a cluster with 10,000 disks… about one disk will die per day
• What happens to the MTBF of your application, as the number of elements increases?
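The arithmetic behind the disk estimate, and behind the question above: with N independent components, the expected failure rate is roughly N divided by the per-component MTTF. A quick sketch using the slide's order-of-magnitude figures:

```python
# Minimal sketch: expected failures in a large fleet of components.
mttf_years = 30     # a single disk: MTTF roughly 10-50 years
n_disks = 10_000    # disks in the cluster

failures_per_year = n_disks / mttf_years
print(f"expected failures: ~{failures_per_year:.0f} per year "
      f"(~{failures_per_year / 365:.1f} per day)")
```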

Google’s Tail of Latency
Probability of one second service-level response time as the system scales and frequency of server-level high-latency outliers varies
J.Dean, L.A. Barroso, “The Tail at Scale”. Communications of the ACM, vol. 56 (2013)

Scalability
• Cloud services need to serve a given workload, using a set of provisioned resources, in order to provide the desired performance satisfactorily.
• Scalability is the ability of a system to cope (e.g. maintain performance) with increased workload, by making use of additional resources.
• Elasticity refers to dynamic scalability (up and down).

Performance metrics
• Completion Time: how long it takes on average for the service to complete a response to a user's request, including the time that the request has to wait (in the queue) while the processor is finishing other tasks.
• Network latency time: how long it takes for a packet to travel from the client to the server across the Internet
• Response time: network latency + server completion time
• Throughput: amount of requests/data that can be processed per unit of time (e.g. requests / sec.)
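As a purely illustrative example: with 40 ms of network latency and 110 ms of server completion time, the response time observed by the client is roughly 150 ms; throughput, by contrast, measures how many such requests the service can process per second.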

Performance metrics
• Turnaround Time
• Waiting Time
• Execution Time
• Energy Consumption

How to observe/measure performance?
• Average is a very bad estimator
• Percentiles show a clearer picture of what users are experiencing
[Figure: response times of individual requests, showing the median (p50), mean (average), 95th and 99th percentiles]
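A small sketch of how these percentiles can be computed from measured response times (sample values are hypothetical); note how a single slow outlier inflates the mean and the tail percentiles far more than the median:

```python
# Minimal sketch: mean vs. median vs. tail percentiles of response times.
import math
import statistics

response_times_ms = sorted([45, 50, 48, 52, 47, 51, 49, 46, 53, 1200])  # one slow outlier

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[rank - 1]

print(f"mean = {statistics.mean(response_times_ms):.0f} ms")  # dominated by the outlier
print(f"p50  = {percentile(response_times_ms, 50)} ms")        # typical user experience
print(f"p95  = {percentile(response_times_ms, 95)} ms")
print(f"p99  = {percentile(response_times_ms, 99)} ms")
```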

Performance depends on the workload
Ali-Eldin, Ahmed, et al. “Measuring cloud workload burstiness.” IEEE UCC, 2014.

Manageability
• Operability: Make it easy for the operations team to keep the system running smoothly
• Monitoring performance, tracking down the cause of problems
• Simplicity: Make it easy for new engineers to understand the system
• Evolvability: Make it easy for engineers to make changes to the system in the future
• Adapt to unanticipated changes


Contents
• Quality of Service
• Concerns of Cloud Applications
• Cloud mechanisms

Base cloud mechanism?
• Cloud resource provisioning
• Virtual Machines
• Containers
• PaaS
• Network resources
• Storage Resources
• Cloud-specific services….

Cloud horizontal scaling – replication
• Adding multiple instances of the same element/functionality
• Helps with scalability
• Even replicate across different resource pools
• Private plus public cloud usage
• Multi-region, multi-datacentre scaling
• Multi-cloud-provider scaling

Load balancers
• Essential mechanism for any horizontal scaling
• Goal: forward requests to the pool of available servers
• Transparent to clients
• Need to have a policy to select which resource will handle the request
• Trivial for stateless services.
• Stateful servers need the LB to remember past interactions
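A minimal sketch of one such policy: round-robin over a pool of stateless backends (the server addresses are hypothetical placeholders). A stateful service would additionally need the load balancer to map each client or session to a fixed server:

```python
# Minimal sketch: round-robin selection policy for a load balancer.
# The backend addresses are hypothetical placeholders.
import itertools

class RoundRobinBalancer:
    def __init__(self, backends):
        self._pool = itertools.cycle(backends)

    def pick(self):
        """Return the backend that should handle the next request."""
        return next(self._pool)

lb = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
for _ in range(5):
    print("forwarding request to", lb.pick())
```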

Fault Tolerance
• The property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.

Autoscaler systems
Cloud management system that:
1. Monitors service load
2. Requests/releases resources when needed. How?
3. Configures new VMs/ containers
4. Configures load balancers to use updated resources
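A minimal sketch of the decision step of such a loop, using a simple threshold policy on average CPU utilisation. The thresholds and instance limits are hypothetical; in a real autoscaler the decision would then drive the provider's provisioning API and a load-balancer reconfiguration (steps 3 and 4):

```python
# Minimal sketch: threshold-based autoscaling decision (step 2 of the loop).
# Thresholds and instance limits are hypothetical example values.
SCALE_OUT_THRESHOLD = 0.75   # add capacity above 75% average CPU
SCALE_IN_THRESHOLD = 0.25    # release capacity below 25% average CPU
MIN_INSTANCES, MAX_INSTANCES = 2, 20

def autoscale_step(current_instances: int, average_cpu: float) -> int:
    """Return the desired number of instances after one monitoring interval."""
    if average_cpu > SCALE_OUT_THRESHOLD and current_instances < MAX_INSTANCES:
        return current_instances + 1   # then configure the new VM/container and the LB
    if average_cpu < SCALE_IN_THRESHOLD and current_instances > MIN_INSTANCES:
        return current_instances - 1   # then drain the instance and update the LB
    return current_instances

print(autoscale_step(current_instances=4, average_cpu=0.82))  # -> 5
print(autoscale_step(current_instances=4, average_cpu=0.10))  # -> 3
```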

Sample cloud support: LB plus Autoscaling

SLA Monitoring
• Observe the runtime performance of cloud services to ensure that they meet the contractual Quality-of-Service (QoS) requirements specified in the SLA.
• The system can proactively repair or failover cloud services when exception conditions occur, such as when the SLA monitor reports a cloud service as “down”.
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
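One common way to act on these measurements is an error budget: the amount of downtime the SLA still allows in the current period. A minimal sketch, with a hypothetical SLA target and observed downtime:

```python
# Minimal sketch: remaining downtime ("error budget") under an availability SLA.
# The SLA target and the observed downtime are hypothetical example values.
SLA_UPTIME = 0.999              # e.g. 99.9% monthly uptime promised in the SLA
PERIOD_MINUTES = 30 * 24 * 60   # a 30-day month

allowed_downtime = (1 - SLA_UPTIME) * PERIOD_MINUTES   # ~43.2 minutes
observed_downtime = 12.0                                # minutes of outage so far

remaining = allowed_downtime - observed_downtime
print(f"error budget remaining: {remaining:.1f} of {allowed_downtime:.1f} minutes")
if remaining <= 0:
    print("SLA breached for this period")
```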

Cloud application management tools
Loom: Complex large-scale visual insight for large hybrid IT infrastructure management, https://www.sciencedirect.com/science/article/pii/S0167739X16303843

Fail-over systems
• Goal: increase the reliability and availability of IT resources by providing redundant implementations, watching for the health of running services, and automatically switching over to a redundant or standby IT resource instance whenever the currently active IT resource becomes unavailable.
• Can span more than one geographical region so that each location hosts one or more redundant implementations of the same IT resource.
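A minimal sketch of the switch-over logic: periodically health-check the active instance and promote a standby when it stops responding. The endpoints are hypothetical, and is_healthy() stands in for a real probe such as an HTTP health-check request with a timeout:

```python
# Minimal sketch: active/standby failover driven by a health check.
# Endpoints are hypothetical; is_healthy() is a placeholder for a real probe.
def is_healthy(endpoint: str) -> bool:
    """Placeholder health probe; a real one would call the service and time out."""
    return endpoint != "10.0.0.1:8080"   # pretend the primary has just failed

active = "10.0.0.1:8080"
standbys = ["10.0.1.1:8080", "10.0.2.1:8080"]   # redundant copies, e.g. in other regions

if not is_healthy(active):
    # Promote the first healthy standby and stop using the failed instance.
    for candidate in standbys:
        if is_healthy(candidate):
            print(f"failing over from {active} to {candidate}")
            active = candidate
            break

print("active instance:", active)
```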

Proactive failure management: testing
• Test individual components whenever
• Test services
• Canary service: run multiple versions of a service simultaneously, with a small set of users forwarded to the ‘beta’ version (see the sketch below)
• Embrace chaos; Netflix's Simian Army testing strategy: inject failures into the production environment
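A minimal sketch of how a canary split can be made: hash each user id deterministically and send a small, stable fraction of users to the canary version (the percentage and the user ids are hypothetical):

```python
# Minimal sketch: routing a small, stable fraction of users to a canary version.
# The canary percentage and the user ids are hypothetical examples.
import hashlib

CANARY_PERCENT = 5   # ~5% of users see the new ("beta") version

def version_for(user_id: str) -> str:
    """Deterministically assign a user to the 'canary' or the 'stable' version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "stable"

for uid in ("alice", "bob", "carol", "dave"):
    print(uid, "->", version_for(uid))
```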

Testing reliability: Netflix Chaos Monkey
• “We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours.”
• “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them”
https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116
