Cloud and Big Data
Sambit Sahu IBM Research
Course Objective
§ Graduate level course on Cloud Computing
– Focus is on learning and building extremely large scale systems and applications leveraging Cloud
– Building blocks and design patterns for designing the backend of a typical Internet-scale application
– Learn concepts and gain hands-on experience by using real clouds and cloud technologies
– Three key objectives: learn how to use a cloud, leverage cloud to build applications, build scalable intelligent systems
§ We shall learn cloud technologies by using real clouds and services – Amazon AWS, Google Cloud, Hadoop/Spark, Kafka, Elastic, Dynamo etc.
§ Required background
– Programming experience with one of the following: Java/Python, web services basics
– Operating Systems concepts and networking concepts would help you understand more
– If you are not familiar with web services, take a look at materials on any web application design technology.
What would you learn in this course…
§ How To
– How to use a Cloud as a compute node?
– How to use cloud to design an Internet scale application?
– How to process a very large amount of data?
– How to build your own cloud using open source?
§ Concepts: Building Blocks
– Virtualization, Containers, Serverless
– Peta-byte scale storage systems
– Event and messaging systems (Kafka)
– noSQL datastore (Cassandra, MongoDB, DynamoDB, …)
– Elastic Search
– Compute in a cluster
– Intelligent AI applications
–…
§ Case studies with real systems/cloud
§ Compute Cloud, Storage Cloud, Data Cloud
Main Modules
§ Cloud Platform and Programming
– Basic cloud concepts
– Hands-on experience with Amazon AWS Cloud
– Virtualization as an enabling technology
– Virtualization vs Containers vs Serverless
– Build a Web application leveraging cloud
§ Building Blocks in an Extremely Large Scale Application
– Scalable data store and noSQL database
– Message Queues: Kafka
– Unstructured data and queries: Elastic Search
– In-memory data store
– devOps: Containers, micro-services, logging and monitoring
– Build a scalable application using scalable, event-driven pattern
§ Private Cloud
– Understand key concepts for building a cloud
– Use Openstack cloud management stack
– devops/chef/puppet for private cloud automation
– Build your own cloud
§ Big Data Computing Platform and Programming
– Hadoop eco-system, and batch data processing & storage
– MapReduce, Hive, Hbase
– Spark and Spark Streams
– Intelligent Real-time system design using Spark
Tentative Syllabus/Lectures
§ Intro to Cloud: IaaS, PaaS, SaaS cloud, AWS, GCP, Azure Cloud (but we focus on building using AWS)
§ Designing a web application using cloud
§ Virtualization as Cloud Enabling technology; Virtualization vs Containers
§ Building Private Cloud (OpenStack)
§ DevOps in a Cloud and Micro-services Architecture
§ Designing Extremely Large Scale Applications
– Message Queue (Kafka)
– Event Notification
– Scalable no-SQL
– Lambda architecture
– Indexing and searching unstructured data (Elastic Search)
§ Computing in a Cluster
– Hadoop/MR
– Spark based compute model
§ Use cases: Designing Intelligent Services in a Cloud
– We will use a variety of AWS ML and Google ML APIs to design interesting use cases
Tentative Course Schedule
Date             | Topic                              | Reading List
09/08            | Intro to Cloud                     |
09/15            | Cloud Programming                  | GFS
09/22            | Designing Scalable Web Application | BigTable
09/29 [A1]       | Designing Web Scale Applications   | Kafka
10/06            | Message Queues and Logs            | Cassandra, DynamoDB
10/13            | noSQL database, Elastic Search     | MapReduce
10/20 Quiz1 [A2] | Containers, Kubernetes, devops     | Anthos
10/27            | Cluster Computer: Spark            | Spark
11/03            | Spark Data Frames                  | Borg
11/10 [A3]       | Spark Advanced                     | Spanner
11/17            | Private Cloud                      |
11/24            | Intelligent Systems                |
12/01 Quiz 2     | Advanced Topics                    |
12/08            | Advanced Topics                    |
12/15            | Final Demo                         |
Course Material
§ Lecture Notes
– Each lecture will have a theme topic. Lecture slides will be provided for each lecture. Additional reference materials will be specified.
§ Reading List
– A set of landmark papers in the area of large scale systems
– You submit a paper summary by answering the provided questions.
§ Three programming assignments
§ A final course project
§ Reference Texts
– AWS in Action
– Elastic Search in Action
– Kafka: The Definitive Guide
– Hadoop: The Definitive Guide
– Learning Spark
Grading and requirements
§ 2 quizzes: 25% of grade
§ Assignments: 35% of grade
– 3 homework assignments focused on technologies and programming
§ Course project: 40% of grade
– Students may team up
§ Submission process: everything is to be done using Courseworks and Github
Project: Learn how to innovate in this space
§ Objective is to learn how to innovate in this space
§ Four phases to your project
1. Concept and business idea
2. Technology viability and architecture
3. Execution planning and prototyping
4. Demo, socialization and review
§ A few suggestions
– Don’t procrastinate: start early. Motivation: it would help you get an A+ (and earn millions!)
– Form your team carefully by asking around and interviewing your teammates. Float some ideas and kick the tires. Take a look at the many recent startups that were bought by Google, Apple, FB, Amazon etc. Take a look at beta.list
– Cloud + Social + Mobile is a good recipe for a perfect storm
What you need to do soon
§ Get accounts on a few popular clouds
– Amazon AWS (EC2, S3)
– Google Cloud Platform, Google Storage
– We are working with Amazon to get free accounts
§ Course Project
– A substantial portion of your grade depends on the final course project
– I will provide a set of project categories that you can choose from, or you can come up with your own. Each project category will have a set of criteria that need to be demonstrated
– You need to have a team and a project proposal by 02/11/20 5:00pm
What is Cloud?
§ Allows users to request computing/storage resources through web interfaces
§ You do not need to own, install, or manage these resources.
§ Pay as you go: resources on demand
§ Elastic: use as much or as little as you want
– Users can assume an infinite amount of compute and storage resources is available.
– Users can request the resources they need when they need them, and release them when they no longer need them.
§ Compute and storage resources are now treated as software entities. You get access to such resources programmatically, not through physical hardware anymore!
§ So what are Clouds? Where are the Clouds?
§ Read this paper: http://cacm.acm.org/magazines/2010/4/81493-a-view-of-cloud-computing/fulltext
Why Cloud?
§ You can get as many as 1000 machines for an hour for a few dollars to run a complex application!
§ You don’t need to manage, maintain or fix any machines!
§ You can use as few as 1 machine or as many as 10000 machines, depending on your current needs!
§ Two key focuses: on-demand and elastic!
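The on-demand claim above can be checked with simple arithmetic. The sketch below assumes a hypothetical flat price of $0.01 per machine-hour (an illustrative figure, not an actual AWS rate) and shows that 1000 machines for one hour cost the same as one machine for 1000 hours.

```python
from fractions import Fraction

# Hypothetical flat on-demand price in dollars per machine-hour (illustrative, not a real AWS rate).
PRICE_PER_MACHINE_HOUR = Fraction(1, 100)


def on_demand_cost(machines: int, hours: int, price=PRICE_PER_MACHINE_HOUR) -> Fraction:
    """Pay-as-you-go cost: you are billed only for machine-hours actually used."""
    if machines < 0 or hours < 0:
        raise ValueError("machines and hours must be non-negative")
    return machines * hours * price


if __name__ == "__main__":
    wide = on_demand_cost(machines=1000, hours=1)    # finish in one hour
    narrow = on_demand_cost(machines=1, hours=1000)  # finish in about six weeks
    print(f"1000 machines x 1 hour : ${float(wide):.2f}")
    print(f"1 machine x 1000 hours : ${float(narrow):.2f}")
```

Under this flat-price assumption the bill depends only on total machine-hours, so a job that parallelizes well can finish far sooner at no extra cost.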
Essential Characteristics
§ On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.
§ Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
§ Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
§ Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
§ Measured Service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service.
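Rapid elasticity and measured service can be illustrated together with a small simulation. This is a sketch only: the per-instance capacity and the hourly load figures are made-up assumptions, and real autoscalers use richer policies than this one.

```python
import math

# Assumed capacity of one instance in requests per second (hypothetical).
REQUESTS_PER_INSTANCE = 100


def instances_needed(load_rps: float, capacity: int = REQUESTS_PER_INSTANCE) -> int:
    """Smallest number of instances that can serve the load (at least one)."""
    return max(1, math.ceil(load_rps / capacity))


def simulate(hourly_load):
    """Scale out and in once per hour; return the fleet size per hour and metered instance-hours."""
    fleet = [instances_needed(load) for load in hourly_load]
    metered_instance_hours = sum(fleet)  # measured service: usage is metered, then billed
    return fleet, metered_instance_hours


if __name__ == "__main__":
    load = [50, 120, 900, 2500, 800, 90]  # hypothetical requests/second, one sample per hour
    fleet, used = simulate(load)
    print("fleet per hour:", fleet)
    print("metered instance-hours:", used)
    print("instance-hours if sized for peak all day:", max(fleet) * len(fleet))
```

The gap between metered instance-hours and the peak-sized figure is the capacity an elastic consumer does not have to pay for.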
Service Models
§ Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
§ Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
§ Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
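The three definitions above differ mainly in which layers the consumer controls. A minimal sketch of that split follows; the layer names are shorthand chosen for this example, and the "possibly" clauses in the definitions appear only as comments.

```python
# What the consumer controls under each service model, paraphrased from the definitions above.
# The layer names are shorthand chosen for this sketch.
LAYERS = ["network", "servers", "operating systems", "storage", "applications"]

CONSUMER_CONTROLS = {
    "SaaS": set(),                                             # at most limited user-specific settings
    "PaaS": {"applications"},                                  # plus possibly hosting environment config
    "IaaS": {"operating systems", "storage", "applications"},  # plus possibly limited networking
}


def provider_manages(model: str) -> list:
    """Layers left to the provider under the given service model."""
    controlled = CONSUMER_CONTROLS[model]
    return [layer for layer in LAYERS if layer not in controlled]


if __name__ == "__main__":
    for model in ("SaaS", "PaaS", "IaaS"):
        print(f"{model}: consumer controls {sorted(CONSUMER_CONTROLS[model]) or 'nothing'}; "
              f"provider manages {provider_manages(model)}")
```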
Deployment Models
§ Private cloud. The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
§ Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premise or off premise.
§ Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
§ Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
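Cloud bursting, mentioned in the hybrid cloud definition, can be sketched as a placement rule: fill the private cloud first and send only the overflow to a public cloud. The function name and the unit of "demand" are assumptions for illustration.

```python
def place_requests(demand: int, private_capacity: int) -> dict:
    """Fill the private cloud first; burst only the overflow to the public cloud."""
    if demand < 0 or private_capacity < 0:
        raise ValueError("demand and capacity must be non-negative")
    on_private = min(demand, private_capacity)
    return {"private": on_private, "public": demand - on_private}


if __name__ == "__main__":
    print(place_requests(demand=80, private_capacity=100))   # fits on premise
    print(place_requests(demand=250, private_capacity=100))  # overflow bursts out
```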
Berkeley View of Cloud Definition
§ IaaS → SaaS Provider → SaaS User
Source: Above the Clouds: A Berkeley View of Cloud Computing
Different types of utility model
§ IaaS Cloud (Amazon EC2)
– Low level of computing resource abstraction
– Provides a (virtual) machine to users
– Makes it hard for IaaS providers to support automatic scaling, failover etc.
§ Google AppEngine
– Targeted at web applications
– Enforces an application structure
– Clean separation between stateless computation tier and stateful storage tier
– Benefit: makes it possible to handle auto-scaling, failover/high availability
§ Microsoft Azure
– Applications need to be written using .NET libraries
– More flexible than Google AppEngine
– Able to provide some automated scaling
– Between application framework and hardware virtual machines
Different Cloud Offerings: A Layered Perspective
§ The higher in the stack, the less control but the more automation for the user
§ The lower in the stack, the more control but the more responsibility for the user
Example Clouds and Usage Scenario
§ IaaS
– Amazon EC2, Rackspace
– Machine level abstraction
• User requests a machine with desired CPU, mem, disk, possibly with a preconfigured OS and software
• IaaS Cloud provides a virtual server with (minimal) pre-installed software such as OS
§ PaaS
– Google AppEngine
– Microsoft Azure
– Platform level abstraction
• User writes application using PaaS defined interfaces
• PaaS provides platform to support the deployment and management of this application
§ SaaS
– salesforce.com
§ Roll your own
– Open Source software stack
• Open Nebula
• Eucalyptus
• Openstack
– User installs and adapts to build own Cloud
Cloud Computing Economics
§ Three useful usage scenarios
– Load varying with time
– Demand unknown in advance
– Batch analytics that can benefit from huge number of resources for a short time duration
§ Why the pay-as-you-go model makes sense economically even if it costs more than buying a server and depreciating the h/w
– Extreme elasticity
– Transference of risk (of over provisioning)
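The economic argument above can be made concrete with a small comparison. All numbers here are assumptions for illustration: a day with 12 off-peak hours at 100 servers and 12 peak hours at 500 servers, an amortized owned cost of 1.0 per server-hour, and a cloud price 50% higher.

```python
def owned_cost(hourly_demand, owned_price_per_server_hour):
    """Owning means provisioning for the peak and paying for it every hour."""
    return max(hourly_demand) * len(hourly_demand) * owned_price_per_server_hour


def cloud_cost(hourly_demand, cloud_price_per_server_hour):
    """Pay-as-you-go means paying only for the server-hours actually used."""
    return sum(hourly_demand) * cloud_price_per_server_hour


if __name__ == "__main__":
    # Hypothetical day: 100 servers for 12 off-peak hours, 500 servers for 12 peak hours.
    demand = [100] * 12 + [500] * 12
    owned = owned_cost(demand, owned_price_per_server_hour=1.0)  # assumed amortized cost
    cloud = cloud_cost(demand, cloud_price_per_server_hour=1.5)  # assumed 50% price premium
    print(f"owned, sized for peak: {owned:.0f}")
    print(f"cloud, pay-as-you-go : {cloud:.0f}")
```

In this model the cloud is cheaper whenever its price premium is smaller than the ratio of peak to average demand (here 500/300, about 1.67). The risk of guessing the peak wrong also moves from the user to the provider.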
Source: Above the Clouds: A Berkeley View of Cloud Computing
Top obstacles and opportunities for Cloud
Source: Above the Clouds: A Berkeley View of Cloud Computing
IaaS Cloud Example: Amazon EC2
§ Amazon EC2 provides public IaaS Cloud
§ User uses a portal to request a machine with specific resources
– CPU, memory, disk space
– Pre-built OS and possibly middleware
PaaS Cloud: Google App Engine
§ PaaS model
§ Provides a platform to host web applications
§ App Engine SDK for programming (Python and Java support)
§ A set of primitives (datastore, URL fetch, memcache, JavaMail, Images, authentication, ...)
§ User focuses on developing the application in this framework
§ Once deployed, scaling, availability etc. are handled by Google AppEngine platform
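To make "developing the application in this framework" concrete, here is a generic stateless request handler written as a standard Python WSGI callable, using only the standard library. This is not the App Engine SDK: it assumes the platform accepts a WSGI callable, and the names `application` and `call_locally` are chosen for this sketch.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Stateless handler: nothing is kept in process memory between requests."""
    name = environ.get("QUERY_STRING", "") or "world"
    body = f"Hello, {name}!\n".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_locally(query: str = "") -> bytes:
    """Invoke the handler without a server, the way a platform front end would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["QUERY_STRING"] = query
    return b"".join(application(environ, lambda status, headers: None))


if __name__ == "__main__":
    print(call_locally("cloud").decode("utf-8"), end="")
```

Because the handler holds no state of its own, a platform can run any number of copies behind a load balancer, which is what makes automatic scaling and failover tractable.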
Let’s use an IaaS Cloud (Amazon EC2)
§ http://aws.amazon.com/console/
§ Amazon EC2 console based provisioning demo
Traditional vs Cloud-based Application
Leveraging Cloud Services to Quickly Build Complex Applications
Amazon Cloud Services: Accessing through Web APIs
Various Methods to Access AWS
Amazon AWS console (EC2 view)
§ User logs in with AWS credentials
User launches an instance request → a list of prebuilt stacks is provided
§ AWS shows a list of available pre-built base software stacks (called Virtual Appliances) that the user may request to add to the machine
User can choose the resource size (CPU, mem choices)
§ Instance request wizard guides the user through resource choices
User specifies security/access configurations
AWS provisions an instance and returns user credentials
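The console steps above (pick a prebuilt stack, pick a size, set access, launch) map onto a single programmatic request. The sketch below collects those choices as keyword arguments for the EC2 RunInstances call as exposed by the boto3 SDK. The helper names are chosen for this example and every identifier in the usage section is a placeholder; actually launching requires boto3 to be installed and AWS credentials to be configured.

```python
def build_run_instances_request(image_id, instance_type, key_name, security_group_ids):
    """Collect the console wizard's choices as keyword arguments for EC2 RunInstances."""
    return {
        "ImageId": image_id,                           # prebuilt software stack (AMI)
        "InstanceType": instance_type,                 # resource size (CPU, memory)
        "KeyName": key_name,                           # SSH key pair for access
        "SecurityGroupIds": list(security_group_ids),  # firewall rules
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch(request, region_name):
    """Send the request with boto3; needs `pip install boto3` and configured AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    request = build_run_instances_request(
        image_id="ami-EXAMPLE",
        instance_type="t2.micro",
        key_name="my-key-pair",
        security_group_ids=["sg-EXAMPLE"],
    )
    print(request)  # call launch(request, "us-east-1") to actually provision
```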
Next Week
§ Account setup and testing
– Sign up for AWS account. Sign up for AWS EC2 and S3 services.
– Create a micro instance with Amazon Linux stack with appropriate keys and access control using AWS portal. SSH into the instance you created.
– Read Chapters 1 and 3 of the AWS in Action book.
– Assignment 0
– Building Modern Web Application (just complete Module 1) by following this link:
https://aws.amazon.com/getting-started/hands-on/build-modern-app-fargate-lambda-dynamodb-python/
Some additional links
§ Hands-on Tutorials on AWS: https://aws.amazon.com/getting-started/hands-on/
§ https://aws.amazon.com/solutions/case-studies/
§ http://aws.amazon.com/awscredits