程序代写代做代考 crawler kernel concurrency html database file system hadoop AWS CIS 455/555: Internet and Web Systems

CIS 455/555: Internet and Web Systems
Transactions November 23, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
1

Plan for today
n PageRank
n Google
n Cloud computing
n Utility computing model n Brief overview of EC2
n Transactions n ACID properties
n Serializability
n Concurrency control: 2PL
n Distributed transactions: 2PC
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
2
NEXT

Why is this a good thing?
Electricity
n Economies of scale
n Cheaper to run one big power n
plant than many small ones n Statistical multiplexing
n High utilization! n n No up-front commitment
n No investment in generator; n pay-as-you-go model
n Scalability
n Thousands of kilowatts
available on demand; add more within seconds
n
Computing
Cheaper to run one big data center than many small ones
High utilization!
No investment in data center; pay-as-you-go model
Thousands of computers available on demand; add more within seconds
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
3

Many definitions of the cloud!
n According to NIST:
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or
service provider interaction.
n Essential characteristics: n On-demand self service
n Broad network access n Resource pooling
n Rapid elasticity
n Measured service
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
4

History: The early days
n Cloud computing: A new term for a concept that has been around since the 1960s
n Who invented it?
n No agreement. Some candidates:
n John McCarthy (Stanford professor and inventor of Lisp; proposed the ‘service bureau’ model in 1961)
n J.C.R. Licklider (contributed key ideas to ARPANET; published a memo on the “Intergalactic Computer Network” in 1963)
n Douglas Parkhill (published a book on “The Challenge of the Computer Utility” in 1966)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
5

History: Becoming a cloud provider
Technology
Cost in medium DC (~1,000 servers)
Cost in large DC (~50,000 servers)
Ratio
Network
$95 per Mbit/sec/month
$13 per Mbit/sec/month
7.1
Storage
$2.20 per GByte/month
$0.40 per GByte/month
5.7
Administration
~140 servers/admin
>1,000 servers/admin
7.1
n Early 2000s: Phenomenal growth of web services
n Many large Internet companies deploy huge data centers, develop scalable software infrastructure to run them
n Due to economies of scale, these companies were now able to run computation very cheaply
n What else can we do with this? © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
6

Plan for today
n Cloud computing
n Utility computing model n Brief overview of EC2
n Transactions n ACID properties
n Serializability
n Concurrency control: 2PL
n Distributed transactions: 2PC
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
7
NEXT

What is Amazon AWS?
n Amazon Web Services (AWS) provides a number of different services, including:
n Amazon Elastic Compute Cloud (EC2)
Virtual machines for running custom software
n Amazon Simple Storage Service (S3)
Simple key-value store, accessible as a web service
n Amazon Elastic MapReduce Big data processing
n Amazon DynamoDB Distributed database
n Amazon Lambda
For serverless computation
n …
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
Useful for the project
8

Signing up for AWS Educate
n Complete the web form on www.awseducate.com n Assumes you already have an AWS account
n Use your Penn email address!
n Amazon says it should only take 2-5 minutes (but don’t rely on this!!)
9
n This should give you $100/year in AWS credits © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania

AWS credentials
Command-line tools SOAP APIs
AWS web site and management console
Sign-in credentials
X.509 certificates
Connecting to an instance (e.g., via ssh)
EC2 key pairs
Access keys
REST APIs
n Why so many different types of credentials? © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
10

What is Amazon EC2?
n Infrastructure-as-a-Service (IaaS)
n You can rent various types of virtual machines by the hour
n In your VMs, you can run your own (Linux/Windows) programs
n Examples:Webserver,searchengine, movie renderer, …
n Different machine types
n Current/previousgeneration
n Memory/Storage/Compute/GPU n Microinstances
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
11
http://aws.amazon.com/ec2/pricing (11/23/2020)

Demo
n Logging into AWS Management Console n Launching an instance
n Contacting the instance via ssh
n Terminating an instance
n Have a look at the AWS Getting Started guide n https://aws.amazon.com/ec2/getting-started/
n Make use of pricing calculators as well n https://calculator.s3.amazonaws.com/index.html
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
12

Avoiding costly mistakes
n When you stop working with EC2, be sure to check whether you still have resources allocated
n If you forget about a running Small instance and don’t discover it for one week, this will cost you $10!
Amazon will charge you for allocated resources, even if you are not using them. If your $100 free credit runs out, your credit card will be charged!
(Normally, your $100 credit should be plenty to finish the project.)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
13

Oh no – where has my data gone?
n EC2 instances do not have persistent storage n Data survives stops & reboots, but not termination
If you store data on the virtual hard disk of your instance and the instance fails or you terminate it,
your data WILL be lost!
n So where should I put persistent data?
n Elastic Block Store (EBS)
n Or: suspend your instances instead of terminating them (suggestion: use termination protection)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
14

Amazon Machine Images
n When I launch an instance, what software will be installed on it?
n Software is taken from an Amazon Machine Image (AMI)
n Selected when you launch an instance
n Essentially a file system that contains the operating system, applications, and potentially other data
n Lives in S3
n HowdoIgetanAMI?
n Amazon provides several generic ones, e.g., Amazon Linux, Fedora Core, Windows Server, …
n You could make your own (NOT required for the project) n Youcanevenrunyourowncustomkernel,withsomerestrictions
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
15

Security Groups
Evil attacker
Legitimate user (you or your customers)
Instance
n Basically, a set of firewall rules
n Can be applied to groups of EC2 instances
n Each rule specifies a protocol, port numbers, etc…
n Only traffic matching one of the rules is allowed through
16
n Sometimes need to explicitly open ports © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania

Regions and Availability Zones
n Where exactly does my instance run? n No easy way to find out – Amazon does not say
n Instances can be assigned to regions
n Currently 16 available: Northern Virginia, Ohio, Oregon, Northern California, Montreal, Sao Paulo, Ireland, Frankfurt, London, Singapore, Paris, Tokyo, Sydney, Seoul, Mumbai, Osaka
n Important, e.g., for reducing latency to customers
n Instances can be assigned to availability zones n Purpose: Avoid correlated faults
n Several availability zones within each region
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
17

Network pricing
n AWS does charge for network traffic
n Price depends on source and destination of traffic
n Free within most AWS svcs in same region (e.g., EC2, S3) n IfprivateIPisused
n Remember: ISPs are typically charged for upstream traffic © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
18
http://aws.amazon.com/ec2/pricing/ (11/23/2020)

Service Level Agreement
53 min downtime per year allowed
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
19
http://aws.amazon.com/ec2-sla/ (4/2/2018 version; excerpt)

AWS and your project
n What cloud resources are useful where? n Crawler: EC2 is a good fit
n CombineHW2crawlerwithHW3cross-machineparallelism
n CanrunseveralinstancesonEC2nodes
n Trytokeepdatastructuresininstance-localstorageduringthe crawl; afterwards, download copy or keep in S3
n Indexer/PageRank: EC2 or Elastic MapReduce n IndexingandPageRankaregreatMapReducejobs!
n YoucanexecutethemonEC2instances,usingyourHW3 framework, or you can run native Hadoop jobs on EMR
n StoretheindexinBDB,DynamoDB,orRDS n Frontend: EC2
n ThefrontendcanrunonanEC2instance,usingyourHW1code with custom routes for query processing
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
20

Plan for today
n Cloud computing
n Utility computing model n Brief overview of EC2
n Transactions n ACID properties
n Serializability
n Concurrency control: 2PL
n Distributed transactions: 2PC
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
21
NEXT

© 2020 A. Haeberlen, Z. Ives, V. Liu
We need more than individual RPCs
n What needs to happen when you… n Click on “purchase” on Amazon?
Suppose you purchased by credit card?
n Use online bill-paying services from your bank?
n Place a bid in an eBay-like auction system?
n Order music from iTunes?
What if your connection drops in the middle of downloading?
n Why can’t you just make a series of Web Service-like calls?
22

© 2020 A. Haeberlen, Z. Ives, V. Liu
A problem confronted by eBay
n eBay wants to sell an item to:
n The highest bidder, once the auction is over, or
n The person who’s first to click “Buy It Now!”
n But: What if the bidder doesn’t have the cash?
n A solution:
n Tentatively record the item as sold
n Validate the PayPal or credit card info with a 3rd party
n If not valid, discard this bidder and resume in prior state
23

“No Payment” isn’t the only source of failure
n Suppose we start to transfer the money, but a server goes down…
CRASH!
Purchase:
Write Item.sold = True Write Buyer.bal= Buyer.bal – $100 Write Seller.bal= Seller.bal + $100
© 2020 A. Haeberlen, Z. Ives, V. Liu
24

Transactions
n There are many (especially, financial)
applications where we want to create atomic
operations that either commit or roll back
n … despite hardware/software/application/other failures
n … whether or not other operations are executed concurrently
n This is one of the most basic services provided by database management systems, but we want to do it in a broader sense
n Part of “ACID” semantics… © 2020 A. Haeberlen, Z. Ives, V. Liu
25

ACID Semantics
n Atomicity: operations are atomic, either committing or aborting as a single entity
n Consistency: the state of the data is internally consistent
n Note: NOT like the consistency of “sequential consistency” n Isolation: all operations act as if they were
run by themselves
n Durability: all writes stay persistent!
n What would a violation of each property
look like? © 2020 A. Haeberlen, Z. Ives, V. Liu
26

Serializable Isolation (Serializability)
Node #1: Node #2:
Single Node:
T2 T4
T1 T3
T5
T6
Actual execution
Time
Hypothetical execution
Time
T1
T2
T3
T6
T4
T5
Same start state
Same result
n Recall the purpose of sequential consistency n Result is the same as if the transactions had
executed in some sequential order
n Related, but not exactly the same. Why?
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
27

Plan for today
n Cloud computing
n Utility computing model n Brief overview of EC2
n Transactions n ACID properties
n Serializability
n Concurrency control: 2PL
n Distributed transactions: 2PC
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
28
NEXT

© 2020 A. Haeberlen, Z. Ives, V. Liu
Concurrency control
n A means of ensuring that transactions are serializable
n There are many methods, of which we’ll
see one
n Lock-based concurrency control (2-phase locking)
n Optimistic concurrency control (no locks – based on
timestamps)
n Multiversion CC n…
29

Recall: Locking
Critical section
n Idea: Implement locks
n If LOCK(X) is called and X is not locked, lock X and continue n If LOCK(X) is called and X is locked, wait until X is unlocked n If UNLOCK(X) is called and X is locked, unlock X
n From Lecture 3, different ways to use locks
void transferMoney(customer A, customer B, int amount)
{
showMessage(“Transferring “+amount+” to “+B);
int balanceA = getBalance(A);
int balanceB = getBalance(B);
setBalance(B, balanceB + amount);
setBalance(A, balanceA – amount);
showMessage(“Your new balance: “+(balanceA-amount));
}
n One for each critical section
n One for each object
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
30

Two-phase locking (2PL)
© 2020 A. Haeberlen, Z. Ives, V. Liu
Strict time Non-strict time
n Strict Two-phase Locking (Strict 2PL) Protocol:
n Each transaction must obtain:
n aS(shared)lockonobjectbeforereading
n anX(exclusive)lockonobjectbeforewriting
n AnownerofanSlockcanupgradeittoXifnooneelseis holding the lock
n All locks held by a transaction are released when the transaction completes
n Locksarehandledina“growing”phase,thena“shrinking”phase
n (Non-strict)2PLVariant:Releaselocksanytime,butcannotacquire locks after releasing any lock.
#locks held
#locks held

© 2020 A. Haeberlen, Z. Ives, V. Liu
Aborting a transaction
Strict 2PL
n Shadowdata:throwitaway
n Logging:rollbackloggedchanges
Non-strict 2PL:
n If a transaction Ti is aborted, all its actions have to be undone
n Not only that, if Tj reads an object last written by Ti, Tj must be aborted as well!
n Actionsareundonebyconsultingatransactionlog n Intheworstcase:cascadingaborts

© 2020 A. Haeberlen, Z. Ives, V. Liu
A danger with locks: Deadlocks
n Deadlock: Cycle of transactions waiting for locks to be released by each other
n A waits for a lock that B holds n B waits for a lock that A holds
n Three ways of dealing with deadlocks: n Deadlock prevention
n Deadlock avoidance n Deadlock detection

© 2020 A. Haeberlen, Z. Ives, V. Liu
Deadlock avoidance
n Need to break symmetry somehow!
n When we notice that a cyclic dependency may be about to form (unsafe!),
we abort one of the involved transactions.
n Idea: Assign priorities based on timestamps
n Let’s say we have two transactions: Told and Tyoung
n Wait-die approach:
n If Told wants a lock Tyoung holds, Told waits. n If Tyoung wants a lock Told holds, Tyoung dies.
n Wound-wait approach:
n If Told wants a lock Tyoung holds, Tyoung dies.
n If Tyoung wants a lock Told holds, Tyoung waits.
n Older transactions never have to wait for younger transactions!
n If a transaction re-starts, make sure it keeps its original timestamp. Why?

Plan for today
n Cloud computing n Transactions
n ACID properties
n Serializability
n Concurrency control: 2PL
n Distributed transactions: 2PC
n Consensus n Paxos
n Fault models
n Byzantine Fault Tolerance
NEXT
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
35

Distributed Commit
Let’s COMMIT!
What happened?
But I already aborted!
n Terminology:
n Node at which a transaction originates is the coordinator n Other nodes at which it executes are subordinates
n Subordinate can’t independently abort the transaction (e.g., because of a deadlock)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
36
OK!
COMMIT
COMMIT

© 2020 A. Haeberlen, Z. Ives, V. Liu
Two-Phase Commit (2PC)
n Two rounds of communication, initiated by the coordinator:
n Phase #1: Voting
n Coordinator sends prepare messages, waits for yes or no votes
n Phase #2: Decision or termination
n Coordinator sends commit or abort messages, waits for acks
n All messages are logged
n Any site can decide to abort a transaction!

© 2020 A. Haeberlen, Z. Ives, V. Liu
Steps in 2PC
n When a transaction wants to commit:
n Coordinator sends prepare message to each subordinate
n Subordinate logs “abort” or “prepare” and then sends a no (abort) or yes (prepare) message to coordinator
n Coordinator considers votes:
n If unanimous yes votes, logs “commit” and sends commit
message to all subordinates
n Else, logs “abort”, and sends abort message
n Subordinates log abort/commit based on message they get, then send ack message to coordinator
n Coordinator writes “end” log record after getting all acks
38

Illustration of 2PC
Coordinator log begin
send “prepare”
send “yes”
log commit
send “commit”
send “ack” log end
Subordinate 1
send “prepare”
log prepare send “yes”
send “commit” log commit
send “ack”
Subordinate 2
log prepare
log commit
© 2020 A. Haeberlen, Z. Ives, V. Liu
39

What if a node reboots in the middle?
n Suppose we find a commit or abort log for transaction T, but not an end record?
n Needtoredo/undoTdependingonthelog
n IfthisnodeisthecoordinatorforT,keepsendingcommit/abort
msgs to subordinates until acks have been received
n Suppose we find a prepare log record for transaction T, but not commit/abort?
n ThisnodeisasubordinateforT
n RepeatedlycontactthecoordinatortofindstatusofT,thenwrite
commit/abort log record; redo/undo T; and write end log record n Suppose we don’t find even a prepare record for T?
n UnilaterallyabortandundoT
n This node may be coordinator! If so, subordinates may send
messages and need to also be undone © 2020 A. Haeberlen, Z. Ives, V. Liu

© 2020 A. Haeberlen, Z. Ives, V. Liu
Handling subordinate failures
n What should a node do when a remote node
does not respond during the commit protocol
for transaction T?
n If we haven’t sent commit or abort, just abort
n Else, we know the result (should still keep trying though)
n Ack msgs used to let coordinator know when it’s done with a transaction; until it receives all acks, it must keep T in the transaction- pending table

© 2020 A. Haeberlen, Z. Ives, V. Liu
Handling coordinator failures
n If it has not yet voted yes, it should abort T
n If it has voted yes? n We don’t know!
n We need to wait until the coordinator recovers – T is blocked! n Consequences?
n What if all the subordinates knew each other? n If one voted no, we can abort
n If one already received a commit/abort, we can use that
n If all voted yes, but did not receive a decision, we are still blocked!