EECS 485 Lecture 20
Scaling Storage
Midterm Student Feedback: Working Well
• Drawings
• Asking and answering questions
Midterm Student Feedback: Improvement
• Exam-style questions at end of lecture
• Piazza answers: specific responses, everything
• Learning more about provided code in projects
• Legible handwriting
Story: US/Europe Data Privacy
• Can you move data from one region with more data protection laws (Europe) to one with fewer (US)?
• Previous law: “The new safeguards include a greater say for Europeans on how their information is used, the right to go to American courts when people think companies or the United States government may have misused their data, and written guarantees from American officials that government agencies will not indiscriminately collect and monitor Europeans’ data without cause.” (NYTimes)
• Struck down in 2020 by European court, new agreement last Friday
Story: Student Records
• Your student records are protected under FERPA
• Health data protected under HIPAA
• “When students seek medical care, including mental-health care, on campus, their records are not nearly as well-protected as you might think. Rather than being covered by Hipaa (the Health Insurance Portability and Accountability Act), the law that protects medical privacy, the records of students who seek care at on-campus student-health centers are covered only by Ferpa (the Family Educational Rights and Privacy Act), the law that protects student privacy.”
• “Ferpa allows on-campus health centers to release a student’s medical records, including mental-health records, to college officials under certain circumstances. One of those circumstances is when the student and the institution are involved in litigation.”
• Chronicle: “When Universities Raid Student Therapy Records”
Learning Objectives
• List common database management systems and their abilities
• State the CAP Theorem and give a description of the systems it describes
• Describe how distributed file systems are used for static file storage and serving
Distributed databases
Review: Scaling dynamic pages
Network database
• Many dynamic pages servers make network requests to one network database
• Network database: database software runs on a different server on the network
• PostgreSQL, MySQL, et al.
• Centralized network database: one server
• Distributed network database: multiple servers
Distributed network database
• Distributed network database: Many network database servers
• Problem: How to keep database servers in sync?
• Solution: Data consistency is hard; the approach depends on which servers contain which data
• Problem: Which servers contain which data?
• Solution 1: Sharding by content
• Solution 2: Database replication
Sharding by content
• Different rows or tables on different DB servers
• Weaknesses
• Increase latency (might need to go through a master, might need to search multiple shards)
• Some searches slow
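The routing idea behind sharding can be sketched in a few lines. This is a minimal illustration, not how any particular database does it: the shard names and the choice of a hash of the key are assumptions made for the example. It shows why a lookup by the shard key touches one server while any other search has to visit every shard.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database hosts.
SHARDS = ["db0", "db1", "db2"]


def shard_for(key):
    """Pick a shard from a stable hash of the key (e.g., a username)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def shards_for_query(key=None):
    """A lookup by shard key touches one shard; any other search touches all."""
    if key is not None:
        return [shard_for(key)]
    return list(SHARDS)


print(shards_for_query("some_username"))  # exactly one shard
print(shards_for_query())                 # every shard must be searched
```

The second call is the "some searches slow" case: without the shard key, the query fans out to all servers.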
Database replication
• Multiple copies of the entire database
• Downside: all copies need to maintain same state
• Write to one needs to propagate to all the others
• Moments of inconsistency: post shows up for some people and not others
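A toy model can make the moment of inconsistency concrete. The classes below are hypothetical and only simulate replication in memory; real databases propagate writes over the network with their own protocols. The point is that a read from a replica that has not yet received the write returns stale data.

```python
class Replica:
    """One full copy of the (tiny) database."""

    def __init__(self):
        self.posts = []


class ReplicatedDB:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, num_replicas):
        self.replicas = [Replica() for _ in range(num_replicas)]
        self.pending = []  # (target replica index, post) not yet propagated

    def write(self, replica_index, post):
        self.replicas[replica_index].posts.append(post)
        for i in range(len(self.replicas)):
            if i != replica_index:
                self.pending.append((i, post))

    def read(self, replica_index):
        return list(self.replicas[replica_index].posts)

    def propagate(self):
        """Deliver every pending write to its target replica."""
        for i, post in self.pending:
            self.replicas[i].posts.append(post)
        self.pending.clear()


db = ReplicatedDB(2)
db.write(0, "new post")
before = (db.read(0), db.read(1))  # replica 1 has not seen the post yet
db.propagate()
after = (db.read(0), db.read(1))   # both replicas now agree
print(before)
print(after)
```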
Problem with many servers
• 2 different clients connect to 2 servers and write the same data
• E.g., like the same Instagram post
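The like example can be sketched with two dictionaries standing in for two database servers. The key name and the naive "copy one server's value to the other" synchronization are assumptions chosen to show the problem, not a description of a real system.

```python
# Two replicas start in sync with the same like count for one post.
server_a = {"post_likes": 10}
server_b = {"post_likes": 10}

# Two clients like the post at the same time, each through a different server.
server_a["post_likes"] += 1
server_b["post_likes"] += 1

# Naive synchronization: copy server A's value over server B's.
server_b["post_likes"] = server_a["post_likes"]

print(server_a["post_likes"])  # 11, although two likes were recorded
```

One of the two likes is lost, which is why concurrent writes to different servers need more care than simply copying state.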
Practical database examples
• We’ll compare 3 widely used software packages:
• SQLite
• MySQL
• PostgreSQL
SQLite strengths and weaknesses
• Serverless
• File-based
• Low memory
• Easy setup
Weaknesses
• No network
• No replication
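The "serverless" and "file-based" points can be seen with Python's built-in `sqlite3` module. In this small sketch the table name and columns are made up for illustration; the database is a single ordinary file and no server process is involved.

```python
import os
import sqlite3
import tempfile

# The whole database is one ordinary file; no server process is started.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite3")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE posts (postid INTEGER PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO posts (text) VALUES (?)", ("hello",))
conn.commit()
conn.close()

# Reopening the same file sees the data written above.
conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT postid, text FROM posts").fetchall()
conn.close()

print(rows)
print(os.path.exists(db_path))
```

Because the database is just a local file, other machines cannot connect to it over the network, which is the weakness listed above.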
MySQL strengths and weaknesses
• Distributed network database
• Fastest read operations
Weaknesses
• Partial SQL standard compliance
• Slow development
PostgreSQL strengths and weaknesses
• Distributed network database
• SQL compliance
• Balanced read and write optimization
• Stronger consistency with simultaneous reads & writes
Weaknesses
• Reads not as fast as MySQL
SQLite vs. MySQL vs. PostgreSQL
• SQLite: single-user applications; small data sets
• MySQL: distributed applications; large datasets; fastest possible reads
• PostgreSQL: distributed applications; large datasets; balanced reads and writes; stronger consistency with simultaneous reads & writes
CAP Theorem
CAP theorem overview
• CAP Theorem describes options when designing a distributed database
CAP theorem
• 3 properties we might want our database to have
• Consistency: Every read receives the most recent write or an error
• Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
• Partition-tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
• CAP Theorem: you can have CP or AP but not CAP
Say hi to partner
Put in your own words
• Consistency
• Availability
• Partition-Tolerance
Network working correctly
• When the network is working correctly, you get Consistency and Availability.
• Database servers can synchronize
• There is no partition
Network partition
• When a network failure occurs, database servers cannot synchronize
• This is a partition
• We have a choice
• Return an incorrect value
• Return an error
• If the system is consistent and partition-tolerant, it cannot be available
• What does it mean to not be available?
• What does a CP (no A) system do when there is a partition?
• If the system is available and partition-tolerant, it cannot be consistent
• What does it mean to not be consistent?
• What does an AP (no C) system do when there is a partition?
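One way to explore the questions above is a small simulation. Everything here is hypothetical (the `Node` class, the `"CP"`/`"AP"` mode flag, the `Unavailable` exception); it only models the choice a node faces when it cannot reach its peers: refuse to answer, or answer with data that may be stale.

```python
class Unavailable(Exception):
    """Raised by a CP node that refuses to answer during a partition."""


class Node:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "old"
        self.partitioned = False  # True when this node cannot reach its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: give up availability rather than risk a stale answer.
            raise Unavailable("cannot confirm the latest write")
        # AP (or no partition): answer with whatever this node has.
        return self.value


def write_to_primary(primary, others, value):
    """Apply a write and copy it to every node that is still reachable."""
    primary.value = value
    for node in others:
        if not node.partitioned:
            node.value = value


def demo(mode):
    primary, secondary = Node(mode), Node(mode)
    secondary.partitioned = True          # network failure isolates the secondary
    write_to_primary(primary, [secondary], "new")
    try:
        return secondary.read()
    except Unavailable:
        return "error"


print(demo("AP"))  # old   (available, but not consistent)
print(demo("CP"))  # error (consistent, but not available)
```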
Which to choose?
• Prioritize consistency over scaling
• Relational databases (AKA RDBMS AKA SQL) usually here
• PostgreSQL: synchronous replication
• Prioritize scaling over consistency
• “NoSQL” databases
• MongoDB: master which is consistent, secondaries that might have stale data
• Avoid the problem
• Buy a giant server for your DB so there’s only one
• Centralized network database
Distributed File Systems
Media uploads and the database
• Media uploads: Instagram photos, TikTok videos, etc.
• Problem: Media uploads are expensive to store in a database
• Solution: Store to disk
• Write once, read many times. Don’t need a database to maintain consistency.
Scaling media uploads
• Each request may go to a different dynamic pages server
• Physical machines, virtual machines, containers
• Problem: How do dynamic pages server disks stay in sync?
• Solution: Dynamic pages servers are stateless. Store media uploads on network storage
Properties of media uploads
• Media uploads have something in common with both static pages and dynamic pages
• Media uploads are like static pages
• Once they’re uploaded, they never change
• Media uploads are like dynamic pages
• Content created by users
• Access permissions per-user
Media storage implementation
• Two subproblems for media storage
• Problem 1: Store the files
• Many files
• Large size
• Exabytes of data (10^18 bytes)
• Problem 2: Serve the files
• Many requests
• Permissions control per-user
Store the files
• Problem 1: Store files
• Solution 1: Network file system
• Solution 2: Distributed file system
Network file system
• Network file system: Access files over the network much like local storage (hard drive)
Network file system PaaS
• PaaS managed network file systems
• AWS Elastic File System (EFS)
• Google Filestore
• Microsoft File Storage
Network file system pros and cons
• Network file system acts like a local file system
• Near zero code changes
• Problem 1: Speed and scalability
• Problem 2: Fault tolerance. What happens when the network file system server goes down?
• Network file systems weren’t designed with the scale of the web in mind
Distributed file system
• Distributed file system: Access files over the network, often with a special API
• Reliable, scalable file storage implemented with commodity hardware
• Google File System is an historical example
Distributed file system PaaS
• PaaS managed distributed file systems
• Google Cloud Storage
• Microsoft Blob Storage
Distributed file system pros and cons
• Distributed file systems are more scalable and fault tolerant than network file systems
• Distributed file systems often use a special API
• Requires code changes
Connecting dynamic pages servers to distributed file system
• Our data is now stored in a distributed file system
• User connects to dynamic pages web server
• Problem: How do we upload and download?
• Solution 1: Dynamic pages server is a middleman between client and distributed file system
• Solution 2: Client connects directly to distributed file system
Media uploads proxy
• Client sends upload to dynamic pages server
• Dynamic pages server sends upload to distributed file system
• Example: AWS S3
• Dynamic pages server is “Middleman” or “proxy”
Media uploads proxy pros and cons
• Easy to implement
• Problem: Large file transfer goes through multiple servers. Higher latency, higher bandwidth use
• Solution: Direct upload
Direct upload
• Direct upload from client to distributed file system
• Dynamic pages server generates signed URL for client
• Client uploads large file directly to distributed file system
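The signed URL step can be sketched with an HMAC. This is an illustrative scheme only, not the format used by any real storage service: the shared secret, the host name, and the `expires`/`signature` query parameters are all assumptions. The dynamic pages server signs the method, path, and expiry time after checking the user's permissions; the storage service recomputes the signature when the client connects directly.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"shared-secret"                  # hypothetical key shared by web server and storage
STORAGE_HOST = "https://storage.example.com"   # hypothetical storage endpoint


def _signature(method, path, expires):
    message = f"{method}\n{path}\n{expires}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_signed_url(method, path, ttl_seconds=300, now=None):
    """Run by the dynamic pages server after it has checked permissions."""
    start = now if now is not None else time.time()
    expires = int(start) + ttl_seconds
    query = urlencode({
        "expires": expires,
        "signature": _signature(method, path, expires),
    })
    return f"{STORAGE_HOST}{path}?{query}"


def verify_signed_url(method, url, now=None):
    """Run by the storage service when the client connects directly."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False  # URL has expired
    expected = _signature(method, parsed.path, expires)
    return hmac.compare_digest(signature, expected)


url = make_signed_url("PUT", "/uploads/cat.jpg", ttl_seconds=60)
print(url)
print(verify_signed_url("PUT", url))
```

Because the signature covers the method, path, and expiry, a client cannot reuse the URL for a different file, a different operation, or after the time limit, and the large file never passes through the dynamic pages server.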
Why not use a CDN?
• What would be the advantages and disadvantages if we used a CDN to distribute media uploads?
• Advantages: fast
• Disadvantages:
• Content changes a lot, CDNs best for content that changes infrequently
• How to prevent users from accessing media they aren’t supposed to?
• Huge size -> expensive, maybe not possible
Exam-Style Question
A distributed network database has 2 servers. All queries starting with A-M are failing, but queries starting with N-Z return correctly. Which of the CAP theorem properties most likely apply?
A. Consistent and partition-tolerant, but not available
B. Available and partition-tolerant, but not consistent
Learning Objectives
• List common database management systems and their abilities
• State the CAP Theorem and give a description of the systems it describes
• Describe how distributed file systems are used for static file storage and serving