Big Data hype!
So what is Big Data?
Data whose capturing, storage, curation, management, and processing are not possible with traditional tools within a tolerable amount of time
So what is Big Data?
Very large amounts of data. Right? :-\
So what is Big Data?
• Volume (thank you, captain obvious)
• Complexity: entities, data, hierarchies, links, relationships, …
• Variety: Structured, semi-structured, unstructured
– Text, video, audio, photos
– Wikipedia pages: text + infoboxes
– Microformat, microdata, (schema.org), …
• Velocity: near-real time
– Storage, querying, analytics, …
• Veracity: Is it really the “true” data? errors? alterations?
• Variability: data flows, peaks & valleys, …
• Vendetta: any of these can break a traditional approach
Something old
Old news to say the least
• We have always struggled with very large data
• The first International Conference on Very Large Data Bases (VLDB) was held in…
a. ‘70s b. ‘80s c. ‘90s d. ‘00s
• Then: mini-computers, mainframes, …
• Afterwards: servers, clusters, …
• Nowadays: The Cloud
Old news at least…
[Figure: two curves over time (t1–t4): data size / availability / data use on one, size of required computing resources on the other, both growing toward the Cloud]
Recurring nightmares…
• Scalability
– To scale up or to scale out?
• Availability and Fault Tolerance
– More on these shortly
• Consistency
– Big headache…
• Performance
– Latency / throughput
– Selectivity
– Access patterns
– Transaction processing vs analytics
Physical Storage for a Database
• Fundamental Limitation: Database is too large to fit in RAM
– By default, data lives on secondary storage
• But the CPU cannot directly access data not in RAM
– Data must first be loaded into main memory from secondary storage, then fetched into registers to be processed
• Secondary storage access time becomes the bottleneck
– Data access takes about 10ms–60ms, instead of 30–120ns
• Alleviating this bottleneck is the subject of much research into data storage and database systems
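To get a feel for the size of that gap, here is a quick back-of-the-envelope calculation in plain Python, using the illustrative latencies quoted above (about 10ms per random disk access vs. about 100ns per RAM access):

# Back-of-the-envelope: how much slower is secondary storage than RAM?
# Latencies are illustrative values from the slide above.
disk_access_s = 10e-3  # ~10 ms per random disk access
ram_access_s = 100e-9  # ~100 ns per RAM access (within the 30-120ns range)
ratio = disk_access_s / ram_access_s
print(f"One disk access costs roughly {ratio:,.0f} RAM accesses")
# -> One disk access costs roughly 100,000 RAM accesses

Every avoided disk access is therefore worth on the order of a hundred thousand in-memory operations, which is why buffer management and caching dominate this line of research.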
What if the data is too large to fit on a single computer?
System Latencies
Source: B. Gregg, “Systems Performance: Enterprise and the Cloud”, Prentice Hall, 2013, and A. Silberschatz et al., “Operating System Concepts”, 9th Ed., 2012.
Storage challenges
Storage layer properties:
– Efficient
– Scalable
– Highly available
– Fault tolerant
– Consistent
[Slide annotations: TOO COSTLY, TOO DESIRABLE, TOO FREQUENT]
Processing challenges
Google has “spoiled” us…
– 47% of users expect web pages to load in under 2 seconds
– 40% of users abandon a website that takes more than 3 seconds to load
– A 1-second delay decreases customer satisfaction by 16%
Processing challenges
Big Data processing is a complex task…
– Large/fast/incomplete/… amounts of data…
– Complex algorithms…
– Possibly requiring multiple passes over the data…
– Across large or geo-distributed datacentres…
and the clock is ticking
Availability
• Example:
– Assume a cluster with 10 PCs
– Assume that they all need to be up for a service to work
– Assume that the probability of any one being down at any point in time is 1%
– What is the probability that the service is up (i.e., available) at a random time point?
– P(PC is up) = 1 – P(PC is down) = 1 – 0.01 = 0.99
– P(service up) = P(all PCs up) = P(1st PC up AND 2nd PC up AND …)
– … = P(1st PC up) × P(2nd PC up) × … = (P(PC is up))^10
– … = (1 – P(PC is down))^10 = 0.99^10 ≈ 0.90 (= 90%)
• With 100 PCs?
– … = (1 – P(PC is down))^100 = 0.99^100 ≈ 0.37 (= 37%)
• With 10,000 PCs?
– … = (1 – P(PC is down))^10,000 = 0.99^10,000 ≈ 2.2e-44 (= 2.2e-42 %)
– i.e., 0.0000…0022, with 43 zeroes between the decimal point and the first significant digit
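The same calculation as a minimal Python sketch, using the figures from the example above (a 1% per-machine failure probability, and all n machines needed for the service):

# Availability of a service that requires ALL n machines to be up,
# when each machine is independently down with probability p.
def service_availability(n, p=0.01):
    return (1 - p) ** n

for n in (10, 100, 10_000):
    print(f"{n} PCs: {service_availability(n):.3g}")
# -> 10 PCs: 0.904
# -> 100 PCs: 0.366
# -> 10000 PCs: 2.25e-44

The independence of failures is the example's simplifying assumption; real clusters see correlated failures (racks, switches, power), but the qualitative lesson stands: at scale, a service that needs every machine to be up is almost never up.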
Something new
Why me? Why this? Why now?
Data exhaust
Data hoarding
Needle in a haystack
One man’s “trash” is another man’s gold
“Mind the Trash Gap”
[Figure: total data vs. relevant data over time (t1–t4); the gap keeps widening]
Something borrowed
Essential Big Data tools
A framework for massively-parallel access to huge volumes of data, including
– A programmatic interface: MapReduce / Spark
– A baseline (distributed) file system (HDFS)
– A back-end storage system
• Scalable! Scalable! Scalable!
• Random and sequential accesses
• More than just a file system …
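To make the MapReduce programming model concrete, here is a toy word-count sketch in plain Python. It mimics the map / shuffle / reduce phases in a single process and is only an illustration, not the actual Hadoop or Spark API:

from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values emitted for one key.
    return (word, sum(counts))

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):  # map
            groups[key].append(value)    # shuffle: group values by key
    return [reduce_fn(k, v) for k, v in groups.items()]  # reduce

print(mapreduce(["big data big", "data tools"]))
# -> [('big', 2), ('data', 2), ('tools', 1)]

The real frameworks run the same map and reduce logic in parallel across a cluster, reading input splits from HDFS and shuffling intermediate pairs over the network, which is exactly why the storage-layer properties above matter so much.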