Unstructured Data Example
Opinion pieces, box scores, summaries, ads, user comments, etc.
1 website, many types of data


Unstructured Data Example cont.
Business goal: Identify popular players
Download each web page (or frame) and store as a file
file1.txt → “……”
file2.txt → “……..”
…
Find out which players' names appear most frequently
How would you solve this problem?
Hire a data scientist to:

Real Map Reduce: Player name mentions
Solve the problem with MapReduce, assuming a function:
bool isPlayerName (String s);
→ true if s is the name of a current WNBA player
→ false otherwise
(on the board)
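The board solution is not reproduced in these notes. A minimal sketch of one way it might look, in Python rather than the pseudocode above: the mapper emits (name, 1) for each mention and the reducer sums the counts. The roster `KNOWN_PLAYERS`, the sample file contents, and the in-process `run_job` driver are all hypothetical stand-ins, and names are assumed to be single tokens.

```python
from collections import defaultdict

# Hypothetical roster standing in for the isPlayerName(s) function above.
KNOWN_PLAYERS = {"Rivera", "Okafor"}


def is_player_name(s):
    """Return True if s is a known player name (assumed single-token)."""
    return s in KNOWN_PLAYERS


def map_fn(filename, text):
    """Emit (name, 1) for every token in the file that is a player name."""
    for token in text.split():
        word = token.strip('.,;:!?"()')
        if is_player_name(word):
            yield (word, 1)


def reduce_fn(name, counts):
    """Sum all the 1s emitted for a single player name."""
    yield (name, sum(counts))


def run_job(files):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for filename, text in files.items():
        for key, value in map_fn(filename, text):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results


# Made-up page contents for illustration only.
files = {
    "file1.txt": "Rivera scored 30. Rivera and Okafor combined for 12 assists.",
    "file2.txt": "Fans cheered as Okafor hit the winner.",
}
mention_counts = run_job(files)
```

Swapping in a different `is_player_name` changes the domain without touching the mapper or reducer.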

Unstructured Data Example cont.
Oh, excuse me. Did I say WNBA? I meant NBA. No, I meant all professional sports? No, I meant all corporate entities?
The same problem is relevant across many domains
At some point, data doesn’t fit on your laptop
How do mappers & reducers find the files they need?
• A distributed file system is the answer, e.g., Hadoop Distributed File System (HDFS)
Don’t move data to workers… move workers to the data!
• Store data on the local disks of nodes in the cluster
• Start up the workers on the node that holds the data locally
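The "move workers to the data" idea can be sketched as a small scheduling routine. This is a toy illustration, not how Hadoop's scheduler is implemented: the function `assign_tasks`, the block and node names, and the slot counts are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy locality-aware task assignment (toy sketch).

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node -> number of free task slots
    Returns a dict of block id -> (chosen node, whether the read is local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, nodes in block_locations.items():
        # Prefer a node that already stores this block on its local disk.
        chosen = next((n for n in nodes if slots.get(n, 0) > 0), None)
        is_local = chosen is not None
        if chosen is None:
            # Fall back to any node with a free slot; data must cross the network.
            chosen = next((n for n, free in slots.items() if free > 0), None)
            if chosen is None:
                raise RuntimeError("no free task slots left")
        slots[chosen] -= 1
        assignment[block] = (chosen, is_local)
    return assignment


block_locations = {
    "blk_1": ["node_a", "node_b"],
    "blk_2": ["node_a"],
    "blk_3": ["node_c"],
}
free_slots = {"node_a": 1, "node_b": 1, "node_c": 1}
plan = assign_tasks(block_locations, free_slots)
```

Here `blk_1` and `blk_3` are read locally, while `blk_2` ends up on `node_b` because `node_a` was already taken. A greedy pass like this depends on the order in which blocks are considered; a smarter scheduler could have placed `blk_1` on `node_b` and kept all three reads local.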

Map Reduce
Programmers specify two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
• All values with the same key are sent to the same reducer
The execution framework handles everything else…
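To make the "same key, same reducer" guarantee concrete, here is a minimal single-process sketch of a map, shuffle, and reduce pipeline. The names `partition` and `run_mapreduce` are hypothetical, and hashing the key with CRC32 is just one simple way to route keys; a real framework does this across machines.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=3):
    """Run map, shuffle, and reduce in one process (toy sketch)."""
    # Map phase: apply map_fn to every (k1, v1) input record.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle: route each (k2, v2) pair to a reducer and group values by key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for k2, v2 in intermediate:
        reducer_inputs[partition(k2, num_reducers)][k2].append(v2)

    # Reduce phase: each reducer sees its keys in sorted order.
    output = []
    for groups in reducer_inputs:
        for k2 in sorted(groups):
            output.extend(reduce_fn(k2, groups[k2]))
    return output


def count_map(_, line):
    """map (k1, v1) -> [(word, 1)] for each word in the line."""
    return [(word, 1) for word in line.split()]


def count_reduce(word, ones):
    """reduce (k2, [v2]) -> [(word, total)]."""
    return [(word, sum(ones))]


result = dict(run_mapreduce([(0, "a b a"), (1, "b a c")], count_map, count_reduce))
```

The programmer writes only `count_map` and `count_reduce`; everything inside `run_mapreduce` stands in for the execution framework.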

“Everything Else”
The execution framework handles everything else…
• Scheduling: assigns workers to map and reduce tasks
• “Data distribution”: moves processes to data
• Synchronization: gathers, sorts, and shuffles intermediate data
• Errors and faults: detects worker failures and restarts
You don’t know:
• Where mappers and reducers run
• When a mapper or reducer begins or finishes
• Which input a particular mapper is processing
• Which intermediate key a particular reducer is processing
What can you do?
• Cleverly structure intermediate data to reduce network traffic
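One common way to structure intermediate data so that less of it crosses the network is to aggregate inside the mapper before emitting anything. The sketch below compares a naive mapper with such a version; the function names and sample text are hypothetical.

```python
from collections import Counter


def naive_map(text):
    """Emit one (word, 1) pair per occurrence."""
    return [(word, 1) for word in text.split()]


def combining_map(text):
    """Aggregate locally first, then emit one (word, count) pair per distinct word."""
    local_counts = Counter(text.split())
    return list(local_counts.items())


text = "a b a a b"
naive_pairs = naive_map(text)         # one pair per token
combined_pairs = combining_map(text)  # one pair per distinct word
```

A reducer that sums the values produces the same totals either way, but the second mapper ships 2 pairs instead of 5. This works because addition is associative and commutative; an operation such as a plain average would need a different intermediate form, for example (sum, count) pairs.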

Distributed File Systems
Companies like Google, Apple and Facebook run map reduce jobs over 10,000+ machines at multiple locations (called datacenters) hourly!
Challenge: Improve throughput for distributed file systems (and hence map reduce)

Namenode Responsibilities
Managing the file system namespace:
• Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
Coordinating file operations:
• Directs clients to datanodes for reads and writes
• No data is moved through the namenode
Maintaining overall health:
• Periodic communication with the datanodes
• Block re-replication and rebalancing
• Garbage collection
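The responsibilities above can be illustrated with a toy metadata service. This is a simplified sketch and not HDFS code: the class `ToyNamenode`, its methods, the node names, and the replication factor of 2 are all hypothetical. It only updates metadata, whereas a real namenode would also instruct datanodes to copy the block contents.

```python
class ToyNamenode:
    """Keeps metadata only; file contents never pass through this object."""

    def __init__(self, replication=2):
        self.replication = replication
        self.file_to_blocks = {}  # path -> ordered list of block ids
        self.block_to_nodes = {}  # block id -> set of datanodes holding a replica
        self.live_nodes = set()

    def register(self, node):
        """Record a datanode as alive (stands in for periodic heartbeats)."""
        self.live_nodes.add(node)

    def add_file(self, path, blocks):
        """blocks: dict of block id -> iterable of datanodes storing it."""
        self.file_to_blocks[path] = list(blocks)
        for block, nodes in blocks.items():
            self.block_to_nodes[block] = set(nodes)

    def locate(self, path):
        """Tell a client which datanodes to contact for each block of a file."""
        return [(b, sorted(self.block_to_nodes[b])) for b in self.file_to_blocks[path]]

    def handle_node_failure(self, dead):
        """Drop a dead node and re-replicate any under-replicated blocks."""
        self.live_nodes.discard(dead)
        for nodes in self.block_to_nodes.values():
            nodes.discard(dead)
            candidates = sorted(self.live_nodes - nodes)
            while len(nodes) < self.replication and candidates:
                nodes.add(candidates.pop(0))


nn = ToyNamenode(replication=2)
for node in ("dn1", "dn2", "dn3"):
    nn.register(node)
nn.add_file("/logs/file1.txt", {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]})
nn.handle_node_failure("dn2")
locations = nn.locate("/logs/file1.txt")
```

After `dn2` fails, both blocks are back to two replicas, now on `dn1` and `dn3`, and a client asking for the file is pointed only at live datanodes.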
