Introduction Big data MapReduce Hadoop Applications Example Google Trends
CORPFIN 2503 – Business Data Analytics: Big data
Week 12: October 25th, 2021
£ius CORPFIN 2503, Week 12 1/45
Big data refers to data sets of large volumes that are beyond the limits of commonly used desktop database and analytical applications.
• CERN’s Large Hadron Collider data centre processes on average one petabyte (about one million gigabytes) of data per day.
• Facebook, Google, and Walmart generate data in petabytes every day (1 PB = 1,024 TB).
Big data II
Web searches of big data in Google over time:
Why has big data become so popular in recent years?
Big data III
A few reasons, including:
• Decreased hardware costs, e.g. RAM prices:
Price of 1 MB
$1.12 $0.185 $0.0122
Source: p. 32 in Big Data, Data Mining, and Machine Learning by J. Dean.
• Improved hardware (e.g., faster CPUs)
• Networks became faster
• Advances in analysis methods and programming.
Walmart’s example
‘Each week, nearly 265 million customers and members visit approximately 11,500 stores under 56 banners in 27 countries and eCommerce websites.’
(Source: https://corporate.walmart.com/our-story)
Suppose Walmart wants to know its customers’ buying patterns for the last month. A conservative estimate of the sample size is:
(265m customers) × (2 visits per month) × (20 items bought per visit) = 10.6b rows
Suppose their data set has 6 columns only: date, item, number of items, price, discount, net price.
Walmart’s example II
If we assume that a data file of 10,000 rows with 6 columns per row takes 210 KB, then Walmart’s monthly data file is about 212 GB.
An Excel worksheet holds only around 1 million rows (1,048,576), far fewer than the 10.6 billion rows here.
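The arithmetic behind the Walmart estimate can be checked with a short sketch. All inputs come from the slides; the only added assumptions are that 1 GB is taken as 1,024 × 1,024 KB (which reproduces the 212 GB figure) and that an Excel worksheet holds at most 1,048,576 rows.

```python
# Back-of-the-envelope estimate of Walmart's monthly data volume.
customers = 265_000_000          # weekly customers, used as a conservative monthly base
visits_per_month = 2
items_per_visit = 20
rows = customers * visits_per_month * items_per_visit

kb_per_10k_rows = 210            # assumed file size of 10,000 rows x 6 columns
total_kb = rows / 10_000 * kb_per_10k_rows
total_gb = total_kb / 1024**2    # assumption: 1 GB = 1,024 * 1,024 KB

excel_max_rows = 1_048_576       # row limit of one Excel worksheet
worksheets_needed = -(-rows // excel_max_rows)  # ceiling division

print(f"rows: {rows:,}")
print(f"size: {total_gb:.1f} GB")
print(f"Excel worksheets needed: {worksheets_needed:,}")
```

The same month of data would have to be spread over roughly ten thousand worksheets, which is why desktop tools are not an option here.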
How big is Big data?
Very big. Some examples:
• Walmart has a database with more than 2.5 petabytes of data.
• Facebook handles 300 petabytes of data daily.
• Google handles nearly 200 petabytes of data daily.
These numbers are a few years old.
How big is Big data? II
There are 2.5 quintillion bytes of data created each day at our current pace.
Over the last two years alone 90% of the data in the world was generated.
Every minute of the day:
• Snapchat users share 527,760 photos
• More than 120 professionals join LinkedIn
• Users watch 4,146,600 YouTube videos
• 456,000 tweets are sent on Twitter
• Instagram users post 46,740 photos.
Source: Forbes, How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read, 21-05-2018.
Components of Big data
1. Volume (size)
2. Variety: audio files, video files, blogs . . .
3. Velocity: Data is growing day by day at a rapid pace. In some situations, data keeps growing so fast that we do not have enough time to analyze it.
Big data vs normal data
Source: Konasani and Kadre (2015), Table 13-2, p. 512.
Why use big data?
Having more historical data often helps model predictions be more accurate.
What about sampling? ⇒ Sampling may prevent us from finding outliers:
• in business, outliers are the most profitable customers
• for insurance companies, outliers could indicate fraud.
Why use big data? II
E-commerce retailers: Online retailers such as Amazon analyze petabytes of data every day to recommend the right products and improve searches. Big data analytics reveals customer insights such as customer lifetime value, customer loyalty, and product popularity.
Online entertainment industry: Sites such as Netflix use big data analytics to understand customer sentiment. They analyze big data for movie recommendations and even for demand-based ticket pricing.
Online social media industry: Social networking sites such as Facebook analyze petabytes of data for friend recommendations, advertisement management, and so on.
Why use big data? III
Traffic data: The relevant authorities have very rich data on traffic: how many cars are on particular roads, their speeds, the presence of traffic jams, etc. Analysis of such data could help improve public transit systems, traffic light algorithms, etc.
Health data: The public health system has very detailed data on each patient in every hospital. Analysis of such data could help optimize public spending and hopefully improve the quality of the service provided by the public health system.
Analysis methods
Less sophisticated than using normal data, e.g.:
• descriptive statistics
• frequency tables
• two-way tables
• regressions:
  • multiple
  • logit/probit
  • multinomial logit.
Problems with big data
Conventional statistical software cannot handle such large data sets.
Sometimes it’s not even possible to save big data files on one machine.
The solution for big data problems
Supercomputers: Expensive…
Distributed computing: Multiple computers (PCs) are connected via networks (such as LANs), and each PC works on its individual task. Lastly, the results are assembled into one output.
Distributed computing
The principle of divide and conquer.
Instead of considering a huge data set as a single unit, the data is divided into several pieces.
Then these pieces of data are saved on an array or network of computing devices.
Each computer in this network is called a node.
Lastly, one should collate the results from all the individual machines and deliver a consolidated output.
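The divide, distribute, and collate steps above can be sketched on a single machine. This is only an illustration: the function names are hypothetical, the "nodes" are ordinary list slices, and the task (computing a mean) is chosen because partial sums and counts are easy to combine.

```python
def split_into_chunks(data, n_nodes):
    """Divide the data set into n_nodes roughly equal pieces."""
    size = -(-len(data) // n_nodes)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def node_task(chunk):
    """Work done locally on one node: a partial sum and a count."""
    return sum(chunk), len(chunk)

def collate(partials):
    """Combine the partial results from all nodes into one output."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

data = list(range(1, 101))                  # toy data set: 1, 2, ..., 100
chunks = split_into_chunks(data, n_nodes=4)
partials = [node_task(c) for c in chunks]   # on a real cluster these run in parallel
mean = collate(partials)
print(len(chunks), mean)
```

No node ever sees the whole data set; only the small partial results travel across the network.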
In the MapReduce programming model, one divides a big computing task into smaller tasks and collates the intermediate results to generate the ultimate result.
Components:
Map function: a program that divides a global task into smaller ones and finally assigns the pieces to individual machines forming a cluster (divide program).
Reduce function: a program that sums up the results from the individual map functions to generate a final consolidated result.
Map function
The map function is locally executed on all individual chunks of data.
The result of a map function comes in the form of a key-value pair.
These key-value pairs, from different map functions, carry the intermediate results.
Generally any individual map function is similar to the overall task. The only difference is the size, which is considerably smaller for the map function.
The outputs of map functions are used as the inputs of the reduce function.
Reduce function
The reduce function takes the intermediate results from the map functions and creates the final output.
The key-value pairs generated by the map functions are sorted and aggregated in the reduce function.
The reduce function doesn’t act on the individual pieces of data.
In fact, it has no interaction with the original input data, which was fed only to the map functions.
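A minimal sketch of the map and reduce roles described above, in plain Python rather than on a real cluster. The sales rows, item names, and function names are made up for illustration; the map function emits key-value pairs per chunk, and the reduce function sorts and aggregates those pairs without touching the original rows.

```python
from collections import defaultdict

def map_function(chunk):
    """Runs locally on one chunk: emit one (key, value) pair per row."""
    return [(item, quantity) for item, quantity in chunk]

def reduce_function(map_outputs):
    """Sort the intermediate pairs by key and aggregate the values."""
    pairs = sorted(pair for output in map_outputs for pair in output)
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical sales rows (item, quantity), already split into three chunks.
chunks = [
    [("milk", 2), ("bread", 1)],
    [("milk", 1), ("eggs", 12)],
    [("bread", 3), ("milk", 4)],
]

map_outputs = [map_function(chunk) for chunk in chunks]  # one map call per chunk
totals = reduce_function(map_outputs)                    # sees only key-value pairs
print(totals)
```

In a framework such as Hadoop the sorting and grouping of keys is handled between the two phases by the framework itself; here it is folded into the reduce function to keep the sketch short.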
How MapReduce handles big data problems
Source: Konasani and Kadre (2015), p. 516, Figure 13-1.
Hadoop is a simplified platform to write and execute MapReduce and other big data tasks.
The Hadoop framework is an open-source tool, written in Java and typically run on the Linux operating system.
Hadoop has two major components:
• the Hadoop distributed file system (HDFS) and
• MapReduce.
HDFS is the file system in Hadoop.
All big files will be cut into pieces (blocks).
Every block in that system will be 64 MB or less.
If you transfer any file from outside to HDFS, it will be broken into blocks of 64 MB.
In the case of a cluster, HDFS will cut the files into blocks and distribute them to the different nodes within the cluster.
Data blocks are replicated; if one system goes down, the replicated data blocks are used instead.
How HDFS distributes data
Source: Konasani and Kadre (2015), p. 518, Figure 13-2.
How HDFS distributes data II
In the figure (see previous slide), each block has been replicated three times.
By default each block is 64 MB (the default in early Hadoop versions; Hadoop 2 and later use 128 MB), and it is replicated three times on the network.
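The block and replication rules can be made concrete with a small calculation. The sketch below uses the 64 MB block size and replication factor of three from the slides; the 200 MB file, the five-node cluster, and the round-robin placement are assumptions for illustration only and do not reproduce HDFS's actual placement policy.

```python
import math

BLOCK_MB = 64       # block size used in the slides
REPLICATION = 3     # each block is stored three times

def hdfs_footprint(file_mb, block_mb=BLOCK_MB, replication=REPLICATION):
    """Return (blocks, size of last block, stored block copies, total MB stored)."""
    n_blocks = math.ceil(file_mb / block_mb)
    last_block_mb = file_mb - (n_blocks - 1) * block_mb
    return n_blocks, last_block_mb, n_blocks * replication, file_mb * replication

def place_blocks(n_blocks, n_nodes, replication=REPLICATION):
    """Toy round-robin placement: block b goes to nodes b, b+1, b+2 (mod n_nodes)."""
    return {b: [(b + r) % n_nodes for r in range(replication)]
            for b in range(n_blocks)}

# Hypothetical 200 MB file on a hypothetical 5-node cluster.
blocks, last_mb, copies, stored_mb = hdfs_footprint(200)
placement = place_blocks(blocks, n_nodes=5)
print(blocks, last_mb, copies, stored_mb)
print(placement)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.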
HDFS and MapReduce
MapReduce is a parallel processing programming model.
MapReduce and HDFS together process big data almost effortlessly.
HDFS has all the data blocks and related information. MapReduce has all the task-related information.
The Hadoop framework takes care of coordination between the MapReduce code and the HDFS data blocks.
Applications
Suppose you would like to store 1 PB.
Option #1:
• 125 8TB external USB drives
• Suppose each drive costs U$150; thus, U$18,750.
Applications II
Option #2:
• Google’s Coldline storage
• Around U$7,000 per month
• The data is backed up several times and properly encrypted
• The data can be accessed instantly and securely downloaded to anywhere in the world.
Option #3:
• Amazon’s Glacier storage
• Similar to Google’s storage
• Around U$4,000 per month.
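The three storage options can be compared with a few lines of arithmetic. The prices are the ones quoted on the slides; treating 1 PB as 1,000 TB and ignoring drive failures, electricity, and download fees are simplifying assumptions.

```python
# Option 1: one-off purchase of external drives.
PETABYTE_TB = 1000                 # assumption: 1 PB taken as 1,000 TB
DRIVE_TB, DRIVE_COST = 8, 150
drives = PETABYTE_TB // DRIVE_TB
option1_cost = drives * DRIVE_COST

# Options 2 and 3: monthly cloud fees from the slides.
COLDLINE_MONTHLY = 7000
GLACIER_MONTHLY = 4000

def months_until_exceeds(one_off_cost, monthly_fee):
    """First month in which cumulative rent is strictly above the one-off cost."""
    return one_off_cost // monthly_fee + 1

coldline_months = months_until_exceeds(option1_cost, COLDLINE_MONTHLY)
glacier_months = months_until_exceeds(option1_cost, GLACIER_MONTHLY)
print(drives, option1_cost, coldline_months, glacier_months)
```

On price alone the drives pay for themselves within a few months; the cloud options charge for backup, encryption, and instant access from anywhere.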
Applications III
If Option #1 is chosen, computing power might be a problem.
In the case of Google’s BigQuery, it can perform a table scan over an entire petabyte in just under 3.7 minutes.
BigQuery is a specialized cluster rented in time intervals of seconds.
One can simply rent precisely the amount of compute power one requires to process the data.
In that case, for a few minutes, a few thousand CPUs will crunch a petabyte of data.
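To put the 3.7-minute table scan in perspective, the sketch below converts it to a throughput and compares it with a single machine. The single-machine read speed of 200 MB per second is a purely hypothetical figure, and 1 PB is taken as 10^15 bytes.

```python
PETABYTE_BYTES = 10**15            # assumption: decimal petabyte
scan_seconds = 3.7 * 60            # BigQuery scan time quoted above

# Aggregate throughput of the rented cluster, in terabytes per second.
tb_per_second = PETABYTE_BYTES / scan_seconds / 10**12

# Hypothetical single machine reading sequentially at 200 MB/s.
SINGLE_MACHINE_BYTES_PER_SECOND = 200 * 10**6
single_machine_days = PETABYTE_BYTES / SINGLE_MACHINE_BYTES_PER_SECOND / 86_400

print(f"cluster throughput: {tb_per_second:.1f} TB/s")
print(f"single machine: {single_machine_days:.0f} days")
```

Under these assumptions one machine would need about two months just to read the data once, which is the computing-power problem with Option #1.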
US daily stock return data:
• 1926-2018 period
• almost 93 million observations.
Example II
Let’s analyze the relation between stock returns (RET) and market average return (VWRETD).
1. estimate a regression model for the whole sample
2. split the sample into two parts, then estimate regressions separately for each subsample, and lastly, aggregate the results.
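The two approaches can be tried on a small scale. The sketch below is not the SAS analysis from the lecture: it uses synthetic returns generated with made-up parameters, a hand-written one-regressor OLS, and variable names borrowed from RET and VWRETD for readability.

```python
import random

def ols(x, y):
    """One-regressor least squares: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

random.seed(0)
n = 10_000
# Synthetic data (assumed parameters, not the 1926-2018 sample).
vwretd = [random.gauss(0.0004, 0.01) for _ in range(n)]
ret = [0.0005 + 0.8 * m + random.gauss(0, 0.02) for m in vwretd]

# 1. One regression on the whole sample.
full = ols(vwretd, ret)

# 2. Split, estimate separately, then average the estimates.
half = n // 2
first = ols(vwretd[:half], ret[:half])
second = ols(vwretd[half:], ret[half:])
averaged = tuple((a + b) / 2 for a, b in zip(first, second))

print("full sample:", full)
print("averaged   :", averaged)
```

Averaging the subsample coefficients only approximates the full-sample estimate. If each part instead reports its sums of x, y, xy, and x squared, adding those sums reproduces the full-sample coefficients exactly.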
Whole sample results
First subsample results
Second subsample results
Comparison of results
                                      Parameter estimate
Full sample          Intercept        0.00044               100.53
                     VWRETD           0.77044               1800.23
First subsample      Intercept        0.00059               82.5
                     VWRETD           0.70644               998.41
Second subsample     Intercept        0.0003                58.5
                     VWRETD           0.82972               1680.74
Average of first and second subsamples
                     Intercept        0.00045               .
                     VWRETD           0.76808               .
SAS computing time
NOTE: PROCEDURE REG used (Total process time):
real time 7:55.95
cpu time 26.19 seconds
NOTE: PROCEDURE REG used (Total process time):
real time 2:32.88
cpu time 29.31 seconds
NOTE: PROCEDURE REG used (Total process time):
real time 2:08.23
cpu time 31.84 seconds
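The real times in the log can be compared directly once converted to seconds. The sketch assumes that the three log entries correspond, in order, to the whole sample, the first subsample, and the second subsample; the log itself does not label them.

```python
def to_seconds(real_time):
    """Convert a SAS 'm:ss.ss' real-time string to seconds."""
    minutes, seconds = real_time.split(":")
    return int(minutes) * 60 + float(seconds)

# Assumed order: whole sample, first subsample, second subsample.
whole = to_seconds("7:55.95")
parts = [to_seconds("2:32.88"), to_seconds("2:08.23")]

one_after_another = sum(parts)   # both subsamples on the same machine
side_by_side = max(parts)        # each subsample on its own machine

print(f"whole sample      : {whole:.2f} s")
print(f"subsamples in turn: {one_after_another:.2f} s")
print(f"subsamples at once: {side_by_side:.2f} s")
```

Under that assumption, splitting the sample already saves time on one machine, and running the two parts on separate machines would cut the wait to roughly a third of the whole-sample run.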
Big data at a personal level
How can we use big data at a personal level?
A few ways:
• internet search engines: Google, Bing, Baidu
• Google Trends:
  • https://trends.google.com/trends/
  • provides access to a largely unfiltered sample of actual search requests made to Google
  • shows interest in a particular topic from around the globe or down to city-level geography.
• Google Books Ngram Viewer:
  • https://books.google.com/ngrams
  • When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books over the selected years.
Google Trends
Google Trends provides access to aggregated information on the volume of queries for different search terms and how these volumes change over time.
Search query data from Google Trends might reflect the information-gathering process that precedes the trading decisions recorded in the stock market data.
Many academic studies have used Google Trends’ data to see whether it can predict stock returns, trading volume, volatility, and similar issues.
The results are mixed.
Google Trends II
In this paper:
• the authors show how to use search engine data to forecast near-term values of economic indicators
• examples include automobile sales, unemployment claims, travel destination planning and consumer confidence.
Google Trends III
The authors do not claim that Google Trends data can help in predicting the future.
Rather they claim that Google Trends may help in predicting the present.
For example, the volume of queries on car sales during the second week in June may be helpful in predicting the June auto sales report which is released several weeks later in July.
June queries may help to predict July sales, but more research is needed to answer this question.
Queries can be useful leading indicators for subsequent consumer purchases where consumers start planning purchases significantly in advance of their actual purchase decision.
Google Trends IV
They find that investor attention is:
• strongly correlated to trading volume
• a significant determinant of the stock market illiquidity and volatility.
Google Trends V
They find that:
• high Google search volumes lead to negative returns
• a trading strategy based on selling stocks with high Google search volumes and buying stocks with infrequent Google searches is profitable before transaction costs, but not once transaction costs are taken into account.
Cloud computing
Suppose you have access to big data but do not have enough computing power to analyze it.
One can rent Amazon.com’s servers:
• https://aws.amazon.com/ec2/pricing/on-demand/
• many options (in terms of RAM, CPUs, storage)
• U$0.0042 to U$10.848 per hour.
In 2011, hackers rented Amazon.com’s servers to conduct a cyber attack against Sony’s PlayStation Network:
• the second-largest online data breach in U.S. history
• compromised more than 100 million customer accounts
• full story: https://www.seattletimes.com/business/playstation-security-breach-shows-amazons-cloud-appeal-for-hackers/.
Required reading
Konasani, V. R. and Kadre, S. (2015). Practical Business Analytics Using SAS: A Hands-on Guide: chapter 13.