www.apress.com
Practical Business Analytics Using SAS
Practical Business Analytics Using SAS: A Hands-on Guide shows SAS users and
businesspeople how to analyze data effectively in real-life business scenarios.
The book begins with an introduction to analytics, analytical tools, and SAS
programming. The authors—both SAS, statistics, analytics, and big data experts—first
show how SAS is used in business, and then how to get started programming in SAS
by importing data and learning how to manipulate it. In addition to learning basic SAS
functions, you will see how each function can be used to get the information you need
to improve business performance. Each chapter offers hands-on exercises drawn from
real business situations.
The book then provides an overview of statistics, as well as instruction on exploring
data, preparing it for analysis, and testing hypotheses. You will learn how to use
SAS to perform analytics and model using both basic and advanced techniques like
multiple regression, logistic regression, and time series analysis, among other topics.
The book concludes with a chapter on analyzing big data. Illustrations from banking
and other industries make the principles and methods come to life.
Readers will find just enough theory to understand the practical examples and
case studies, which cover all industries. Written for a corporate IT and programming
audience that wants to upgrade skills or enter the analytics field, this book includes:
• More than 200 examples and exercises, including code and
datasets for practice.
• Relevant examples for all industries.
• Case studies that show how to use SAS analytics to identify
opportunities, solve complicated problems, and chart a course.
Practical Business Analytics Using SAS: A Hands-on Guide gives you the tools you
need to gain insight into the data at your fingertips, predict business conditions for
better planning, and make excellent decisions. Whether you are in retail, finance,
healthcare, manufacturing, government, or any other industry, this book will help
your organization increase revenue, drive down costs, improve marketing, and satisfy
customers better than ever before.
ISBN 978-1-4842-0044-5
Contents at a Glance
About the Authors
Acknowledgments
Preface
Part 1: Basics of SAS Programming for Analytics
Chapter 1: Introduction to Business Analytics and Data Analysis Tools
Chapter 2: SAS Introduction
Chapter 3: Data Handling Using SAS
Chapter 4: Important SAS Functions and Procs
Part 2: Using SAS for Business Analytics
Chapter 5: Introduction to Statistical Analysis
Chapter 6: Basic Descriptive Statistics and Reporting in SAS
Chapter 7: Data Exploration, Validation, and Data Sanitization
Chapter 8: Testing of Hypothesis
Chapter 9: Correlation and Linear Regression
Chapter 10: Multiple Regression Analysis
Chapter 11: Logistic Regression
Chapter 12: Time-Series Analysis and Forecasting
Chapter 13: Introducing Big Data Analytics
Index
Part 1
Basics of SAS Programming
for Analytics
Chapter 1
Introduction to Business Analytics
and Data Analysis Tools
There is an ever-increasing need for advanced information and decision support systems in today’s fierce
global competitive environment. The profitability and the overall business can be managed better with
access to predictive tools—to predict, even approximately, the market prices of raw materials used in
production, for instance. Business analytics involves, among other things, quantitative techniques, statistics,
information technology (IT), data and analysis tools, and econometric models. It can positively push
business performance beyond executive experience or plain intuition.
Business analytics (or advanced analytics for that matter) can include nonfinancial variables as well,
instead of traditional parameters that may be based only on financial performance. Business analytics can
effectively help businesses, for example, in detecting credit card fraud, identifying potential customers,
analyzing or predicting profitability per customer, helping telecom companies launch the most profitable
mobile phone plans, and floating insurance policies that can be targeted to a designated segment of customers.
In fact, advanced analytical techniques are already being used effectively in all these fields and many more.
This chapter covers the basics that are required to comprehend all the analytical techniques used in
this book.
Business Analytics, the Science of Data-Driven Decision
Making
Many analytical techniques are data intensive and require business decision makers to have an
understanding of statistical and various other analytical tools. These techniques invariably require some
level of IT and database knowledge. Organizations using business analytics techniques in decision making
also need to develop and implement a data-driven approach in their day-to-day operations, planning, and
strategy making. In a large number of cases, in fact, businesses have no other choice but to implement
a data-driven decision-making approach because of fierce competition and cost-cutting pressures. This
makes business analytics a lucrative and rewarding career choice. This may be the right time for you to enter
this field because the business analytics culture is still in its nascent stage in most organizations around the
world and is on the verge of exploding with respect to growing opportunities.
Business Analytics Defined
Business analytics is all about the data, methodologies, IT, applications, and mathematical and statistical
techniques and skills required to get new business insights and understand business performance. It uses iterative and
methodical exploration of past data to support business decisions.
Business analytics aims to increase profitability, reduce warranty expenditures, acquire new customers,
retain customers, upsell or cross-sell, monitor the supply chain, improve operations, or simply reduce the
response time to customer complaints, among others. The applications of business analytics are numerous
and across industry verticals, including manufacturing, finance, telecom, and retail. The global banking and
financial industry traditionally has been one of the most active users of analytics techniques. The typical
applications in the finance vertical are detecting credit card fraud, identifying loan defaulters, acquiring
new customers, identifying responders to e-mail campaigns, predicting relationship value or profitability of
customers, and designing new financial and insurance products. All these processes use a huge amount of
data and fairly involved statistical calculations and interpretations.
Any application of business analytics involves a considerable amount of effort in defining the problem
and the methodology to solve it, data collection, data cleansing, model building, model validation, and the
interpretation of results. It is an iterative process, and the models might need to be built several times before
they are finally accepted. Even an established model needs to be revisited/rebuilt periodically for changes in
the input data or changes in the business conditions (and assumptions) that were used in the original model
building.
Any meaningful decision support system that uses data analytics thus requires development and
implementation of a strong data-driven culture within the organization and all the external entities that
support it.
Let’s take an example of a popular retail web site that aims to promote an upmarket product. To do that,
the retail web site wants to know which segment of customers it needs to target to maximize product sales
with minimum promotional dollars. To do this, the web site needs to collect and analyze customer data.
The web site may also want to know how many customers visited it and at what time; their gender, income
bracket, and demographic data; which sections of the web site they visited and in what frequency; their
buying and surfing patterns; the web browser they used; the search strings they used to get into the web site;
and other such information.
If analyzed properly, this data presents an enormous opportunity to garner useful business insights
about customers, thereby providing a chance to cut promotional costs and improve overall sales. Business
analytics techniques are capable of working with multiple, varied data sources to build the models
that can derive rich business insights that were not possible before. This derived rich fact base can be used
to improve customer experiences, streamline operations, and thereby improve overall profitability. In
the previous example, it is possible, by applying business analytics techniques, to target the product to a
segment of customers who are most likely to buy it, thereby minimizing the promotional costs.
Conventional business performance parameters are based mainly on finance-based indicators such
as top-line revenue and bottom-line profit. But there is more to the performance of a company than just
financial parameters. Measures such as operational efficiency, employee motivation, average employee
salary, working conditions, and so on, may be equally important. Hence, the number of parameters that are
used to measure or predict the performance of a company increases. These additional parameters
increase the amount of data and the complexity of analyzing it. This is just one example. The sheer volume
of data and number of variables that need to be handled in order to analyze consumer behavior on a social
media web site, for instance, is immense. In such a situation, conventional wisdom and reporting tools may
fail. Advanced analytics predictive modeling techniques help in such instances.
The subsequent chapters in this book will deal with data analytics. Statistical and quantitative
techniques used in advanced analytics, along with IT, provide business insights while handling a vast
amount of data that was not possible until a few years ago. Today’s powerful computing machines and
software (such as SAS) take care of the laborious task of coding analytics algorithms and free the analyst
to work on the important tasks of interpreting the results and applying them to gain business insights.
Is Advanced Analytics the Solution for You?
Anyone who is in a competitive business environment and faces challenges such as the following, or almost
any problem for which data is available, might be a potential candidate for applying advanced analytics
techniques:
• Consumer buying pattern analysis
• Improving overall customer satisfaction
• Predicting the lead times in supply chain
• Warranty cost optimization
• Right sizing or the optimization of the sales force
• Price and promotion modeling
• Predicting customer response to sales promotion
• Credit risk analysis
• Fraud identification
• Identifying potential loan defaulters
• Drug discovery
• Clinical data analysis
• Web site analytics
• Text analysis (for instance on Twitter)
• Social media analytics
• Identifying genes responsible for a particular disease
As discussed, business analytics is a culture that needs to be developed, implemented, and finally
integrated as a way of life in any organization with regard to decision making. Many organizations around
the world have already experienced and realized the potential of this culture and are successfully optimizing
their resources by applying these techniques.
EXAMPLE
Trade-offs such as sales volumes versus price points and the costs of carrying inventory versus
the chances of stock not being available on demand are always part of day-to-day decision making for
managers. Many of these business decisions are highly subjective or based on available data that is not
that relevant.
In one such example, a company’s analysis found that the driving force of customer sentiment on
key social media sites was not its TV commercials but the interaction with the company’s call centers.
The quality of service provided by the company, and the quality of call center interaction, were largely
affecting the brand impact. Based on these insights, the company decided to divert part of the spending
on TV commercials toward improving the call center satisfaction levels. The results were clearly visible;
customer satisfaction surveys improved considerably, and there was a significant increase in customer
base and revenues.
Simulation, Modeling, and Optimization
This section (and the chapter, by and large) explains the terminology and basics to build a background for
the coming chapters, which will be more focused and technical in nature.
Simulation
There are various types of simulations. In the context of analytics, computer simulation, an oft-used term,
is the most relevant. Some real-world systems or scenarios might be complex and difficult to comprehend or
predict. Predicting a snowstorm and predicting stock prices are classic examples. They depend upon several
variables or factors, which are themselves practically impossible to predict. Daily stock prices, for instance, may be
affected by current political conditions, major events during the day, international business environment,
dollar prices, or simply the overall mood in the market.
There can be various levels of simulations, from simple programs that are a few hundred lines of code
to complex ones that run to millions of lines of code. Computer simulations used in atmospheric sciences are
another classic example where complex computer systems and software are used for weather prediction and
forecasting.
Computer simulations use various statistical models in analytics.
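As a minimal sketch of what a computer simulation looks like in code, the following Python fragment simulates many possible paths of a daily stock price and averages the final prices. The starting price, the daily drift and volatility, the normal distribution of daily returns, and the function names are all assumptions made for illustration; they are not taken from this book.

```python
import random


def simulate_final_price(start_price, days, daily_drift, daily_volatility, rng):
    """Simulate one price path and return the price on the last day."""
    price = start_price
    for _ in range(days):
        # Assumption: each daily return is drawn from a normal distribution.
        daily_return = rng.gauss(daily_drift, daily_volatility)
        price *= 1 + daily_return
    return price


def average_final_price(runs, start_price=100.0, days=30,
                        daily_drift=0.0005, daily_volatility=0.02, seed=42):
    """Repeat the simulation many times and average the outcomes."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    finals = [
        simulate_final_price(start_price, days, daily_drift,
                             daily_volatility, rng)
        for _ in range(runs)
    ]
    return sum(finals) / len(finals)


print(round(average_final_price(runs=1000), 2))
```

Increasing the number of runs gives a more stable average, at the cost of more computation.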
Modeling
A model is merely the mathematical logic and concepts that go into a computer program. These models,
along with the associated data, represent the real-world systems. These models can be used to study the
effect of different components and predict system behavior. As discussed earlier, the accuracy with which
a model represents a real-world system may vary and depends upon the business needs and resources
available. For instance, 90 percent accuracy in prediction might be acceptable in banking applications
such as the identification of loan defaulters, but in systems that involve human life (for instance, reliability
models in aerospace applications), accuracy of 100 percent, or as close to it as possible, is desired.
Optimization
Optimization is a term related to computer simulations. The sole objective of some computer simulations
may be simply to ensure optimization, which in simple terms can be explained as minimization or
maximization of a mathematical function, subject to a given set of constraints. In optimization problems, a
set of variables might need to be selected from a range of available alternatives to minimize or maximize a
mathematical function while working with constraints. Although optimization is discussed here in its most
simple form, there is much more to it.
An instance of a simple optimization problem is maximizing the working time of a machine, while
keeping the maintenance costs below a certain level. If enough data is available, this kind of problem may
be solved using advanced analytics techniques. Another instance of a practical optimization problem is
chemical process factories, where an engineer may need to adjust a given set of process parameters in order
to get maximum output of a chemical reaction plant, while also keeping the costs within budget. Advanced
analytical techniques can be an alternative here as well.
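The machine example can be sketched in a few lines of Python. The cost model below (a fixed cost plus a term that grows with the square of daily usage) and the cost limit are hypothetical; the point is only to show a function being maximized over a range of alternatives while a constraint is respected.

```python
def maintenance_cost(hours_per_day):
    """Hypothetical cost model: wear grows with the square of daily usage."""
    return 50 + 2 * hours_per_day ** 2


def maximize_working_time(candidates, cost_limit):
    """Return the largest candidate whose maintenance cost stays within
    the limit, or None when no candidate satisfies the constraint."""
    feasible = [h for h in candidates if maintenance_cost(h) <= cost_limit]
    return max(feasible) if feasible else None


best = maximize_working_time(range(0, 25), cost_limit=500)
print(best, maintenance_cost(best))  # prints: 15 500
```

Checking every alternative is practical only for small problems; larger ones call for dedicated optimization methods.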
Data Warehousing and Data Mining
Creating a data warehouse can be considered one of the most important basics. It can give a jump start to
any business analytics project. Consider an example of a multilocational business organization with sales
offices and manufacturing plants spread across the country. Today, in almost all large establishments,
some amount of business process automation using homegrown or packaged applications such as SAP is
expected. Some processes can be local, and their transaction data might be maintained at the branch level. It
may not be possible to provide the head office with quarterly sales reports across the products and locations,
unless all the relevant data is readily available to the reporting engine. This task is easier if the company links
its branch-level data sources and makes them available in a central database. The data may need cleansing
and transformations before being loaded in the central database. This is done in order to make the raw data
more meaningful for further analysis and reporting. This central database is often called a data warehouse.
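The idea of cleansing branch-level data and loading it into one central store can be sketched as follows. The branch names, field names, and cleansing rules are hypothetical and chosen only for illustration; a real data warehouse would sit in a database rather than in a Python list.

```python
from collections import defaultdict


def cleanse(record, branch):
    """Normalize one raw branch record; return None if it is unusable."""
    amount = record.get("amount")
    if amount is None:
        return None  # drop rows with a missing sales amount
    return {
        "branch": branch,
        "product": record["product"].strip().lower(),  # unify product names
        "amount": float(amount),                       # unify numeric type
    }


def load_warehouse(branch_sources):
    """Merge cleansed records from every branch into one central list."""
    warehouse = []
    for branch, records in branch_sources.items():
        for record in records:
            clean = cleanse(record, branch)
            if clean is not None:
                warehouse.append(clean)
    return warehouse


def sales_by_product(warehouse):
    """Report total sales per product across all branches."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[row["product"]] += row["amount"]
    return dict(totals)


branch_sources = {
    "north": [{"product": " Widget ", "amount": "120.50"},
              {"product": "Gadget", "amount": None}],
    "south": [{"product": "widget", "amount": 80},
              {"product": "GADGET", "amount": "40"}],
}
warehouse = load_warehouse(branch_sources)
print(sales_by_product(warehouse))
```

Once the branch data is in a common shape, a cross-branch report becomes a simple aggregation.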
The previous instance was just one example of a data warehouse. We live in a data age. Terabytes and
petabytes of data are being continuously generated by the World Wide Web, sales transactions, product
description literatures, hospital records, population surveys, remote-sensing data by satellites, engineering
analysis results, multimedia and videos, and voice and data communications networks. The list is endless.
The sources of data in a data warehouse can be multiple and heterogeneous. Interesting patterns and useful
knowledge can be discovered by analyzing this vast amount of available data. This process of knowledge
discovery is termed data mining (Figure 1-1). The sources of data for a data mining project may be multiple,
such as a single large company-wide data warehouse or a combination of data warehouses, flat files,
Internet, commercial information repositories, social media web sites, and several other such sources.
Figure 1-1. Data mining
What Can Be Discovered Using Data Mining?
There are a few defined types of pattern discoveries in data mining. Consider the familiar example of the
bank and credit card. Bank managers are sometimes interested only in summaries of a few general features
in a target class. For instance, a bank manager might be interested in credit card defaulters who regularly
miss payment deadlines by 90 days or more. This kind of abstraction is called characterization.
In the same credit card example, the bank manager might want to compare the features of clients who
pay on time versus clients who regularly default beyond 90 days. This is a comparative study, termed
discrimination, between two target groups.
In yet another type of abstraction called association analysis, the same bank manager might be
interested in knowing how many new credit card customers also took personal loans. The bank may also
be interested in building a model that can be used as a support tool to accept or reject the new credit card
applications. For this purpose, the bank might want to classify the clients as “very safe,” “medium risky,”
and “highly risky,” as one of the steps. It might be done after a thorough analysis of a large number of client
attributes.
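A simple form of association analysis can be sketched as a count and a proportion. The customer records and field names below are hypothetical; the function reports how many customers with the first attribute also have the second, and what share of the first group that is.

```python
def association(customers, antecedent, consequent):
    """Return (count, share) of customers having `antecedent` who also
    have `consequent`. The share is 0.0 when nobody has `antecedent`."""
    with_antecedent = [c for c in customers if c[antecedent]]
    both = [c for c in with_antecedent if c[consequent]]
    share = len(both) / len(with_antecedent) if with_antecedent else 0.0
    return len(both), share


customers = [
    {"new_card": True, "personal_loan": True},
    {"new_card": True, "personal_loan": False},
    {"new_card": True, "personal_loan": True},
    {"new_card": False, "personal_loan": True},
    {"new_card": True, "personal_loan": False},
]
print(association(customers, "new_card", "personal_loan"))  # prints: (2, 0.5)
```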
The bank might also be interested in predicting which customers can be potential loan defaulters, again
based on an established model, which consumes a large number of attributes pertaining to its clients. This is
predictive analysis.
To open new branches or ATMs, the bank might be interested in knowing the concentration of customers
by geographical location. This abstraction, called clustering, is similar to classification, but the names of
classes and subclasses are not known as the analysis is begun. The class names (the geography names
with sizable concentration) are known only after the analysis is complete. While doing a cluster analysis or
classification on a given client attribute, there may be some values that do not fit in with any class or cluster.
These exclusions or surprises are outliers.
Outlier values might not be allowed in some model-building activities because they tend to bias the
result in a particular direction, which may not be a true interpretation of the given data set. Such outliers
are common while dealing with dollar values in data sets. Deviation analysis deals with finding the differences
between the expected and actual values.
For example, it might be interesting to know the deviation with which a model predicts the credit
card loan defaulters. Such an analysis is possible when both the model-predicted values and actual values
are present. It is, in fact, periodically done to ascertain the effectiveness of models. If the deviation is not
acceptable, it might warrant the rebuilding of the model.
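One possible way to quantify such a deviation is the mean absolute difference between predicted and actual values. The sketch below uses that measure with made-up predictions and a made-up tolerance; both the measure and the numbers are assumptions for illustration, not a method prescribed by this book.

```python
def mean_absolute_deviation(predicted, actual):
    """Average absolute gap between model predictions and actual values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    gaps = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(gaps) / len(gaps)


def needs_rebuild(predicted, actual, tolerance):
    """Flag the model for rebuilding when the deviation is too large."""
    return mean_absolute_deviation(predicted, actual) > tolerance


# Hypothetical default probabilities from a model, and what really happened
# (1 = the customer defaulted, 0 = the customer did not).
predicted = [0.9, 0.2, 0.1, 0.8]
actual = [1, 0, 0, 0]

print(needs_rebuild(predicted, actual, tolerance=0.25))
```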
Deviation analysis also attempts to find the causes of observed deviations between predicted and actual
values. This is by no means a complete list of patterns that can be discovered using data mining techniques.
The scope is much wider.
Business Intelligence, Reporting, and Business Analytics
Business intelligence (BI) and business analytics are two different but interconnected techniques. As
reported in one of SAS’s blogs, a majority of business intelligence systems aim at providing comprehensive
reporting capabilities and dashboards to the target group of users. While business analytics tools can do
reporting and dashboards, they can also do statistical analysis to provide forecasting, regression, and
modeling. SAS business analytics equips users with everything needed for data-driven decision making,
which includes information and data management and statistical and presentation tools.
Analytics Techniques Used in the Industry
The previous few sections introduced the uses of data mining or business analytics. This section will
examine the terminology in detail. Only the frequently used terms in the industry are discussed here.
Then the chapter will introduce and give examples of many of these analytics techniques and
applications. Some of the more frequently used techniques will be covered in detail in later chapters.
Regression Modeling and Analysis
To understand regression and predictive modeling, consider the same example of a bank trying to
aggressively increase its customer base for some of its credit card offerings. The credit card manager wants
to attract new customers who will not default on credit card loans. The bank manager might want to build a
model from a similar set of past customer data that resembles the set of target customers closely. This model
will be used to assign a credit score to the new customers, which in turn will be used to decide whether
to issue a credit card to a potential customer. There might be several other considerations aside from the
calculated credit score before a final decision is made to allocate the card.
The bank manager might want to view and analyze several variables related to each of the potential
clients in order to calculate their credit score, which is dependent on variables such as the customer’s age,
income group, profession, number of existing loans, and so on. The credit score here is a dependent variable,
and other customer variables are independent variables. With the help of past customer data and a set of
suitable statistical techniques, the manager will attempt to build a model that will establish a relationship
between the dependent variable (the credit score in this case) and a lot of independent variables about
the customers, such as monthly income, house and car ownership status, education, current loans already
taken, information on existing credit cards, credit score and the past loan default history from the federal
data bureaus, and so on. There may be up to 500 such independent variables that are collected from a variety
of sources, such as credit card application, federal data, and customers’ data and credit history available with
the bank. All such variables might not be useful in building the model. The number of independent variables
can be reduced to a more manageable number, for instance 50 or less, by applying some empirical and
scientific techniques. Once the relationship between independent and dependent variables is established
using available data, the model needs validation on a different but similar set of customer data. Then it
can be used to predict the credit scores of the potential customers. A prediction accuracy of 90 percent
to 95 percent may be considered good in banking and financial applications; an accuracy of at least 75 percent is
a must. This kind of predictive model needs periodic validation and may need to be rebuilt. It is mandatory in some
financial institutions to revalidate the model at least once a year with renewed conditions and data.
In recent times, revenues for new movies depend largely on the buzz created by that movie on social media
in its first weekend of release. In an experiment, data for 37 movies was collected. The data was the number of
tweets on a movie and the corresponding tickets sold. The graph in Figure 1-2 shows the number of tweets on
the x-axis and number of tickets sold on the y-axis for a particular movie. The question to be answered was,
If a new movie gets 50,000 tweets (for instance), how many tickets are expected to be sold in the first week?
A regression model was used to predict the number of tickets (y) based on number of tweets (x) (Figure 1-3).
Figure 1-2. Number of Tickets Sold vs. Number of Tweets—a data collection for sample movies
Figure 1-3. The regression model for Number of Tickets Sold vs. Number of Tweets—prediction using
regression model
Figure 1-4. A time series forecast showing the seasonality of demand
Using the previous regression predictive model equation, the number of tickets was estimated to be
5,271 for a movie that had 50,000 tweets in the first week of release.
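The mechanics of such a prediction can be sketched with a least-squares straight line. The data points below are made up for illustration (they are not the 37 movies from the experiment), so the predicted value will not match the 5,271 figure quoted above; the function names are likewise hypothetical.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x_new):
    """Apply the fitted line to a new x value."""
    return intercept + slope * x_new


# Hypothetical (tweets, tickets sold) pairs for four past movies.
tweets = [10000, 20000, 30000, 40000]
tickets = [1100, 2100, 3100, 4100]

slope, intercept = fit_line(tweets, tickets)
print(round(predict(slope, intercept, 50000)))
```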
Time Series Forecasting
Time series forecasting is a simple form of forecasting technique, wherein some data points are available over
regular time intervals of days, weeks, or months. If some patterns can be identified in the historical data, it
is possible to project those patterns into the future as a forecast. Sales forecasting is a popular usage of time
series forecasting. In Figure 1-3, a straight line shows the trend from the past data. This straight line can easily
be extended into a few more time periods to have fairly accurate forecasts. In addition to trends, time series
forecasts can also show seasonality, which is simply a repeat pattern that is observed within a year or less
(such as more sales of gift items on occasions such as Christmas or Valentine’s Day). Figure 1-4 shows an
actual sales forecast, the trend, and the seasonality of demand.
Figure 1-4 shows the average monthly sales of an apparel showroom for three years. There is a stock
clearance sale every four months, with huge discounts on all brands. The peak in every fourth month is
apparent from the figure.
Time series analysis can also be used in the bank and credit card example to forecast losses or profits
in the future, given the same data for a historical period of 24 months, for instance. Time series forecasts are also
used in weather and stock market analysis.
Other examples of time series data include representations of yearly sales volumes, mortgage interest
rate variations over time, and data representations in statistical quality control measurements such as
accuracy of an industrial lathe machine for a period of one month. In these representations, the time
component is taken on the x-axis, and the variable, like sales volume, is on the y-axis. Some of these trends
may follow a steady straight-line increase or a decline over a period of time. Others may be cyclic or random
in nature. While applying time series forecasting techniques, it is usually assumed that the past trend will
continue for a reasonable time in the future. This future forecasting of the trend may be useful in many
business situations, such as stocks procurement, planning of cash flow, and so on.
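Extending a trend and a repeating seasonal pattern into the future can be sketched as follows. This is a deliberately naive approach, assumed here for illustration only: it fits a straight-line trend, averages what is left over for each position in the seasonal cycle, and adds the two together. The sales figures are made up, with a peak in every fourth month.

```python
def fit_trend(values):
    """Least-squares straight line through (0, v0), (1, v1), ..."""
    n = len(values)
    times = range(n)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t


def forecast(values, period, horizon):
    """Forecast `horizon` future points as trend plus seasonal effect.
    Assumes at least `period` observations are available."""
    slope, intercept = fit_trend(values)
    # Average leftover (actual minus trend) for each position in the cycle.
    leftovers = [[] for _ in range(period)]
    for t, y in enumerate(values):
        leftovers[t % period].append(y - (intercept + slope * t))
    seasonal = [sum(group) / len(group) for group in leftovers]
    start = len(values)
    return [intercept + slope * t + seasonal[t % period]
            for t in range(start, start + horizon)]


# Hypothetical monthly sales: steady growth plus a peak every fourth month.
monthly_sales = [100 + 2 * t + (30 if t % 4 == 3 else 0) for t in range(12)]
next_four = forecast(monthly_sales, period=4, horizon=4)
print([round(v, 1) for v in next_four])
```

The forecast repeats the fourth-month peak because the seasonal averages carry it forward, which is the assumption that past patterns continue.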
Conjoint Analysis
Conjoint analysis is a statistical technique, mostly used in market research, to determine what product (or
service), features, or pricing would be attractive to most of the customers in order to affect their buying
decision positively.
In conjoint studies, target responders are shown a product with different features and pricing levels.
Their preferences, likes, and dislikes are recorded for the alternative product profiles. Researchers then apply
statistical techniques to determine the contribution of each of these product features to overall likeability
or a potential buying decision. Based on these studies, a marketing model can be made that can estimate
the profitability, market share, and potential revenue that can be realized from different product designs,
pricing, or their combinations.
It is an established fact that some mobile phones sell more because of their ease of use and other user-
friendly features. While designing the user interface of a new phone, for example, a set of target users is
shown a carefully controlled set of different phone models, each having some different and unique feature
yet very close to each other in terms of the overall functionality. Each user interface may have a different
set of background colors; the placement of commonly used functions may also be different for each phone.
Some phones might also offer unique features such as dual SIM. The responders are then asked to rate the
models and the controlled set of functionalities available in each variation. Based on a conjoint analysis of
this data, it may be possible to decide which features will be received well in the marketplace. The analysis
may also help determine the price points of the new model in various markets across the globe.
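The contribution of each feature can be sketched in a simplified way: for every feature level, take the average rating of the profiles that include it and subtract the overall average rating. This shortcut is an assumption that is reasonable only for a balanced design in which every combination of levels is shown equally often; real conjoint studies typically rely on regression-based estimation. The phone profiles and ratings below are hypothetical.

```python
from collections import defaultdict


def part_worths(profiles, ratings):
    """Estimate each feature level's contribution as its mean rating
    minus the overall mean rating (valid for balanced designs only)."""
    grand_mean = sum(ratings) / len(ratings)
    by_level = defaultdict(list)
    for profile, rating in zip(profiles, ratings):
        for feature, level in profile.items():
            by_level[(feature, level)].append(rating)
    return {key: sum(vals) / len(vals) - grand_mean
            for key, vals in by_level.items()}


# Hypothetical phone profiles shown to responders, with average ratings.
profiles = [
    {"background": "blue", "dual_sim": "yes"},
    {"background": "blue", "dual_sim": "no"},
    {"background": "white", "dual_sim": "yes"},
    {"background": "white", "dual_sim": "no"},
]
ratings = [9, 6, 7, 4]

worths = part_worths(profiles, ratings)
for key, value in sorted(worths.items()):
    print(key, value)
```

In this made-up data, dual SIM moves the rating more than the background color does, which is the kind of comparison a conjoint study is meant to support.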
Cluster Analysis
The intent of any cluster analysis exercise is to split the existing data or observations into similar and discrete
groups. In classification problems, each observation is assigned to one of a set of predefined groups, while in cluster analysis, the
aim is to determine the number and composition of groups that may exist in a given data or observation set.
For example, the customers could be grouped into some distinct groups in order to target them with
different pricing strategies and customized products and services. These distinct customer groups (Figure 1-5)
may include frequent customers, occasional customers, high net worth customers, and so on. The number
of such groups is unknown when beginning the analysis but is determined as a result of analysis.
Chapter 1 ■ Introduction to Business Analytics and Data Analysis Tools
Figure 1-5. A cluster analysis plot
The graph in Figure 1-6 shows the debt to income ratio versus age. Customer segments that are similar
in nature can be identified using cluster analysis.
Figure 1-6. Debt to Income Ratio vs. Age
The debt to income ratio in Figure 1-6 is low for the 20-to-30 age group, while the 30-to-45 age group
has a higher debt to income ratio. The groups identified in the figure need to be treated differently instead
of as one single population, depending on the business objective.
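The grouping idea can be sketched with a small k-means routine in Python. The (age, debt to income ratio) observations below are hypothetical, and the choice of three clusters with hand-picked starting centroids is an assumption made only for illustration; in a real analysis the number of groups is itself an outcome of the study, and the variables would normally be standardized first, because age dominates the distance here.

```python
import math

# Hypothetical (age, debt-to-income ratio) observations.
customers = [(22, 0.10), (25, 0.12), (28, 0.15),
             (33, 0.45), (38, 0.50), (42, 0.48),
             (55, 0.25), (60, 0.22), (63, 0.20)]

def kmeans(points, centroids, iterations=20):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

# Assumed starting centroids: the first, middle, and last observations.
start = [customers[0], customers[4], customers[8]]
centroids, clusters = kmeans(customers, start)
for center, members in zip(centroids, clusters):
    print(center, members)
```

Each resulting cluster can then be profiled and treated as a separate customer segment.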
Segmentation
Segmentation is similar to classification in that the criteria for dividing observations into distinct groups need
to be found. In segmentation, the number of groups may be apparent even at the beginning of the analysis, while the
aim of cluster analysis is to identify areas where observations are concentrated differently from other groups.
Hence, clustering discovers the boundaries between different groups, while segmentation uses boundaries or
some distinct criterion to form the groups.
Clustering is about dividing the population into different groups based on all the factors available.
Segmentation is also dividing the population into different groups but based on predefined criteria such as
maximizing the profit variable, minimizing the defects, and so on. Segmentation is widely used in marketing
to create the right campaign for the customer segment that yields maximum leads.
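To contrast with clustering, the following Python sketch forms segments from predefined criteria rather than discovering them from the data. The customer records, the thresholds, and the segment names are all hypothetical.

```python
# Hypothetical customer records.
customers = [
    {"id": 1, "annual_profit": 1500, "visits_per_month": 2},
    {"id": 2, "annual_profit": 300,  "visits_per_month": 6},
    {"id": 3, "annual_profit": 120,  "visits_per_month": 1},
    {"id": 4, "annual_profit": 2200, "visits_per_month": 9},
]

# Assumed business rules; in practice these come from the business objective.
PROFIT_THRESHOLD = 1000
VISIT_THRESHOLD = 4

def assign_segment(customer):
    """Apply the predefined criteria in priority order."""
    if customer["annual_profit"] >= PROFIT_THRESHOLD:
        return "high value"
    if customer["visits_per_month"] >= VISIT_THRESHOLD:
        return "frequent"
    return "occasional"

segments = {}
for c in customers:
    segments.setdefault(assign_segment(c), []).append(c["id"])
print(segments)
```

Here the boundaries are supplied up front, so the number of segments is known before any data is examined.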
Principal Components and Factor Analysis
These statistical methodologies are used to reduce the number of variables or dimensions in a model-building
exercise. The variables in question are usually independent variables. Principal component analysis is a method
of combining a large number of variables into a small number of components, while factor analysis is a
methodology used to determine the structure or underlying relationship by calculating the hidden factors
that determine the variable relationships.
Some analysis studies may start with a large number of variables, but because of practical constraints
such as data handling, data collection time, budgets, computing resources available, and so on, it may be
necessary to drastically reduce the number of variables that will appear on the final data model. Only those
independent variables that make most sense to the business need to be retained.
There might also be interdependency between some variables. For example, income levels of
individuals in a typical analysis might be closely related to the monthly dollars they spend. The more the
income, the more the monthly spend. In such a case, it is better to keep only one variable for the analysis
and remove the monthly spend from the final analysis.
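The interdependency check described above can be sketched in Python with a plain Pearson correlation. This is a simple correlation screen, not principal component analysis itself; the income and spend figures and the 0.9 cutoff are hypothetical.

```python
import math

# Hypothetical monthly income and monthly spend for five individuals.
income = [3000, 4000, 5000, 6000, 7000]
monthly_spend = [1500, 2100, 2400, 3100, 3400]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(income, monthly_spend)

# Assumed cutoff: above it, the two variables are treated as carrying the same information.
CUTOFF = 0.9
kept = ["income"] if abs(r) > CUTOFF else ["income", "monthly_spend"]
print(round(r, 3), kept)
```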
The regression modeling section discussed using 500 variables as a starting point to determine the credit
score of potential customers. Principal component analysis can be one of the methods used to reduce the number
of variables to a manageable level (40 variables, for example) for the final data model.
Correspondence Analysis
Correspondence analysis is similar to principal component analysis but applies to nonquantitative or
categorical data such as gender, status of pass or fail, color of objects, and field of specialization. It especially
applies to cross-tabulation. Correspondence analysis provides a way to graphically represent the structure of
cross-tabulations with each row and column represented as a point.
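Correspondence analysis starts from a cross-tabulation. The Python sketch below covers only that first step: it builds a small contingency table and the row profiles (each row's proportions across the columns) from hypothetical gender and pass/fail records. The decomposition that produces the plotted points is not shown.

```python
from collections import Counter

# Hypothetical (gender, result) records.
records = [("female", "pass"), ("female", "pass"), ("female", "fail"),
           ("male", "pass"), ("male", "fail"), ("male", "fail")]

table = Counter(records)                    # cell counts of the cross-tabulation
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Row profile: share of each column category within a row.
row_profiles = {
    r: {c: table[(r, c)] / sum(table[(r, k)] for k in cols) for c in cols}
    for r in rows
}

for r in rows:
    print(r, [table[(r, c)] for c in cols], row_profiles[r])
```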
Survival Analytics
Survival analytics is typically used when variables such as time of death, duration of a hospital stay, and
time to complete a doctoral thesis need to be predicted. It deals with time-to-event data. For a
more detailed treatment of this topic, please refer to
www.amstat.org/chapters/northeasternillinois/pastevents/presentations/summer05_Ibrahim_J.pdf.
Some Practical Applications of Business Analytics
The following sections discuss a few examples of the practical application of business analytics
in the real world. Predicting customer behavior toward certain product features and applying business
analytics in the supply chain to predict constraints, such as raw material lead times, are very common
examples. Analytics is also widely applied in retail and in predicting trends on social media.
Customer Analytics
Predicting consumer behavior is the key to all marketing campaigns. Market segmentation, customer
relationship management, pricing, product design, promotion, and direct marketing campaigns can benefit
to a large extent if consumer behavior can be predicted with reasonable accuracy. Companies with direct
interaction with customers collect and analyze a lot of consumer-related data to get valuable business
insights that may be useful in positively affecting sales and marketing strategies. Retailers such as Amazon
and Walmart have a vast amount of transactional data available at their disposal, and it contains information
about every product and customer on a daily basis. These companies use business analytics techniques
effectively for marketing, pricing policies, and campaign designs, which enable them to reach the right
customers with the right products. They understand customer needs better using analytics. They can replace
less-efficient products with better-selling ones. Many companies are also tapping the power of social
media to get meaningful data, which can be used to analyze consumer behavior. The results of these analyses
can also be used to design more personalized direct marketing campaigns.
Operational Analytics
Several companies use operational analytics to improve existing operations. It is now possible to look
into business processes in real time for any given time frame, with companies having enterprise resource
planning (ERP) systems such as SAP, which give an integrated operational view of the business. Drilling
down into history to re-create the events is also possible. With proper analytics tools, this data is used to
analyze root causes, uncover trends, and prevent disasters. Operational analytics can be used to predict lead
times of shipments and other constraints in supply chains. Some software can present a graphical view of
the supply chain, which can depict possible constraints in events such as shipments and production delays.
Social Media Analytics
Millions of consumers use social media at any given time. Whenever a new mobile phone or a movie, for
instance, is launched in the market, millions of people tweet about it almost instantly, write about their
feelings on Facebook, and give their opinions in the numerous blogs on the World Wide Web. This data, if
tapped properly, can be an important source to uncover user attitudes, sentiments, opinions, and trends.
Online reputation and future revenue predictions for brands, products, and effectiveness of ad campaigns
can be determined by applying proper analytical techniques on these instant, vast, and valuable sources of
data. In fact, many players in the analytics software market such as IBM and SAS claim to have products to
achieve this.
Social media analytics is simply text mining or text analytics in some sense. Unstructured text data
is available on social media web sites, which can be challenging to analyze using traditional analytics
techniques. (Describing text analytics techniques is out of scope for this book.)
Some companies are now using consumer sentiment analysis on key social media web sites such
as Twitter and Facebook to predict revenues from new movie launches or any new products introduced
in the market.
Data Used in Analytics
The data used in analytics can be broadly divided into two types: qualitative and quantitative. The
qualitative, discrete, or categorical data is expressed in terms of natural languages. Color, days of a week,
street name, city name, and so on, fall under this type of data. Measurements that are explained with the
help of numbers are quantitative data, or a continuous form of data. Distance between two cities expressed
in miles, height of a person measured in inches, and so on, are forms of continuous data.
This data can come from a variety of sources that can be internal or external. Internal sources include
customer transactions, company databases and data warehouses, e-mails, product databases, and the
like. External data sources can be professional credit bureaus, federal databases, and other commercially
available databases. In some cases, such as engineering analysis, a company may prefer to develop its own
data to solve an uncommon problem.
Selecting the data for a business analytics problem requires a thorough understanding of the
overall business and the problem to be resolved. Earlier sections discussed that an analytics model uses
data combined with the statistical techniques used to analyze it. Hence, the accuracy of any model is largely
dependent upon the quality of the underlying data and the statistical methods used to analyze it.
Obtaining data in a usable format is the first step in any model-building process. You need to first
understand the format and content of the raw data made available for building a model. Raw data may require
extraction from its base sources such as a flat file or a data warehouse. It may be available in multiple sources
and in a variety of formats. The format of the raw data may warrant separation of desired field values, which
otherwise appear to be junk or have little meaning in its raw form. The data may require a cleansing step as
well, before an attempt is made to process it further. For example, a gender field may have only two values of
male and female. Any other value in this field may be considered as junk. However, it may vary depending
upon the application. In the same way, a negative value in an amounts field may not be acceptable.
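The two cleansing rules mentioned above can be sketched in Python as follows. The rows, the field names, and the set of accepted gender values are hypothetical; as noted, the acceptable values depend on the application.

```python
# Hypothetical raw rows with a gender field and an amount field.
raw_rows = [
    {"gender": "male",   "amount": 120.0},
    {"gender": "F",      "amount": 80.0},    # unexpected gender code
    {"gender": "female", "amount": -15.0},   # negative amount
    {"gender": "female", "amount": 42.5},
]

# Assumed rule set for this example only.
VALID_GENDERS = {"male", "female"}

def is_clean(row):
    """A row passes if its gender value is recognized and its amount is not negative."""
    return row["gender"] in VALID_GENDERS and row["amount"] >= 0

clean = [r for r in raw_rows if is_clean(r)]
rejected = [r for r in raw_rows if not is_clean(r)]
print(len(clean), "clean rows;", len(rejected), "rejected rows")
```

Rejected rows are kept in a separate list rather than silently dropped, so that they can be reviewed or corrected.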
In some cases, the size of available data may be so large that it may require sampling to reduce it to a
manageable form for analysis purposes. A sample is a subset from the available data, which for all practical
purposes represents all the characteristics of the original population. The data sourcing, extraction,
transformation, and cleansing may eat up to 70 percent of total hours made available to a business analytics
project.
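A simple random sample can be drawn as in the Python sketch below. The population of customer IDs, the sample size of 1,000, and the fixed seed are assumptions chosen only to make the example reproducible.

```python
import random

# Hypothetical population: 100,000 customer IDs.
population = list(range(1, 100_001))

rng = random.Random(42)                  # fixed seed so the draw is repeatable
sample = rng.sample(population, k=1000)  # sampling without replacement

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

Comparing a summary statistic such as the mean between the sample and the population is one quick check that the sample is reasonably representative.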
Big Data vs. Conventional Business Analytics
Conventional analytical tools and techniques are inadequate to handle data that is unstructured (like text
data), that is too large in size, or that is growing rapidly like social media data. A cluster analysis on a 200MB
file with 1 million customer records is manageable, but the same cluster analysis on 1000GB of Facebook
customer profile information will take a considerable amount of time if conventional tools and techniques
are used. Facebook as well as entities like Google and Walmart generate data in petabytes every day.
Distributed computing methodologies might need to be used in order to carry out such analysis.
Introduction to Big Data
The SAS web site defines big data as follows:
Big data is a popular term used to describe the exponential growth and availability of
data, both structured and unstructured. And big data may be as important to business—
and society—as the Internet has become. Why? More data may lead to more accurate
analyses.
It further states, “More accurate analyses may lead to more confident decision making. And better
decisions can mean greater operational efficiencies, cost reductions, and reduced risk.”
This definition refers to big data as a popular term that is used to describe data sets of large volumes
that are beyond the limits of commonly used desktop database and analytical applications. Sometimes even
server-class databases and analytical applications fail to manage this kind of data set.
Wikipedia describes big data sizes as a constantly moving target, ranging from a few dozen terabytes
to some petabytes (as of 2012) in a single data set. The size of big data may vary from one organization to
the other, depending on the capabilities of the software commonly used to process the data set in its
domain. For some organizations, only a few hundred gigabytes of data may require reconsidering
their data processing and analysis systems, while some may feel quite at home with even hundreds of
terabytes of data.
CONSIDER A FEW EXAMPLES
CERN’s Large Hadron Collider experiments deal with 500 quintillion bytes of data per day, which is
200 times more than all other sources combined in the world.
In Business
eBay.com uses a 40-petabyte Hadoop cluster to support its merchandising, search, and consumer
recommendations.
Amazon.com deals with some of the world’s largest Linux databases, which measure up to 24 terabytes.
Walmart’s consumer transactions run to 1 million per hour. Its databases are estimated to
hold 2.5 petabytes.
All this, of course, is extremely large data. It is almost impossible for conventional database and
business applications to handle it.
The industry has two more definitions for big data.
• Big data is a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing
applications.
• Big data is the data whose scale, diversity, and complexity requires new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden
knowledge from it.
In simple terms, big data cannot be handled by conventional data-handling tools, and big data analysis
cannot be performed using conventional analytical tools. Big data tools that use distributed computing
techniques are needed.
Big Data Is Not Just About Size
Gartner defines the three v’s of big data as volume, velocity, and variety. So far, only the volume aspect of big
data has been discussed. Velocity, the speed with which the data is getting created, is also important.
Consider the familiar example of the Large Hadron Collider experiments, which annually generate 150 million
petabytes of data, or roughly 400EB (1EB = 1073741824GB) per day. Per-hour transactions for Walmart
are more than 1 million.
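As a rough check on rates of this size, the Python sketch below converts an annual volume of 150 million petabytes into a daily figure in exabytes and scales 1 million transactions per hour to a day. It assumes the binary convention (1EB = 1,024PB = 1,073,741,824GB) and a 365-day year; the variable names are illustrative only.

```python
# Binary unit convention, consistent with 1EB = 1073741824GB.
GB_PER_EB = 1024 ** 3
PB_PER_EB = 1024

annual_pb = 150_000_000            # annual volume quoted for the collider experiments
daily_pb = annual_pb / 365         # assumes a 365-day year
daily_eb = daily_pb / PB_PER_EB

transactions_per_hour = 1_000_000  # lower bound quoted for Walmart
transactions_per_day = transactions_per_hour * 24

print(f"{daily_pb:,.0f} PB per day, about {daily_eb:,.0f} EB per day")
print(f"{transactions_per_day:,} transactions per day")
```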
The third v is variety. This dimension refers to the type of formats in which the data gets generated. It can be
structured, numeric or non-numeric, text, e-mail, customer transactions, audio, and video, to name just a few.
In addition to these three v’s, some like to include veracity while defining big data. Veracity covers the
biases, noise, and deviation that are inherent in most big data sets. It is a more common concern for data generated by
social media web sites. The SAS web site also counts data complexity as one of the factors for defining big data.
Gartner’s definition of the three v’s has almost become an industry standard when it comes to defining
big data.
Sources of Big Data
Some of the big data sources have already been discussed in the earlier sections. Advanced science studies
in environmental sciences, genomics, microbiology, quantum physics, and so on, are the sources of data sets
that may be classified in the category of big data. Scientists are often struck by the sheer volume of data sets
they need to analyze for their research work. They need to continuously innovate ways and means to store,
process, and analyze such data.
Daily customer transactions with retailers such as Amazon, Walmart, and eBay also generate large
volumes of data at amazing rates. This kind of data mainly falls under the category of structured data.
Unstructured text data such as product descriptions, book reviews, and so on, is also involved. Healthcare
systems also add hundreds of terabytes of data to data centers annually in the form of patient records and
case documentations. Global consumer transactions processed daily by credit card companies such as Visa,
American Express, and MasterCard may also be classified as sources of big data.
The United States and other governments also are big sources of data generation. They need the power
of some of the world’s most powerful supercomputers to meaningfully process the data in reasonable time
frames. Research projects in fields such as economics and population studies, conducted by the World Bank,
UN, and IMF, also consume large amounts of data.
More recently, social media sites such as Facebook, Twitter, and LinkedIn are presenting some great
opportunities in the field of big data analysis. These sites are now among some of the biggest data generation
sources in the world. They are mainly the sources of unstructured data. Data forms included here are text
data such as customer responses, convers