EM623-Week4a
Carlo Lipizzi
clipizzi@stevens.edu
SSE
2016
Machine Learning and Data Mining
Data mining specific tools: introduction to R with Rattle GUI
2013 Data Miner Survey: Overview
• 6th survey since 2007
• 68 questions
• 10,000+ invitations emailed, plus promotion by newsgroups, vendors, and bloggers
• Respondents: 1,259 data miners from 75 countries
• Data collected in first half of 2013
• Respondents by employment setting: Academics, Vendors*, NGO / Gov’t (6%), Corporate, Consultants [the remaining chart values 35%, 26%, 15%, 18% are not matched to labels in this extraction]
• Respondents by region:
– North America (41%): USA 37%, Canada 3%
– Europe (41%): Germany 8%, UK 5%, France 4%, Poland 3%
– Asia Pacific (11%): India 5%, Australia 3%
– Central & South America (4%): Brazil 2%
– Middle East & Africa (3%)
*Data from software vendors is excluded from analyses in this presentation unless otherwise noted
Some Key Findings
• FOCUS ON CRM: In the past few years, there has been an increase among data
miners in the already substantial area of customer-focused analytics.
Respondents are looking for a better understanding of customers and seeking to
improve the customer experience. This can be seen in their goals, analyses, big
data endeavors, and in the focus of their text mining.
• BIG DATA: Many in the field are talking about the phenomenon of Big Data.
There are clearly some areas in which the volume and sources of data have
grown. However, it is unclear how much Big Data has impacted the typical data
miner. While data miners believe that the size of their datasets has increased
over the past year, data from previous surveys indicate that the size of datasets
has been fairly consistent over time.
• THE ASCENDANCE OF R: The proportion of data miners using R is rapidly growing,
and since 2010, R has been the most-used data mining tool. While R is frequently
used along with other tools, an increasing number of data miners also select R as
their primary tool.
Applications
Question: In what fields do you TYPICALLY apply data mining? (Select all that apply)
[Bar chart comparing 2013 and 2011 responses for: CRM/Marketing, Academic, Pharmaceutical, Government, Manufacturing, Internet-based, Medical, Technology, Insurance, Telecommunications, Retail, Financial. Individual bar values are not recoverable from this extraction.]
Data miners also report working in Non-profit (5%), Hospitality / Entertainment / Sports (4%), Military / Security (2%), and Other (10%).
• CRM / Marketing remains the #1 area to which data mining is applied.
• The roots of data mining in customer-focused analytics are strong. In each of the 6 Data Miner Surveys, more people report applying their analytics in the field of CRM / Marketing than in any other field.
• In 2013, 36% of data miners indicated that they are commonly involved in CRM / Marketing data mining, up slightly from 2011. The number of data miners working in the overlapping area of Retail analytics is also increasing.
Customer Transactions: #1 Source of Large Data
Question: What are the sources of data for your large datasets? (Select all that apply)
[Bar chart "Sources of Large Data" covering: Customer transaction data, Audio data, RFID, Image or video data, Mobile device data, Geospatial data, Web log or click stream data, Timeseries via sensors, Call center data, Social media data, Timeseries collected online, Text data. Customer transaction data is the largest at 60%; the other bar values are not matched to labels in this extraction.]
• Customer transactional data often affords the opportunity for a wide range of analytics due to the depth and scope of available data.
• Among respondents who reported increases in data volume, 60% identified customer transaction data as a source of their large data sets.
The Popularity of R
[Line chart "R Usage", 2007 to 2013, y-axis 0% to 80%. Callouts: 70% of data miners report using R; 24% of data miners select R as their primary tool.]
The proportion of data miners using R is rapidly growing, and since 2010, R has been
the most-used data mining tool. While R is frequently used along with other tools, an
increasing number of data miners also select R as their primary tool. Among data
miners who say they are likely to switch their primary package in the coming year, R is
frequently identified as the tool they plan to switch to, more than 2.5 times more
often than any other tool.
Priorities and Characteristics of R Users
• While data miners overall consider quality and accuracy of model performance, dependability of software, and data manipulation capabilities the most important factors when choosing a data mining tool, those using R as their primary tool identify the ability to write one’s own code as their most important priority.
• The quality of the user interface was rated as significantly less important by primary R users than by other data miners.
• Interestingly, there was no difference in the stated importance of cost of tool between those using R as their primary package and others. However, primary R users are more satisfied than other tool users with the cost of their software (see page 33). They are also more satisfied with the variety of available algorithms and the ability to modify algorithms to fine-tune analyses.
• While R is heavily used among data miners working in all settings, in corporate settings a smaller proportion of data miners report that R is their primary tool.
Important Factors in Selecting Software
R is primary tool:
#1: Ability to write own code
#2: Quality & accuracy of model performance
#3: Data manipulation capabilities
All data miners:
#1: Quality & accuracy of model performance
#2: Dependability of software
#3: Data manipulation capabilities
Tool Selection
• Data miners are a diverse group who are looking for different things from their data mining tools. They report using multiple tools to meet their analytic needs, and even the most popular tool is identified as their primary tool by just 24% of data miners. Over the years, R and Rapid Miner have shown substantial increases.
• Cluster analysis* reveals that, in their tool-selection preferences, data miners fall into 5 groups. The primary dimensions that distinguish them are price sensitivity and code-writing / interface / ease-of-use preferences.
[Cluster diagram: five groups labeled A to E, arranged from "Cost is not important" to "Cost is important" and from "Ease-of-use & interface quality are important" to "Ability to write one’s own code is important", plus a region labeled "Everything is important". Group sizes shown: 18%, 15%, 35%, 21%, 11%; which size belongs to which letter is not recoverable from this extraction.]
*Cluster analysis was conducted on data miners’ ratings of the importance of 22 tool selection factors.
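Tool Selection Groups
The five groups are listed in the order they were extracted; their mapping to cluster letters A to E is not recoverable here.
• Group 1
– Importance of cost: Very high
– Importance of ease-of-use: High
– Importance of user interface quality: High
– Importance of ability to write one’s own code: Low
– Primary tools: Rapid Miner (26%), IBM Modeler (12%), KNIME (11%)
– Tool use: R (62%), Rapid Miner (50%), IBM Statistics (40%), IBM Modeler (36%), Weka (33%)
– Working with Big Data: (no entry)
– Experience (years): Many new data miners
• Group 2
– Importance of cost: High
– Importance of ease-of-use: Low / Moderate
– Importance of user interface quality: Low
– Importance of ability to write one’s own code: Very high
– Primary tools: R (56%), SAS (10%)
– Tool use: R (90%), Weka (37%), SAS (33%), Matlab (31%)
– Working with Big Data: (no entry)
– Experience (years): Few new data miners
• Group 3
– Importance of cost: Moderate
– Importance of ease-of-use: Moderate
– Importance of user interface quality: High
– Importance of ability to write one’s own code: High
– Primary tools: R (26%), SAS (19%)
– Tool use: R (73%), SAS (43%), IBM Statistics (35%), Matlab (32%), SQL Server (32%), SAS-EM (32%)
– Working with Big Data: (no entry)
– Experience (years): (no entry)
• Group 4
– Importance of cost: Low / Moderate
– Importance of ease-of-use: High
– Importance of user interface quality: Very high
– Importance of ability to write one’s own code: Low
– Primary tools: STATISTICA (31%), IBM Modeler (20%), Rapid Miner (12%)
– Tool use: R (51%), IBM Statistics (38%), STATISTICA (37%), IBM Modeler (32%)
– Working with Big Data: Less Likely
– Experience (years): Many new data miners
• Group 5
– Importance of cost: Very high
– Importance of ease-of-use: Very high
– Importance of user interface quality: Very high
– Importance of ability to write one’s own code: High
– Primary tools: R (19%), STATISTICA (16%), KNIME (10%), Rapid Miner (10%)
– Tool use: R (73%), IBM Statistics (35%), Rapid Miner (34%), Weka (32%), SQL Server (30%), SAS (30%)
– Working with Big Data: More Likely
– Experience (years): Many experienced data miners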
Tool Use Varies by Employment Setting
• R, IBM SPSS Statistics, Rapid Miner, and SAS are the software tools used by the most data miners. The average data miner reports using 5 tools, but conducts 76% of their work in their primary tool. R, STATISTICA, Rapid Miner, and SAS are the primary data mining tools chosen most often. 64% of data miners also report writing their own code; the most common language is SQL (43%), followed by Java (26%) and Python (24%).
• The graphs below summarize the patterns of primary tool selection and overall tool usage, which vary by the setting in which data miners work (e.g., academics are heavier users of Weka and Matlab).
[Bar charts of tool usage, x-axis 0% to 80%, in five panels: Overall, Corporate, Consultants, Academics, NGO / Gov’t. Tools shown: R, IBM SPSS Statistics, Rapid Miner, SAS, Weka, Matlab, Microsoft SQL Server, IBM SPSS Modeler, SAS Enterprise Miner, KNIME, STATISTICA, Mathematica, Minitab, SAS JMP, IBM Cognos, Oracle Advanced Analytics, C45 / C50 / See5, Orange, SAP, Salford Systems, TIBCO S+ / Spotfire Miner, KXEN. Bar values are not recoverable from this extraction.]
R: Introduction
• R is “GNU S”: a language and environment for data manipulation, calculation and graphical
display. It is similar to the award-winning S system, which was developed at Bell Laboratories by
John Chambers et al.
• R provides:
– a suite of operators for calculations on arrays, in particular matrices
– a large, coherent, integrated collection of intermediate tools for interactive data analysis
– graphical facilities for data analysis and display, either directly at the computer or on
hardcopy
– a well-developed programming language which includes conditionals, loops, user-defined
recursive functions, and input and output facilities
• The core of R is an interpreted computer language.
– It allows branching and looping as well as modular programming using functions.
– Most of the user-visible functions in R are written in R, calling upon a smaller set of internal
primitives.
– It is possible for the user to interface to procedures written in C, C++ or FORTRAN languages
for efficiency, and also to write additional primitives.
What R does and does not do
What R does:
• data handling and storage: numeric, textual
• matrix algebra
• hash tables and regular expressions
• high-level data analytic and statistical functions
• classes (“OO”)
• graphics
• programming language: loops, branching, subroutines
What R does not do:
• is not a database, but connects to DBMSs
• has no graphical user interfaces, but connects to Java, TclTk
• language interpreter can be very slow, but lets you call your own C/C++ code
• no spreadsheet view of data, but connects to Excel/MsOffice
• no professional / commercial support
R and statistics
• Packaging: a crucial infrastructure to efficiently produce, load
and keep consistent software libraries from (many) different
sources / authors
• Statistics: most packages deal with statistics and data analysis
• State of the art: many statistical researchers provide their
methods as R packages
Usability
What is Usability?
• Ease of learning
• Ease of use
• Ease of remembering
• Subjective satisfaction
• Efficiency of use
• Effectiveness of use
Usability Engineering
• Usability Engineering (UE):
– Processes to build “Usability” into products
– Various methods can be used throughout the design lifecycle
– Methods can be incorporated into design process easily
– Methods maintain focus on user throughout design
Benefits of UE to an Organization
• Reduce training costs
• Reduce development costs
• Identify and fix problems early
• Reduce support costs; minimize need for
– support personnel/help desks
– fixes, maintenance, upgrades
• Enhance organization’s reputation: positive “word-of-mouth” trade
• Higher “hit” and “return visit” rates
Benefits of UE to the User
• Less time to complete work
• Greater success with tasks
• Increased user satisfaction
GUI Design is Multi Disciplinary
• A team includes
– Analyst
– Designer
– Technology expert
– Graphic artist
– Social and behavioral scientist
– Programmer
Usability Design Process
GUI for R
• Statistics can be complex and traps await
• So many tools in R to deliver insights
• Effective analyses should be scripted
• Scripting also required for repeatability
• R is a language for programming with data
• How to remember how to do all of this in R?
• How to skill up 150 data analysts with Data Mining?
Poll from KDnuggets.com
Deducer
• “An intuitive, cross-platform graphical data analysis system”
• Deducer is based upon rJava and provides access to the Java Swing GUI toolkit
Related packages:
• DeducerExtras: Additional dialogs and functions for Deducer
• DeducerPlugInExample: Deducer Plug-in Example
• DeducerPlugInScaling: Reliability and factor analysis plugin
• DeducerSpatial: Deducer for spatial data analysis
• DeducerSurvival: Add Survival Dialogue to Deducer
• DeducerText: Deducer GUI for Text Data
Rattle
• Today, Rattle is used worldwide in many industries
– Health analytics
– Customer segmentation and marketing
– Fraud detection
– Government
• It is used by
– Consultants and Analytics Teams across business
– Universities to teach Data Mining
• It is and will remain freely available.
• Available from CRAN and http://rattle.togaware.com
Walk-through
• Loading Data
• Basic Data Exploration
• Explore Distribution
• Explore Correlations
GUI limitations
• Data miners are programmers of data
• A GUI can only do so much
• R is a powerful statistical language
• Professional data mining
– Scripting
– Transparency
– Repeatability
• The log tab: a bridge from GUI to CLI (Command Line Interface)
Exercise 1 – Companies Values – 15’
• Data set: Companies1.xlsx
– 25 companies
• For the 3 numeric variables calculate:
– Mean, Mode, Median, Max, Min, Range
• Can you get any non-explicit information from the values you calculated?
• Can you create any new variables to get more from your data?
What is your goal?
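The six summary statistics requested in Exercise 1 can be sketched in a few lines. The example below uses Python's standard library purely for illustration (in this course the same values would come from R or the Rattle GUI). The `describe` helper and the sample values are hypothetical and are not taken from Companies1.xlsx.

```python
from statistics import mean, median, multimode

def describe(values):
    """Return the summary statistics asked for in the exercise."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),  # a list, since a variable can have several modes
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
    }

# Made-up values for illustration only; not taken from Companies1.xlsx.
sample = [12, 15, 15, 20, 48]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is one way to get at "non-explicit" information: in this made-up sample the mean is pulled well above the median by a single large value.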
Exercise 2 on Handling Missing Data – 15’
• Examine the cars.txt dataset, which contains records for 261
automobiles manufactured in the 1970s and 1980s
• Examine the file and handle missing data
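One common way to handle missing values is to count them first and then replace them with the median of the observed values. The sketch below illustrates that idea in Python (standard library only); it is not the only valid strategy, and dropping incomplete records is another. The missing-value markers, the helper names, and the sample column are assumptions made for illustration; check how cars.txt actually encodes missing entries.

```python
from statistics import median

# Assumed markers for a missing cell; cars.txt may use something different.
MISSING_TOKENS = {"", "NA", "?"}

def parse(cell):
    """Convert a raw text cell to float, or None if it is marked as missing."""
    cell = cell.strip()
    return None if cell in MISSING_TOKENS else float(cell)

def count_missing(column):
    """Count the missing (None) entries in a column."""
    return sum(1 for v in column if v is None)

def impute_median(column):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

# Made-up mpg cells for illustration only; not taken from cars.txt.
raw_mpg = ["14.0", "NA", "31.9", "", "17.0"]
mpg = [parse(c) for c in raw_mpg]
print("missing:", count_missing(mpg))
filled = impute_median(mpg)
print("after imputation:", filled)
```

The median is used here rather than the mean because it is less sensitive to extreme values, which matters if the column also contains outliers.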
Exercise 3 on Handling Outliers – 15’
• Examine the cars_full.txt dataset, which contains full records
for 261 automobiles manufactured in the 1970s and
1980s
• Examine the file for outliers
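A simple screen for outliers is the interquartile-range (IQR) rule: flag any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The sketch below shows that rule in Python (standard library only). The slide does not prescribe a method, so this is one option among several (z-scores and boxplots are others). The function name, the multiplier `k`, and the sample values are illustrative assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up time-to-60 values in seconds; not taken from cars_full.txt.
time_to_60 = [8, 9, 10, 10, 11, 12, 12, 13, 40]
outliers = iqr_outliers(time_to_60)
print("flagged as outliers:", outliers)
```

A flagged value is only a candidate: whether it is a data-entry error or a genuinely unusual car still has to be judged from the record itself.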
Normalization
• Using the cars_full.txt dataset, normalize the “time-to-60” values
• Use the [0-1] scale
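Scaling to [0-1] is usually done with min-max normalization, x' = (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1. A minimal Python sketch follows, assuming that is the intended scaling; the function name and sample values are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]

# Made-up time-to-60 values in seconds; not taken from cars_full.txt.
time_to_60 = [8, 10, 12, 16, 24]
scaled = min_max_normalize(time_to_60)
print(scaled)
```

Because the minimum and maximum define the scale, a single outlier can compress all the other values into a narrow band, so it is worth checking for outliers before normalizing.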
Exercise 4 on Data Analysis
• Using the cars_full.txt dataset
• Class will be divided into 3 groups, focusing on
• Mileage (mpg)
• Power (hp)
• Performance (time-to-60)
• Questions:
• Which variables are most strongly correlated? Explain
• What is the impact of “brand” (Country of origin)?
• Are there differences in normalized vs non-normalized
analyses (apply normalization to all the variables)?
• Each group will present their results
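To compare variables, one option is the Pearson correlation coefficient, computed before and after min-max normalization. The Python sketch below assumes Pearson correlation is the measure of interest; the helper names and the sample mpg and hp values are made up for illustration and do not come from cars_full.txt.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def min_max(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values for illustration only; not taken from cars_full.txt.
mpg = [14, 18, 24, 30, 36]
hp = [200, 150, 110, 90, 70]

r_raw = pearson(mpg, hp)
r_norm = pearson(min_max(mpg), min_max(hp))
print(f"raw: {r_raw:.4f}  normalized: {r_norm:.4f}")
```

Pearson correlation is unchanged by a linear rescaling with a positive factor, so the two printed values agree up to rounding. Analyses that depend on the scale of the variables, such as distance-based methods, can behave differently once the data are normalized.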