DSCC 201/401
Tools and Infrastructure for Data Science
March 15, 2021
R
• Brief history and overview
• R interfaces
• Language syntax and examples
• Useful libraries
Objects in R
• Scalars and Characters: 1
• Vectors: c(1,2,3)
• Matrices: matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
• Lists (i.e. “heterogeneous vectors”): list(1,2,3,"hello",sqrt)
• Data Frames (“table” or “heterogeneous matrix”)
• Factors (“categorical data”)
• Functions (operations on objects)
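A minimal sketch of how each object type listed above can be created and inspected in base R. The variable names and sample values (`df`, `sizes`, and so on) are illustrative assumptions, not taken from the slides.

```r
# Illustrative values only; all variable names here are hypothetical.
x <- 1                                   # numeric scalar (a length-1 vector)
s <- "hello"                             # character scalar
v <- c(1, 2, 3)                          # numeric vector
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)  # filled column by column
l <- list(1, 2, 3, "hello", sqrt)        # list: elements may differ in type

# Data frame: columns may have different types, rows are observations
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Factor: categorical data with a fixed set of levels
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))

dim(m)          # 2 3
m[1, 2]         # 3, because the matrix is filled column by column
l[[5]](16)      # calls the sqrt function stored in the list: 4
levels(sizes)   # "small" "large"
```

Note that R has no separate scalar type: `x` and `s` are simply vectors of length one.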
Data Pre-Processing
• One of the most essential functions before data analysis can be performed
• Data pre-processing can be categorized into 4 main operations:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Data Pre-Processing: Data Cleaning
• Data often needs to be “cleaned” before it can be used for a useful analysis
• Examples of how data can be identified as “dirty”
• Incomplete data – missing values or missing attributes
• Noisy data – contains obvious errors or many outliers
• Inconsistent data – contains discrepancies in codes and letters
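The three kinds of "dirty" data above can be sketched on a small made-up data frame. The column names, the values, and the plausible age range of 0 to 120 are all assumptions chosen for illustration.

```r
# Hypothetical survey data showing all three kinds of "dirty" data
raw <- data.frame(
  age    = c(25, NA, 31, 250, 47),              # NA = missing, 250 = obvious error
  gender = c("M", "male", "F", "f", "Female"),  # inconsistent codes
  stringsAsFactors = FALSE
)

# Incomplete data: count the missing values
n_missing <- sum(is.na(raw$age))

# Noisy data: treat implausible ages as missing (the 0-120 range is an assumption)
clean <- raw
clean$age[!is.na(clean$age) & (clean$age < 0 | clean$age > 120)] <- NA

# Inconsistent data: map every spelling to a single one-letter code
first_letter <- toupper(substr(clean$gender, 1, 1))
clean$gender <- factor(first_letter, levels = c("F", "M"))

# Drop the rows that still contain missing values
clean <- na.omit(clean)
nrow(clean)   # 3
```

Dropping rows is only one option; depending on the analysis, missing values might instead be imputed.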
Data Pre-Processing: Data Integration
• Data often needs to be combined or integrated from multiple data sources before an analysis can be performed
• Examples of how data integration can be performed:
• Combine data from different data sources into a common storage type, e.g. CSV (comma-separated values) file
• Perform a schema integration and add data to a common data structure for analysis
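As a rough sketch of both examples, the code below aligns the key column of two made-up sources, merges them, and stores the result as CSV. The data frames `orders` and `crm` and their column names are hypothetical.

```r
# Two hypothetical sources that describe the same customers
orders <- data.frame(cust_id = c(1, 2, 2, 3), amount = c(10.5, 20.0, 5.25, 8.0))
crm    <- data.frame(customer = c(1, 2, 4), region = c("east", "west", "north"),
                     stringsAsFactors = FALSE)

# Schema integration: give the key column the same name in both sources
names(crm)[names(crm) == "customer"] <- "cust_id"

# Combine on the shared key; all.x = TRUE keeps orders without a CRM match
combined <- merge(orders, crm, by = "cust_id", all.x = TRUE)

# Store the integrated data in a common format (CSV) and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(combined, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Customer 3 has no entry in `crm`, so its `region` is `NA` after the merge; customer 4 has no orders and is dropped.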
Data Pre-Processing: Data Transformation
• Data often needs to be transformed so it is consistent and in the appropriate format for analysis
• Examples of data transformation:
• Data Smoothing – Removing noise from the data
• Data Normalization – Scaling data to fit a specified range
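Both transformations can be sketched in a few lines of base R. The moving average is only one of many possible smoothing methods, and the helper `rescale` and the sample vector `y` are hypothetical names introduced here.

```r
# Hypothetical noisy measurements
y <- c(2, 4, 3, 8, 6, 7, 12, 9)

# Data smoothing: centered 3-point moving average.
# stats::filter returns NA at both ends, where the window is incomplete.
smoothed <- as.numeric(stats::filter(y, rep(1 / 3, 3), sides = 2))

# Data normalization: min-max scaling to a specified range [lo, hi]
rescale <- function(x, lo = 0, hi = 1) {
  lo + (x - min(x)) / (max(x) - min(x)) * (hi - lo)
}

scaled_01 <- rescale(y)            # values in [0, 1]
scaled_pm <- rescale(y, -1, 1)     # values in [-1, 1]

smoothed[2]        # (2 + 4 + 3) / 3 = 3
range(scaled_01)   # 0 1
```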
Data Pre-Processing: Data Reduction
• Data can be reduced so it produces the same or similar analytical result
• Advantages include working with a smaller data set or the ability to use different algorithms for analysis
• Examples of data reduction:
• Data Aggregation – Only use the necessary features for the analysis from a very large data set
• Data Discretization – Combine ranges of numerical or labeled data into common sets (e.g. “binning”)
• Data Dimensionality Reduction – Decrease the number of variables needed to perform the analysis (e.g. principal component analysis)
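The three reduction techniques above can be tried on R's built-in `iris` data set. This is a sketch only: the choice of data set, the bin boundaries, and the decision to keep two principal components are assumptions made for illustration.

```r
# Built-in iris data set: 150 rows, 4 numeric measurements, 1 species label
data(iris)

# Data aggregation (in the sense used above): keep only the features needed
subset_df <- iris[, c("Sepal.Length", "Species")]

# A related form of aggregation: summarize many rows into one value per group
by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Data discretization: bin a numeric column into labeled ranges.
# The break points are arbitrary choices for illustration.
bins <- cut(iris$Sepal.Length, breaks = c(4, 5, 6, 7, 8),
            labels = c("4-5", "5-6", "6-7", "7-8"))
table(bins)

# Dimensionality reduction: principal component analysis on the numeric columns
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
reduced <- pca$x[, 1:2]          # keep the first two principal components
summary(pca)                     # proportion of variance explained per component
```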
Statistics with R
• Summary Statistics
• Hypothesis Tests (t-Test)
• Probability Distributions
Hypothesis Tests (t-Test)
• One-sample location test of whether the mean of a population equals the value specified in a null hypothesis (one-sample t-Test)
• Two-sample location test of the null hypothesis that the means of two populations are equal (Student’s t-Test)
• When the variances of the two populations are not assumed to be equal, use Welch’s t-Test
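All three variants are available through the base R function `t.test`. The sketch below uses simulated data; the group names, sample sizes, and distribution parameters are assumptions chosen only to make the example runnable.

```r
set.seed(1)                                     # reproducible simulated data
control   <- rnorm(20, mean = 5.0, sd = 1.0)    # hypothetical control group
treatment <- rnorm(20, mean = 5.5, sd = 2.0)    # hypothetical experimental group

# One-sample t-test: is the population mean of `control` equal to 5?
one_sample <- t.test(control, mu = 5)

# Two-sample Student's t-test: assumes equal variances
student <- t.test(control, treatment, var.equal = TRUE)

# Welch's t-test: does not assume equal variances (the default in R)
welch <- t.test(control, treatment)

student$parameter   # degrees of freedom: n1 + n2 - 2 = 38
welch$parameter     # Welch-Satterthwaite approximation, never more than 38
```

Because `var.equal` defaults to `FALSE`, calling `t.test` on two samples without further arguments performs Welch's test.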
Hypothesis Tests
• t-Test is useful to compare one variable between two groups
• Usually between a control group and experimental group
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
Example (t-Test)
• Quality control for Diet Coke
• Taste testers rate the sweetness before and after storage (“sweetness score” 1-10)
• Results from 10 testers (difference in score before-after storage):
2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3
• Null Hypothesis: μ = 0
• Alternative Hypothesis: μ > 0
• Should we be concerned that the drink is losing sweetness?
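The question can be checked with a one-sided, one-sample t-test on the ten differences. This is a sketch using base R's `t.test`; the 5% significance level mentioned in the comments is a conventional assumption, since the slide does not state one.

```r
# Differences in sweetness score (before - after storage) from the 10 testers
diffs <- c(2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3)

# One-sided, one-sample t-test of H0: mu = 0 against Ha: mu > 0
result <- t.test(diffs, mu = 0, alternative = "greater")

# The same statistic computed by hand: t = (x_bar - mu) / (s / sqrt(n))
t_manual <- (mean(diffs) - 0) / (sd(diffs) / sqrt(length(diffs)))

mean(diffs)        # 1.02
result$statistic   # about 2.70, identical to t_manual
result$p.value     # about 0.012, below the conventional 0.05 level
```

At the assumed 5% level the null hypothesis is rejected, so the data do suggest that the drink loses sweetness during storage.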
Probability Distributions
• Probability Density Function
$$P\{X \in B\} = \int_B f(x)\,dx$$
• Cumulative Distribution Function
$$F(a) = P\{X \in (-\infty, a]\} = \int_{-\infty}^{a} f(x)\,dx$$
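The relationship between the density and the cumulative distribution function can be checked numerically. The sketch below assumes a standard normal distribution, with the set B = [-1, 1] and the point a = 1 chosen arbitrarily, and uses base R's `integrate` for the integrals.

```r
# For the standard normal distribution, the density f(x) is dnorm(x)

# P{X in B} for B = [-1, 1]: integrate the density over B
p_interval <- integrate(dnorm, lower = -1, upper = 1)$value

# F(a) = P{X <= a}: integrate the density from -Inf up to a
a <- 1
cdf_numeric <- integrate(dnorm, lower = -Inf, upper = a)$value

# Both results agree with the built-in CDF pnorm up to numerical error
abs(p_interval - (pnorm(1) - pnorm(-1))) < 1e-4
abs(cdf_numeric - pnorm(a)) < 1e-4
```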
Probability Distributions
• Normal
• Exponential
• Log-Normal
• Poisson
• Uniform
• …
R Conventions for Distributions
• PDFs begin with d
• CDFs begin with p
• Random number generators begin with r
• Quantile functions begin with q
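The naming convention can be seen by attaching each prefix to a distribution's root name, such as `norm`, `pois`, `exp`, `unif`, or `lnorm`. The parameter values below are arbitrary choices for illustration.

```r
# Normal distribution: the d/p/q/r prefixes are attached to the root "norm"
dnorm(0)                    # density at 0: 1 / sqrt(2 * pi), about 0.3989
pnorm(1.96)                 # P(X <= 1.96), about 0.975
qnorm(0.975)                # inverse of pnorm, about 1.96

set.seed(42)                # make the random draws reproducible
draws <- rnorm(5, mean = 10, sd = 2)   # five random values

# The same convention applies to the other distributions
dpois(2, lambda = 3)            # P(X = 2) for a Poisson(3) variable
pexp(1, rate = 2)               # P(X <= 1) for Exponential(2): 1 - exp(-2)
qunif(0.25, min = 0, max = 8)   # 25th percentile of Uniform(0, 8): 2
rlnorm(3)                       # three log-normal random values
```

For discrete distributions such as the Poisson, the `d` function returns a probability mass rather than a density.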
R Control Structures and Functions
• Functions
• Conditionals
• Loops
Functions
myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}
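A concrete instance of the template above might look as follows. The function `standardize` is a hypothetical helper written for this example; it shows a default argument value and an explicit `return`.

```r
# Hypothetical helper: standardize a numeric vector to z-scores.
# `na.rm` has a default value, so callers may omit it.
standardize <- function(x, na.rm = FALSE) {
  centered <- x - mean(x, na.rm = na.rm)
  result <- centered / sd(x, na.rm = na.rm)
  return(result)
}

standardize(c(1, 2, 3))                     # -1 0 1
standardize(c(1, 2, 3, NA), na.rm = TRUE)   # -1 0 1 NA
```

If `return` is omitted, an R function returns the value of the last expression it evaluates.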
Conditionals
if (condition) expr
if (condition) expr1 else expr2
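The two forms above can be chained with `else if`. The function `sign_label` below is a hypothetical example; the sketch also shows the vectorized `ifelse`, which is needed when the condition covers a whole vector.

```r
# Hypothetical function that labels a single number
sign_label <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

sign_label(-3)   # "negative"
sign_label(0)    # "zero"

# `if` expects a single TRUE/FALSE value; for whole vectors use ifelse()
ifelse(c(-1, 0, 2) > 0, "positive", "not positive")
# "not positive" "not positive" "positive"
```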
Loops
i <- 0
while (i < 10) {
  i <- i + 1
}
for (i in 1:9) {
  print(i)
}
Functional Programming
• Apply a function to a set of data
• For example, apply a function to a vector of numerical data
• sapply
• Functional programming can be done in parallel with mclapply from the parallel package
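A minimal sketch of both functions applied to a small numeric vector. The vector `values` and the choice of two worker processes are assumptions; `mclapply` relies on process forking, which is not available on Windows, so the example falls back to one core there.

```r
library(parallel)   # ships with R; provides mclapply

values <- c(1, 4, 9, 16)

# sapply applies the function to each element and simplifies to a vector
roots <- sapply(values, sqrt)                   # 1 2 3 4

# An anonymous function works the same way
squares <- sapply(values, function(v) v^2)      # 1 16 81 256

# mclapply has the same interface as lapply but forks worker processes.
# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_roots <- mclapply(values, sqrt, mc.cores = n_cores)   # returns a list

unlist(par_roots)                               # 1 2 3 4
```

For a task as cheap as `sqrt` the overhead of forking outweighs any gain; parallel execution pays off only when each function call is expensive.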