DSME5110F: Statistical Analysis
Lecture 2 Descriptive Statistics: Tabular and Graphic Methods
Outline
• Classification of Data
• Univariate
– Presenting Qualitative Variables
– Presenting Quantitative Variables
• Bivariate Relationships
– Cross Tabulations
– Scatter Plots
Classification of Data
Classification of Data: Qualitative or Quantitative?
• Qualitative data are measurements that can be categorized into one of several classifications. Qualitative data are typically labels or names used to identify a quality of each element
– Sometimes they are numerically represented, but the numbers are labels and we cannot perform quantitative operations on them in the usual ways.
– Nominal-scaled data indicate into which category a data element should be placed (in R, it is called a factor)
– Ordinal-scaled data have all the properties of nominal-scaled data but in addition have the quality that order is meaningful (in R, it is an ordered factor)
• Quantitative data are those that can be characterized as metric rather than nominal; they are represented in terms of how many or how much
– Interval-scaled data express intervals between values in terms of fixed units of measurement: for example, temperature
– Ratio-scaled data have the properties of interval-scaled data and, in addition, the ratio of two values is meaningful because zero indicates the absence of whatever is being measured: for example, distance, height, weight
Qualitative Data in R
• Factors and Levels
– A factor is a special type of vector, normally used to hold categorical or ordinal variables.
– Why not use character vectors?
• Category labels are stored only once:
– Character: store Pen, Pencil, Pen
– Factor: store 1, 2, 1 (hence reduce the size of memory)
• Many algorithms treat nominal and numeric data differently
– Coding as factors is often needed to tell an R function to treat categorical data appropriately
– The levels of a factor comprise the set of possible categories the factor can take (e.g., Pen or Pencil)
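The storage idea above can be checked directly in R. The following is a minimal sketch; the vector items and its values are hypothetical and simply mirror the Pen/Pencil illustration.

```r
# Hypothetical data: stationery items recorded as text.
items <- c("Pen", "Pencil", "Pen")
items_f <- factor(items)

levels(items_f)       # the distinct categories, stored once
as.integer(items_f)   # the integer codes that point into the levels
nlevels(items_f)      # the number of categories

# Converting back to text recovers the original labels.
identical(as.character(items_f), items)
```

The integer codes only index the levels; they carry no numeric meaning of their own.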
Factors and Levels
• Factors and levels: Create a factor by factor(vector)
– X <- c("AB", "A", "B", "A")
– (Xf <- factor(X))
– Levels are ordered alphabetically by default;
– str(Xf) # Internally factors are represented by integers 1, 2, 3, ...
– levels(Xf)
– table(Xf) # table(X) also works
• One can anticipate future new levels, but one cannot sneak in an “illegal” level
– (Xf <- factor(X, levels = c("A", "B", "AB", "O")))
– Xf[2] <- "O" # works
– Xf[2] <- "C" # warning: invalid factor level, NA generated
Ordinal Data (Ordered Factors)
• Indicate the presence of ordinal data by setting the ordered parameter to TRUE
• Compare:
– (quality1 <- factor(c("Average", "Good", "Good", "Bad"), levels = c("Bad", "Average", "Good")))
– (quality2 <- factor(c("Average", "Good", "Good", "Bad"), levels = c("Bad", "Average", "Good"), ordered = TRUE))
• Comparison operators work for ordered factors
– quality1 > "Average" # warning, returns NA: '>' is not meaningful for unordered factors
– quality2 > "Average" # works
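Once a factor is ordered, comparisons can be used for filtering and for finding extremes. A short sketch, reusing the same quality values (the object name quality is chosen here for illustration):

```r
quality <- factor(c("Average", "Good", "Good", "Bad"),
                  levels = c("Bad", "Average", "Good"),
                  ordered = TRUE)

quality > "Average"             # element-wise comparison against a level
quality[quality >= "Average"]   # keep observations rated Average or better
max(quality)                    # highest rating observed
table(quality)                  # counts listed in level order, not alphabetically
```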
Differences Between Nominal and Numerical Data
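– xNum <- c(1, 2, 3, 1)
– yNum <- as.factor(xNum)
– xNum[2] - xNum[1] # 1
– yNum[2] - yNum[1] # NA with a warning: '-' is not meaningful for factors
– summary(xNum) # numeric summary (min, quartiles, mean, max)
– summary(yNum) # counts per level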
• Identifying the data type may seem easy, but some could be either qualitative or quantitative depending on how you define them:
– Age seems to be obviously quantitative;
– How about "children", "young adults", "middle-aged persons" and "retirement-age people"?
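The same variable can be held either way in R. The sketch below converts numeric ages into ordered groups with cut(); the ages and the cut points (17, 39, 64) are assumptions made up for illustration, not definitions from the lecture.

```r
# Hypothetical ages and cut points; the group boundaries are an assumption.
age <- c(8, 17, 25, 34, 49, 61, 70)
age_group <- cut(age,
                 breaks = c(0, 17, 39, 64, Inf),
                 labels = c("children", "young adults",
                            "middle-aged persons", "retirement-age people"),
                 ordered_result = TRUE)

summary(age)       # quantitative view: min, quartiles, mean, max
table(age_group)   # qualitative view: counts per category
```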
Descriptive Statistics Depend on ...
Frequencies and Relative Frequencies
• Qualitative data aren’t inherently numerical, so we can’t perform numerical analysis on them
• Frequency distribution – tabular summary of the number of observations falling into each of 2 or more mutually exclusive, collectively exhaustive categories
• Relative frequency distribution – reports the proportion, rather than number, of observations in each category
Example 2.1: Countries of Origin of Students
• Suppose that an MBA class is composed of 3 students from Latin America, 4 from Europe, 8 from the US, 10 from India, and 15 from China. Enter the data into a vector with the following lines:
> countries <- rep(c('Latin', 'Europe', 'US', 'India', 'China'), c(3, 4, 8, 10, 15))
> countries<-factor(countries)
• In the above, rep() means ‘repeat’. The first vector consists of strings to be repeated and the second vector consists of the number of times each string should be repeated.
• To examine the first few elements of the data set, enter either or both of the lines below:
> head(countries, 10) # display the first 10 values of countries data
> tail(countries, 10) # display the last 10 values of countries data
Tabulate the Countries Data
• To calculate the frequency distribution (table()) and the relative frequency distribution (prop.table()), enter the following lines in R.
• Use sort() to sort elements by frequency; otherwise they will be ordered by factor level (alphabetically by default)
> freq <- table(countries) # Frequency count of each country
> freq2 <- sort(freq, decreasing = TRUE) # sort in decreasing order of frequency
> relative.freq <- prop.table(freq) # Relative frequency (proportion) of each country
> relative.freq2 <- sort(relative.freq, decreasing = TRUE) # sort in decreasing order of relative frequency
> countries_table <- cbind(freq2, relative.freq2)
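A quick sanity check on any frequency table is that the counts add up to the sample size and the relative frequencies add up to 1. The sketch below rebuilds the Example 2.1 data so that it runs on its own; the names rel_freq and the Percent column are choices made here, not part of the original code.

```r
countries <- factor(rep(c("Latin", "Europe", "US", "India", "China"),
                        c(3, 4, 8, 10, 15)))

freq <- sort(table(countries), decreasing = TRUE)
rel_freq <- prop.table(freq)

sum(freq)       # total number of students
sum(rel_freq)   # relative frequencies sum to 1

# One table with counts and percentages rounded to one decimal place.
cbind(Frequency = freq, Percent = round(100 * rel_freq, 1))
```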
Bar Plots (or Bar Graphs)
• Bar plot – typically bars on the horizontal axis represent different categories and the heights of the bars indicate the frequency, relative frequency, or percent frequency for each category
– Use barplot() function to generate bar plot
– Specify colours of columns with col=c()
– Horizontal axis label is xlab=
– Vertical axis label is ylab=
– Range of values assumed by vertical axis is ylim=c()
– Title of graph is main=
• To plot both the frequency and the relative frequency distributions calculated previously, enter the following lines:
> barplot(freq2, col = rainbow(5), main = 'Frequency Distribution', ylab = 'Frequency')
> barplot(relative.freq2, col = rainbow(5), main = 'Relative Frequency Distribution', ylab = 'Relative Frequency')
Example 2.1 (continued): Bar Plots
Note that the only difference between these two plots is the scale of the y-axis.
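The optional arguments listed for barplot() can be combined in one call. The following sketch rebuilds the Example 2.1 counts and also prints each count above its bar; the ylim range and the axis label text are assumptions chosen for this illustration.

```r
countries <- factor(rep(c("Latin", "Europe", "US", "India", "China"),
                        c(3, 4, 8, 10, 15)))
freq <- sort(table(countries), decreasing = TRUE)

# barplot() returns the x-coordinates of the bar midpoints,
# which can be used to place text above each bar.
mids <- barplot(freq, col = rainbow(5), ylim = c(0, 20),
                main = "Frequency Distribution",
                xlab = "Country of Origin", ylab = "Frequency")
text(x = mids, y = as.vector(freq), labels = as.vector(freq), pos = 3)
```

Setting ylim slightly above the largest count leaves room for the labels.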
Pie Charts
• We can also use a pie chart to display the frequency distribution or relative frequency. Enter the following lines:
> pie(freq, main = 'Frequency Distribution\nCountries of Origin')
> pie(freq, main = 'Frequency Distribution\nCountries of Origin', labels = paste(names(freq), ':', freq))
– In the above, \n is the newline character, which starts a new line in the title.
Presenting Qualitative Variables: Summary
• Frequency distribution (table()) and Relative frequency distribution (prop.table())
• Visualization
– Bar plot (bar graph): barplot()
– Pie chart: pie()
Example 2.2: Hours of TV Viewing
• In this example, we will use the TV viewing data (tv_hours_data.csv) to demonstrate how to present numerical variables.
• Importing Data
> tv_hours <- read.csv('./Data/tv_hours_data.csv')
> str(tv_hours) # to get an overview of the data structure
> head(tv_hours, 10) # to view the first 10 observations
• Despite being presented in ascending order, one cannot gain much insight into television-viewing patterns by inspecting the data in their present form.
Binning and Tabulation
• Binning groups a number of continuous values into a smaller number of “bins” (categories). Once binning is done, we can count the frequency of each bin.
– Requires choosing the number and width of classes
– Use cut() to break the variable into different bins
– The relative frequency distribution can then be calculated
> tv_hours$bin <- cut(tv_hours$hours, seq(0, 25, 5)) # classifying viewing time into one of the 5 categories
> tv_hours.freq <- table(tv_hours$bin) # counting the frequency of each bin
> tv_hours.rel.freq <- prop.table(tv_hours.freq) # converting freq into relative freq
> tv_hours.table <- cbind(tv_hours.freq, tv_hours.rel.freq)
> barplot(tv_hours.rel.freq)
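How cut() treats values that fall exactly on a break point matters when tabulating. The sketch below uses a small made-up vector (not the tv_hours data) to show the default behaviour and two documented options, include.lowest and right.

```r
# Hypothetical viewing hours; the real tv_hours data are not reproduced here.
hours <- c(0, 2.5, 5, 7, 10, 12.5, 18, 25)

# Default: intervals are open on the left, so a value of exactly 0 becomes NA.
bin_default <- cut(hours, breaks = seq(0, 25, 5))
table(bin_default, useNA = "ifany")

# include.lowest = TRUE closes the first interval on the left: [0,5].
bin_closed <- cut(hours, breaks = seq(0, 25, 5), include.lowest = TRUE)
table(bin_closed)

# right = FALSE switches to left-closed intervals: [0,5), [5,10), ...
bin_left <- cut(hours, breaks = seq(0, 25, 5), right = FALSE,
                include.lowest = TRUE)
table(bin_left)
```

If the data contain no values equal to the lowest break, the default call and the include.lowest version give the same counts.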
• Histogram – partitions data into classes on the horizontal axis while the heights of the bars above the axis indicate the frequency for each class
– Use hist() function to generate histogram
– Suggest the number of classes with breaks= (R treats a single number as a suggestion only)
– Horizontal axis label is xlab=
– Range of values assumed by vertical axis is ylim=c()
– Title of graph is main=
– Specify colours with col=c()
> hist(tv_hours$hours, main = 'Hours of TV Viewing', xlab = 'Hours') # histogram with automatic binning
> hist(tv_hours$hours, breaks = 5, main = 'Hours of TV Viewing', xlab = 'Hours') # histogram with suggested bins = 5
> hist(tv_hours$hours, breaks = 5, probability = TRUE, main = 'Hours of TV Viewing', xlab = 'Hours') # showing 'density' in y-axis
Example 2.2: The Results
[Figure: histograms of TV viewing hours. Recovered panel labels: automatic binning; bins = 5, frequency; relative frequency]
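With probability = TRUE, hist() plots density rather than relative frequency; the two differ by the class width. The sketch below uses made-up hours and plot = FALSE to inspect the numbers behind a histogram without drawing it.

```r
# Hypothetical viewing hours, used only to illustrate the arithmetic.
hours <- c(1, 3, 4, 6, 7, 8, 9, 11, 12, 14, 16, 18, 21, 24)

h <- hist(hours, breaks = seq(0, 25, 5), plot = FALSE)
h$counts    # frequency in each class
h$density   # density in each class

# Relative frequency = density * class width.
width <- diff(h$breaks)
rel_freq <- h$density * width
all.equal(rel_freq, h$counts / sum(h$counts))
sum(rel_freq)
```

Passing a vector of break points, as here, fixes the classes exactly, whereas a single number such as breaks = 5 is only a suggestion.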
Presenting Quantitative Variables: Summary
• Binning and Tabulation
– Quantitative → Qualitative
– Frequency (table()) and relative frequency (prop.table())
– Bar plot (barplot()), pie chart (pie())
• Histogram (hist())
– breaks = 5
– probability = TRUE?
Bivariate Relationships
• Very often, we would like to be able to capture something about the relationship between two variables taken together.
• Two methods:
– Tabular: cross-tabulation
– Graphical: scatter plot
Cross-Tabulations
• Cross-tabulation (contingency) table – tabular method for summarizing the relationship between two nominal variables simultaneously.
• A flexible approach that can be used (after changing quantitative variables to qualitative) when
– both variables are qualitative,
– both are quantitative, or
– when one is qualitative and one is quantitative
• Cross-tabulations in R
– Use table() to create cross-tabulation
• Use function names() to identify the variable names
• Use cut() to break a quantitative variable into several classes
• Can make improvements by:
– Including the column and row totals using rowSums(), cbind(), colSums(), and rbind()
– Renaming the columns and rows using rownames() and colnames()
– Joint Relative Frequencies: prop.table()
Example 2.3: Diamonds
• In this example, we will use the diamonds data in the ggplot2 package (which is part of the tidyverse package).
> library(tidyverse) # You may have to install it first.
> str(diamonds) # to get an overview of the data set
• It appears that there are 3 categorical variables (data type = ordered factor) in the data set:
– cut
– color
– clarity
• Enter help(diamonds) or ?diamonds to see the description of the data set and the definition of each variable.
Example 2.3 (Continued): Cross Tabulations
• Joint Frequencies
> table(diamonds$cut, diamonds$color) # Joint freq of cut and color (bottom left table)
> table(diamonds$cut, diamonds$color, diamonds$clarity) # Joint freq of cut, color, and clarity
• Joint Relative Frequencies
> prop.table(table(diamonds$cut, diamonds$color)) # Joint relative freq of color and cut (i.e., pct of overall total)
> prop.table(table(diamonds$cut, diamonds$color), 1) # Relative freq of color given cut (i.e., pct of row total)
> prop.table(table(diamonds$cut, diamonds$color), 2) # Relative freq of cut given color (i.e., pct of column total)
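The effect of the second argument of prop.table() is easier to see on a tiny table. The counts below are invented for illustration and are not taken from the diamonds data.

```r
# Hypothetical 2 x 2 table of counts (not taken from the diamonds data).
tab <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(cut = c("Good", "Ideal"),
                              color = c("D", "E")))
tab <- as.table(tab)

prop.table(tab)      # joint: all cells sum to 1
prop.table(tab, 1)   # conditional on the row: each row sums to 1
prop.table(tab, 2)   # conditional on the column: each column sums to 1

rowSums(prop.table(tab, 1))
colSums(prop.table(tab, 2))
```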
A More Advanced Function
• R command:
– Use CrossTable() function in the gmodels package
> install.packages("gmodels")
> library(gmodels)
> CrossTable(x = diamonds$cut, y = diamonds$color)
Example 2.4: Student Satisfaction Rating
• In this example, we will use the student satisfaction rating data (students_data.csv) to show how we can compare two variables with a cross tabulation
R Codes for Example 2.4
• Importing data and preliminary work
> students <- read.csv('./Data/students_data.csv')
> str(students) # viewing the overall structure of the data
> students$rating.cat <- cut(students$rating, c(0, 2, 4, 6, 8)) # classing rating data into 4 bins and assigning the result to a new variable
> students$year <- factor(students$year,
    levels = c('freshmen', 'sophomores', 'juniors', 'seniors'),
    labels = c('Freshmen', 'Sophomores', 'Juniors', 'Seniors'))
# converting year into factor with defined levels and labels to ensure study year appears in proper order
• Cross tabulation
> students.tbl <- table(students$year, students$rating.cat)
> students.tbl # This table has no total rows and columns
> students.tbl2 <- cbind(students.tbl, Total = rowSums(students.tbl)) # adding row totals as the last column
> students.tbl3 <- rbind(students.tbl2, Total = colSums(students.tbl2)) # adding col totals as the last row
> students.tbl3 # This table includes total rows and columns
R Codes for Example 2.4 (Continued)
• Showing bins with text labels
> students$rating.lab <- factor(students$rating.cat, labels = c('Poor', 'Below Avg', 'Above Avg', 'Excellent'))
> students.tbl4 <- table(students$year, students$rating.lab)
> students.tbl4 # This table has no total rows and columns
> students.tbl5 <- cbind(students.tbl4, Total = rowSums(students.tbl4))
> students.tbl5
> students.tbl6 <- rbind(students.tbl5, Total = colSums(students.tbl5))
> students.tbl6 # This table includes total rows and columns
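As an alternative to the rowSums()/cbind() and colSums()/rbind() steps, base R also provides addmargins(), which appends both totals in one call. The sketch below uses a small invented data frame standing in for students_data.csv, whose contents are not reproduced here.

```r
# Hypothetical data standing in for students_data.csv.
students <- data.frame(
  year   = factor(c("Freshmen", "Freshmen", "Seniors", "Juniors",
                    "Sophomores", "Seniors"),
                  levels = c("Freshmen", "Sophomores", "Juniors", "Seniors")),
  rating = c(1.5, 3.0, 7.2, 5.1, 3.9, 6.4)
)
students$rating.cat <- cut(students$rating, c(0, 2, 4, 6, 8))

tbl <- table(students$year, students$rating.cat)
tot <- addmargins(tbl)   # appends a "Sum" row and a "Sum" column
tot
```

The margins are labelled Sum rather than Total; the manual approach remains useful when custom labels are wanted.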
Simpson’s Paradox
• Data in two or more cross tabulations are often aggregated to produce a summary cross tabulation;
• We must be careful in drawing conclusions about the relationship between the two variables in the aggregated cross tabulation;
• In some cases the conclusions based upon an aggregated cross tabulation can be completely reversed if we look at the unaggregated data;
• The reversal of conclusions based on aggregate and unaggregated data is called Simpson’s Paradox
Puzzling Statistics: Example
• The following table shows the batting averages of two “switch hitters” in 1991, Eddie Murray (LA Dodgers) and Orlando Merced (Pitts. Pirates). Who was the more valuable player, with respect to batting statistics, in 1991? How could Murray beat Merced both as a left-handed and a right-handed batter and still have a lower batting average?
– Batting Average = number of hits divided by the number of at bats.
– Eddie is 35/100 when hitting left-handed and 15/75 when hitting right-handed. Hence, his overall batting average is 50/175.
– Orlando is 34/100 when hitting left-handed and 7/40 when hitting right-handed. Thus, his overall batting average is 41/140.
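Batting Average    Eddie Murray       Orlando Merced
Left-hand          35/100 = 0.350     34/100 = 0.340
Right-hand         15/75  = 0.200     7/40   = 0.175
Overall            50/175 ≈ 0.286     41/140 ≈ 0.293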
Why Simpson’s Paradox?
• $0.35 > 0.34$;
• $0.20 > 0.175$;
• Is this true:
– for any $0 \le q_1 \le 1$ and $0 \le q_2 \le 1$ (note that $q_1$ and $q_2$ can be different),
$$0.35\,q_1 + 0.2\,(1 - q_1) \ge 0.34\,q_2 + 0.175\,(1 - q_2)\,?$$
– In this example, $q_1 = 100/175$ and $q_2 = 100/140$
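The question can be answered numerically with the figures from the example. In the sketch below the object names (hits, at_bats, avg, q, overall) are chosen here for illustration; the numbers are the ones quoted above.

```r
# Hits and at bats from the example, by batting side.
hits    <- rbind(Murray = c(left = 35, right = 15),
                 Merced = c(left = 34, right = 7))
at_bats <- rbind(Murray = c(left = 100, right = 75),
                 Merced = c(left = 100, right = 40))

avg <- hits / at_bats                        # per-side batting averages
avg

q <- at_bats[, "left"] / rowSums(at_bats)    # share of at bats taken left-handed
overall <- rowSums(hits) / rowSums(at_bats)  # overall batting averages
overall

# The overall average is the weighted mix q * left + (1 - q) * right.
all.equal(overall, q * avg[, "left"] + (1 - q) * avg[, "right"])
```

Murray leads on each side, yet Merced takes a larger share of his at bats from the stronger left side (a larger q), so the inequality does not hold for every pair of weights and the overall ranking reverses.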
Scatter Plots
• Scatter plot (or scatter diagram) – a graphical presentation of the association between two quantitative variables.
– One variable is shown on the horizontal axis, and
– the other on the vertical axis.
– The general pattern of the plotted points suggests the nature of the relationship between the variables
– plot(x = Vec1, y = Vec2, main = "Title", xlab = "X Lab", ylab = "Y Lab")
• Caution: Any relationships indicated by the plot do not mean that either variable has caused the other! By themselves, statistical methods do not establish causality
Example 2.5: Poverty Data
• In this example, we will use the poverty data (poverty_data.csv) to learn how to examine the relationship between a pair of variables with a scatter plot.
• Enter and execute the following lines to import data:
> poverty <- read.csv('./Data/poverty_data.csv')
> str(poverty) # to view the data structure
Scatter Plots of Example 2.5
• A Single Scatter Plot: Poverty vs. Obesity
> plot(poverty$Poverty, poverty$Obesity, xlab = 'Percent in Poverty', ylab = 'Percent Designated Obese')
# See the chart on the bottom. Try other pairs of variables yourself.
• A Matrix of Scatter Plots
> plot(poverty[,2:5]) # generating scatter plots for all pairs of numerical variables
R Formula Syntax
• a ~ b: explain a with b
• a ~ .: explain a with all available variables
• a ~ . - b: explain a with all available variables except b
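Formulas can be passed to plot() and can also be inspected on their own. The sketch below uses a made-up data frame (the column names x1, x2 and y are hypothetical) to show which variables each form of the formula refers to.

```r
# Hypothetical data frame; column names are made up for illustration.
df <- data.frame(x1 = c(1, 2, 3, 4, 5),
                 x2 = c(2, 1, 4, 3, 5),
                 y  = c(2.1, 3.9, 6.2, 7.8, 10.1))

plot(y ~ x1, data = df)   # y on the vertical axis, x1 on the horizontal axis

# The formulas can be inspected without fitting or plotting anything.
all.vars(y ~ x1)                                     # variables named in y ~ x1
attr(terms(y ~ ., data = df), "term.labels")         # "." expands to every other column
attr(terms(y ~ . - x2, data = df), "term.labels")    # "- x2" removes x2 again
```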
Matrix of Scatter Plots
How is each pair of variables related? Can you give some plausible explanations of why the variables are related in the observed ways?
Conditioning Plots
• Instead of making a scatter plot of one variable against another variable for the entire data set, we can also make scatter plots by the levels of a categorical variable.
> coplot(lake_city$Price~lake_city$SqFt | lake_city$Brick)
Example 2.6: House Prices in Lake City.
• Let us use the dataset ‘lake_city.csv’, which contains data on 128 recently sold houses in Lake City.
• First, import the data:
> lake_city <- read.csv('./Data/lake_city.csv')
# Depending on where you save this file, your path may be different.
> str(lake_city) # viewing the overall structure
• Scatter plot of Price against SqFt using the entire data set:
> plot(lake_city$SqFt, lake_city$Price) # SqFt on the horizontal axis, Price on the vertical axis
• Scatter plot of Price against SqFt by house type (brick or non-brick)
> coplot(lake_city$Price~lake_city$SqFt | lake_city$Brick)
• Scatter plot of Price against SqFt by neighborhood
> coplot(lake_city$Price~lake_city$SqFt | lake_city$Nbhd)
Conditional Scatter Plots of Example 2.6
• When there are more than two factor levels, the two bottom-most graphs correspond to the two left-most factor levels.
• The two graphs above the bottom row correspond to the next two factor levels, and so on.
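For readers without the data file, the same conditioning plots can be tried on simulated data. Everything in the sketch below (the houses data frame, its values, and the price formula) is invented for illustration; only the column names echo those used in Example 2.6.

```r
# Simulated stand-in for lake_city.csv; all values are made up.
set.seed(1)
houses <- data.frame(
  SqFt  = round(runif(60, 1500, 2600)),
  Brick = factor(sample(c("No", "Yes"), 60, replace = TRUE)),
  Nbhd  = factor(sample(c("East", "North", "West"), 60, replace = TRUE))
)
houses$Price <- 50 * houses$SqFt + 15000 * (houses$Brick == "Yes") +
  rnorm(60, sd = 8000)

# One panel per level of Brick; the data argument avoids repeating houses$.
coplot(Price ~ SqFt | Brick, data = houses)

# One panel per neighborhood.
coplot(Price ~ SqFt | Nbhd, data = houses)
```

Storing the conditioning variable as a factor makes the panel order follow the factor levels.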
Bivariate Relationships: Summary
• Cross Tabulations
– table()
– prop.table()
– CrossTable() function in the gmodels package
• Scatter Plots
– plot()
– coplot()