DSME5110F: Statistical Analysis
Lecture 3 Descriptive Statistics: Numerical Methods
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case
Measures of Central Tendency
• Mean: the average of the values in a dataset
– Population Mean: μ = (Σᵢ xᵢ)/N, where N is the population size.
• "Σᵢ xᵢ" means "the sum of all the xᵢ's"
– Sample Mean: x̄ = (Σᵢ xᵢ)/n, where n is the sample size.
– R Function: mean()
• Median: the middle value when all numbers are sorted in either ascending or descending order.
– Depending on the number of observations:
• Odd — middle value;
• Even – the mean of the two middle values
– R Function: median()
• Why two measures of central tendency?
– They are affected differently by values falling at the far ends of the range
– The mean is highly sensitive to outliers (values that are atypically high or low in relation to the majority of data)
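A minimal sketch with made-up numbers illustrates the point: a single outlier pulls the mean far more than the median.

```r
# Made-up values: the mean and median start close together.
x <- c(20, 22, 23, 25, 26)
mean(x)                    # 23.2
median(x)                  # 23

# Add one atypically high value (an outlier).
x_out <- c(x, 200)
mean(x_out)                # jumps to about 52.67
median(x_out)              # only moves to 24
```

This is why both measures are reported: the median is robust to outliers, while the mean reflects every value.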
Example 3.1: House Prices of Lake City
• We will use the Lake City housing market data discussed in Example 2.6 of the previous lecture.
• For the houses included in the data set, we want to calculate the following:
– the overall mean price
– the mean price of each neighborhood
– the median price for brick and non-brick houses, respectively
• First, import the data:
> lake_city <- read.csv('./Data/lake_city.csv') # depending on where you saved this file, your path may be different
> str(lake_city) # viewing the overall structure
Example 3.1: House Prices of Lake City
• Calculating the overall mean is simple. Just execute the following R code:
> mean_price <- mean(lake_city$Price) # calculating the mean (avoid naming the result 'mean', which would mask the built-in function)
• Calculating the mean for each neighborhood is slightly more complicated: you need to filter the data for each neighborhood before applying the function. Here is how to calculate the mean for each neighborhood one by one:
> mean(lake_city$Price[lake_city$Nbhd == 'A'])
> mean(lake_city$Price[lake_city$Nbhd == 'B'])
> mean(lake_city$Price[lake_city$Nbhd == 'C'])
Example 3.1: House Prices of Lake City
The method used in the previous slide can be quite tedious when the categorical variable has many levels.
The task can be made easier by using the tapply() function as follows:
> attach(lake_city) # attach the data so that its column names are in the search path
> tapply(Price, Nbhd, mean) # calculating the mean price of each neighborhood
> tapply(Price, Brick, mean) # calculating the mean price of brick and non-brick houses
> tapply(Price, list(Nbhd, Brick), mean) # calculating the mean price for each combination of Nbhd and Brick levels
> detach(lake_city)
• On average, which is the most expensive neighborhood?
• Are brick houses more expensive than non-brick houses on average?
• Which combination of neighborhood and brick level is the most expensive on average?
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case
Measures of Location: Percentile and Quartile
• Percentile: The 𝑝𝑝-th percentile of a data set is a value such that at least 𝑝𝑝% of data values are less than or equal to it and at least (100 − 𝑝𝑝)% are greater than or equal to it;
– Provide insight into how data are distributed over their entire range
– R function: quantile(data, c())
• Quartile: special cases of percentile that divide the data into four groups, each with (approximately) a quarter of all observations.
– There are three quartiles, denoted Q₁, Q₂, and Q₃, corresponding to the 25th, 50th (median), and 75th percentiles, respectively.
– R function: quantile()
• Inter-quartile range: IQR = Q₃ − Q₁
– R function: IQR()
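A short sketch with made-up data shows the three functions above in action:

```r
# Made-up data values.
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)

# Arbitrary percentiles: pass the desired probabilities to quantile().
quantile(x, c(0.10, 0.80))    # 10th and 80th percentiles

# With no probabilities, quantile() returns the five-number summary:
quantile(x)                   # min, Q1 = 7, median = 12, Q3 = 14, max

# The inter-quartile range, Q3 - Q1:
IQR(x)                        # 7
```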
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case
• Box plot (or box-whisker plot): a common visualization of the five-number summary
– 5 values: the minimum, Q₁ (1st quartile), the median (dark line), Q₃ (3rd quartile), and the maximum
– The lines forming the box represent Q₁, the median (dark line), and Q₃
– The minimum and maximum values are illustrated by the whiskers that extend below and above the box
• A widely used convention: the whiskers extend at most 1.5 times the IQR below Q₁ or above Q₃
• Any values that fall beyond this threshold are considered outliers and are drawn as circles or dots
• Allows one to quickly obtain a sense of the range
• When two or more box plots are displayed side by side, we can also compare the distributions of the data sets.
• Very useful when we want to show the data distribution graphically.
Example 3.1: House Prices of Lake City
• R function: boxplot()
> boxplot(lake_city$Price)
• We can also make boxplots for the interaction of two or more categorical variables.
– To make side-by-side boxplots for the price distributions of the three neighborhoods (A, B, and C), use the following line:
> boxplot(lake_city$Price ~ lake_city$Nbhd)
– To make side-by-side boxplots for the price distributions of brick and non-brick houses, use the following line:
> boxplot(lake_city$Price ~ lake_city$Brick)
– To make side-by-side boxplots for the price distributions of all levels of the interaction between Nbhd and Brick, use the following line:
> boxplot(lake_city$Price ~ lake_city$Nbhd + lake_city$Brick)
• Check outliers
> boxplot(lake_city$Price)$out
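A small sketch of the outlier rule on made-up data (since the lake_city file may not be at hand). boxplot.stats() applies the same 1.5 × IQR whisker convention as boxplot() without drawing anything:

```r
# Made-up data: 40 lies far above the rest of the values.
x <- c(10, 12, 13, 14, 15, 16, 40)

# boxplot.stats() computes the box-plot statistics without plotting.
stats <- boxplot.stats(x)
stats$out        # values beyond the 1.5 * IQR whiskers: 40
stats$stats      # whisker ends, hinges, and median
```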
R Formula Syntax
• a ~ b: explain a with b
• a ~ .: explain a with all available variables
• a ~ . - b: explain a with all available variables except b
Example 3.1: Boxplots
• Based on the boxplots below, what type of houses seem to be more expensive?
Tentative Summary
• Central Tendency – Mean: mean()
– Median: median()
• Location
– Percentile and Quartile: quantile()
• Box Plot: boxplot()
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case
Measures of Dispersion
• Range and Inter-quartile Range (IQR)
• Variance: measures "the average squared distance to the mean"
– Population Variance: σ² = Σ(xᵢ − μ)²/N
• R doesn't have a built-in function for the population variance, but we can easily calculate it from the sample variance function, var():
• population variance = var() * (N − 1)/N
– Sample Variance: s² = Σ(xᵢ − x̄)²/(n − 1)
• R Function: var()
• Standard Deviation: the square root of the variance
– Population Standard Deviation: σ = √σ²
• R doesn't have a built-in function for the population standard deviation, but we can easily calculate it from the sample standard deviation function, sd():
• population standard deviation = sd() * sqrt((N − 1)/N)
– Sample Standard Deviation: s = √s²
• R Function: sd()
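The adjustment from the sample to the population formulas can be sketched as follows, treating a small made-up vector as the entire population:

```r
# Made-up population of N = 8 values (mean is 5).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(x)

var(x)                            # sample variance: divides by n - 1

# Rescale to the population formulas, which divide by N.
pop_var <- var(x) * (N - 1) / N   # 4
pop_sd  <- sd(x) * sqrt((N - 1) / N)  # 2
```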
Calculating Statistics
• To calculate other statistics (e.g., median, sd, var, etc.), simply replace ‘mean’ by the name of the function that you want to use in all the examples discussed previously.
• We can also treat the function name as a variable. In the code below, the names of the available functions are stored in a vector; choosing the number corresponding to a function's position in the vector selects that function. Then execute whichever of the tapply() lines below matches your task.
> stats <- c('mean', 'median', 'sd', 'var') # enter the function names into a vector
> i <- 3 # choose a number corresponding to the statistics above
> stats.name <- stats[i] # assign the function name to stats.name
> tapply(lake_city$Price, lake_city$Nbhd, stats.name)
> tapply(lake_city$Price, lake_city$Brick, stats.name)
> tapply(lake_city$Price, list(lake_city$Nbhd, lake_city$Brick), stats.name)
Calculating Multiple Statistics with summary()
• We can use the summary() function to get a list of summary statistics at the same time.
> Nbhd_A <- summary(lake_city$Price[lake_city$Nbhd == 'A'])
> Nbhd_B <- summary(lake_city$Price[lake_city$Nbhd == 'B'])
> Nbhd_C <- summary(lake_city$Price[lake_city$Nbhd == 'C'])
> price.summary3 <- rbind(Nbhd_A, Nbhd_B, Nbhd_C) # combining rows
> rownames(price.summary3) <- c('Nbhd_A', 'Nbhd_B', 'Nbhd_C') # assigning row names
> price.summary3
• The result includes six key summary statistics.
Example 3.2: Mean-Variance Criterion for Choosing Investment Alternatives
• In financial and investment decisions, variance or standard deviation is often used to measure the risk of an alternative. The larger the variance of investment return, the higher the risk of the investment.
• Generally speaking, investors prefer alternatives with higher mean return and smaller variance of return.
• The estimated means and variances of returns of 10 investments are given in the file: investment.csv.
• Examine the data and decide which investments should NOT be chosen based on mean and variance criterion.
Example 3.2: The Result
• The mean return is plotted against the variance of return as shown in the graph. The R code that generates the plot is shown at the bottom.
• An investment is dominated by any investment that lies to its south-eastern corner. The reason is simple: if Point A lies to the south-east of Point B, then A must have a higher mean and lower variance than B. Hence, A is preferred to B.
• Based on this, except Investments H, I, G, J, and E, all other investments are dominated by at least one other investment and hence can be eliminated from consideration. For example, B is dominated by G. In fact, Investment I can also be eliminated by constructing a portfolio that consists of H and G.
Tentative Summary
• CentralTendency – Mean: mean()
– Median: median()
• Location
– Percentile and Quartile: quantile()
• Box Plot: boxplot()
• Variability
– Variance: var()
– Standard deviation: sd()
• summary()
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case
Measure of Relative Location: z-Score
• z-score (standardized score) – indicates the relative position of a data value with respect to the data set of which it is a member
– calculated using the formula: zᵢ = (xᵢ − x̄)/s
– measures “the number of standard deviations that a given value x is above or below the mean”.
• Regardless of the mean and variance of the original data set, after all values have been transformed into z, the mean and variance of z score are 0 and 1, respectively.
• R function: scale()
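A quick sketch of scale() on made-up data, confirming that the standardized values have mean 0 and standard deviation 1:

```r
# Made-up data values.
x <- c(4, 8, 6, 10, 12)

# scale() computes (x - mean(x)) / sd(x) for each value.
z <- scale(x)

mean(z)     # 0 (up to floating-point rounding)
sd(z)       # 1
```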
Empirical Rule
• For normally distributed (bell-shaped) data, approximately
– 68% of the data values lie within ± 1 standard deviation from the mean,
– 95% of the data values lie within ± 2 standard deviations from the mean, and
– 99.7% of the data values lie within ± 3 standard deviations from the mean.
• Can identify outliers of normal distributions by standardizing data and flagging any z-scores less than −3 or greater than +3
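The empirical rule and the z-score outlier flag can be checked on simulated normal data (a sketch; the seed and sample size are arbitrary):

```r
# Simulate normally distributed data.
set.seed(1)
x <- rnorm(10000, mean = 50, sd = 5)
z <- scale(x)

mean(abs(z) <= 1)   # close to 0.68
mean(abs(z) <= 2)   # close to 0.95
mean(abs(z) <= 3)   # close to 0.997

x[abs(z) > 3]       # values flagged as potential outliers
```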
Example 3.3: Household Income
The file income.csv contains annual household income data collected from 1020 households in 2012.
Answer the following questions:
1. What are the mean and standard deviation of household income?
2. 80% of household income are less than or equal to (≤) what amount?
3. 70% of household income are greater than or equal to (≥) what amount?
4. Convert all household income into standardized z-scores. What percentages of families have z-scores between -1 and 1, between -2 and 2, and between -3 and 3? Are the numbers consistent with the empirical rule mentioned earlier?
Chebyshev’s Theorem
• For any set of data values, at least 100(1 − 1/k²)% of the values must lie within ±k standard deviations from the mean, where k is any value greater than 1.
– k need not be an integer.
– If k ≤ 1, then 1 − 1/k² ≤ 0; thus Chebyshev's theorem requires k > 1.
• The average height of an HK male adult is 170 cm with a standard deviation of 7 cm.
– Suppose heights are approximately normally distributed. If so, what percentage of HK male adults have heights between 156 and 184 cm?
– Without knowing the actual distribution of heights, at least what percentage of HK male adults have heights between 156 and 184 cm?
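The two questions above can be worked through directly, using the stated mean of 170 and standard deviation of 7:

```r
mu <- 170; sigma <- 7
k <- (184 - mu) / sigma        # the interval [156, 184] is mu +/- 2 sd, so k = 2

# Under (approximate) normality, use the normal CDF:
normal_pct <- pnorm(184, mu, sigma) - pnorm(156, mu, sigma)
round(normal_pct, 3)           # about 0.954, matching the empirical rule

# Without any distributional assumption, Chebyshev's bound:
chebyshev_lb <- 1 - 1 / k^2    # at least 0.75
```

Note how much weaker the distribution-free guarantee is: at least 75% versus roughly 95% under normality.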
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case
Measure of Association: Covariance
• Covariance: a measure of the linear relationship between two numerical variables.
– The sample covariance of X and Y is: Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ)/(n − 1)
– R function: cov(X, Y)
– Positive value = positive linear relationship
– Negative value = negative linear relationship
– 0 = no linear relationship between the variables
• The limitation of covariance as a descriptive measure is that it is affected by the units in which the two variables are measured. As such, covariance can only measure the direction, but not the strength, of a linear relationship between two numerical variables.
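The unit dependence is easy to demonstrate with made-up data: measuring the same heights in centimeters instead of meters multiplies the covariance by 100, even though the relationship is unchanged.

```r
# Made-up data: heights and weights of four people.
x_m <- c(1.5, 1.6, 1.7, 1.8)   # heights in meters
y   <- c(50, 55, 63, 70)       # weights in kg

cov(x_m, y)                    # positive: taller people tend to weigh more

x_cm <- x_m * 100              # the same heights in centimeters
cov(x_cm, y)                   # 100 times larger, same relationship
```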
Measure of Association: Correlation Coefficient
• Correlation coefficient – numerical measure of linear association between two numerical variables that indicates not only whether the association is positive or negative, but also how strong that association might be
• The sample correlation coefficient is defined as:
Corr(X, Y) = Cov(X, Y)/(sX × sY)
• where sX is the standard deviation of X, and sY is the standard deviation of Y
• R function: cor(X, Y)
• It is independent of the measurement units used.
• It is always between -1 and +1
– With -1 indicating a perfect negative linear relationship
– With +1 indicating a perfect positive linear relationship
– With 0 indicating no linear relationship.
• The correlation coefficient between a pair of variables, X and Y, is often used together with the scatter plot of X against Y to help understand the relationship between X and Y.
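Reusing the made-up height/weight data from the covariance sketch, cor() shows both properties at once: it is unaffected by the measurement units and bounded between -1 and +1.

```r
# Made-up data: heights (m) and weights (kg) of four people.
x_m <- c(1.5, 1.6, 1.7, 1.8)
y   <- c(50, 55, 63, 70)

cor(x_m, y)          # close to 1: strong positive association
cor(x_m * 100, y)    # identical: changing units (cm) has no effect
cor(x_m, -y)         # sign flips for a negative relationship
```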
Cohen’s Rules of Thumb
• For correlations between variables describing people (|r| is the absolute value of r, the correlation coefficient)
– |r| between 0.1 and 0.3 should be considered a small or weak association
– |r| between 0.3 and 0.5 might be considered medium in strength
– |r| above 0.5 could be considered large or strong
• Note:
– Under the assumption that the variables are normally distributed (or are approximately so)
– If the variables are not normal, then it can be helpful to transform the variables to normal distributions before interpreting
– The above thresholds may be too lax for some purposes: the correlations must be interpreted in context. For data involving human beings, a correlation of 0.5 may be considered extremely high, while for data generated by mechanical processes, a correlation of 0.5 may be weak.
Example 3.4: Poll
• In this example, we will use the polling data. The data set contains information on 16 voters about the following:
– x₁: their age,
– x₂: annual income,
– x₃: view of same-sex marriage,
– x₄: average amount of weekly television viewing (in hours),
– x₅: average amount of weekly television viewing (in minutes).
• Calculate the following:
– the covariance between each possible pair of variables from x₁ to x₅.
– the correlation coefficient between each possible pair of variables from x₁ to x₅.
– the scatter plot between each possible pair of variables from x₁ to x₅.
• Discuss the relationships between all possible pairs of variables.
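All pairwise covariances, correlations, and scatter plots can be produced in one call each. A sketch on simulated stand-in data (the poll file's numerical structure, with made-up values; x3 is omitted here since it is categorical):

```r
# Simulated stand-in for the poll data (16 voters).
set.seed(2)
df <- data.frame(x1 = rnorm(16, 40, 10))          # age
df$x2 <- 800 * df$x1 + rnorm(16, 0, 3000)          # income, related to age
df$x4 <- rnorm(16, 10, 3)                          # weekly TV (hours)
df$x5 <- df$x4 * 60                                # the same TV time in minutes

cov(df)     # covariance matrix: every pairwise covariance
cor(df)     # correlation matrix: cor(x4, x5) is exactly 1
pairs(df)   # scatter plot of every pair of variables
```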
Example 3.4: Poll
• x1 is positively related to x2 and negatively related to x3.
• x2 is negatively related to x3.
• x4 and x5 have a perfect positive linear relationship (of course).
• The relationships for all other pairs are insignificant.
Correlation Matrices
Note on Interpreting Correlation Coefficient
• Here are a few things to remember when using correlation coefficient to interpret the relation between two variables.
– Correlation coefficient can only be used to measure linear (straight line) relationship. It cannot measure non-linear relationship properly.
– Correlation does not imply a causal relationship. A high correlation between A and B does not mean "A causes B" or "B causes A".
– Correlation does not necessarily mean a "direct relation". A high correlation between A and B may be caused by a third, unknown factor.
Puzzling Statistics: Example
• One study published in a prominent medical journal (name of the article not cited) showed a strong positive correlation between per capita consumption of tobacco and the incidence of lung cancer over a number of countries. The author then concluded that
– “Smoking causes cancer”.
• Another researcher used the same data on per capita consumption of tobacco for the same countries but substituted the incidence rate of cholera. He obtained a negative correlation that was stronger than the positive correlation revealed in the first paper. This author then concluded that
– “Smoking prevents cholera”.
• He sent his paper to the journal that published the first paper, and the paper was rejected.
Reference: How to Lie with Statistics, W. W. Norton & Company Inc., 1954.
• Central Tendency – Mean: mean()
– Median: median()
• Location
– Percentile and Quartile: quantile()
• Box Plot: boxplot()
• Variability
– Variance: var()
– Standard deviation: sd()
• summary()
• The z-score: scale()
• Association: cov(), cor(), plot(), corrplot.mixed()