DSME5110F: Statistical Analysis

• An Introduction to Statistics
• Basic Terminology

• An Introduction to R
– Vector
– Data Frame
– Import and Export Data

Puzzling Statistics: Example 1
• The following table shows the batting averages of two switch hitters in 1991, Eddie Murray (LA Dodgers) and Orlando Merced (Pitts. Pirates). Who was the more valuable player, with respect to batting statistics, in 1991?
– Batting average = number of hits divided by the number of at bats.
– Eddie went 35/100 batting left-handed and 15/75 batting right-handed. Hence, his overall batting average is 50/175.
– Orlando went 34/100 batting left-handed and 7/40 batting right-handed. Thus, his overall batting average is 41/140.

Batting Average   Murray            Merced
Left-hand         35/100 = 0.350    34/100 = 0.340
Right-hand        15/75  = 0.200    7/40   = 0.175
Overall           50/175 ≈ 0.286    41/140 ≈ 0.293

• How could Murray beat Merced both as a left-handed and a right-handed batter and still have a lower batting average?

Puzzling Statistics: Example 2
• One study published in a prominent medical journal (name of the article not cited) showed a strong positive correlation between per capita consumption of tobacco and the incidence of lung cancer over a number of countries. The author then concluded that
– “Smoking causes cancer”.
• Another researcher used the same data on per capita consumption of tobacco for the same countries but substituted the incidence rate of cholera. He obtained a negative correlation that was stronger than the positive correlation revealed in the first paper. This author then concluded that
– “Smoking prevents cholera”.
• He sent his paper to the journal that had published the first paper, and it was rejected.
Reference: How to Lie with Statistics, W. W. Norton & Company Inc., 1954.

Basic Terminology
• A population consists of all the elements of interest in a study that have some quality (or qualities) in common
– All students enrolled in a class
– All potential voters in a presidential election
– Population parameters are the characteristics of interest in a study. They are constants (but usually unknown).
• A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole
– Randomly select 3 students
– Opinion polls conducted by various institutions such as Gallup and Harris polls.
– Sample statistics are the characteristics of interest derived from a sample rather than a population. They are random variables (their values vary from sample to sample).

Probability and Statistics
• When all the members of a population are known, we can calculate the probability of getting a particular sample.
– If a single card is drawn from a deck of 52 cards, what is the probability it will be a two?
– If you roll a six-sided die, what is the probability you will roll an even number?
– Interestingly, even when we do not know much about the population, we can sometimes still have a very good idea of the probability of getting a particular sample statistic (via the central limit theorem, covered later).
• On the other hand, if we are using sample statistics to draw conclusions about an unknown population's parameters, we are solving a statistical problem.
[Figure: probability reasons from the population to a sample; statistics reasons from a sample back to the population.]

Spreadsheet Format

Lectures 2 & 3
• Classification of Data
• Univariate
– Presenting Qualitative Variables
– Presenting Quantitative Variables
• Bivariate Relationships
– Cross Tabulations
– Scatter Plots
• Measures of Central Tendency
• Measures of Location
• Box Plots
• Measures of Variability
• The z-score: A Measure of Relative Location
• Measure of Association: the Bivariate Case

Univariate
– QuantitativeQualitative
– Frequency (table()) and relative frequency (prop.table())
– Bar plot (barplot()), pie chart (pie()) • Quantitative:
– Histogram (hist())
• Qualitative

Univariate
[Figure: three histograms of the same variable: automatic binning (frequency); bins = 5 (frequency); bins = 5 (relative frequency).]
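
As a minimal sketch of the functions named above, the following R code tabulates a qualitative variable and bins a quantitative one into 5 bins. The vectors `grade` and `scores` and the bin edges are made up purely for illustration; they are not course data.

```r
# Hypothetical data, invented only for illustration
grade  <- c("A", "B", "B", "C", "A", "B", "C", "B")   # qualitative
scores <- c(55, 62, 68, 71, 74, 79, 83, 90)           # quantitative

freq     <- table(grade)        # frequency of each category
rel_freq <- prop.table(freq)    # relative frequency (sums to 1)
print(freq)
print(rel_freq)

# Five equal-width bins; plot = FALSE returns the counts without drawing.
# Drop plot = FALSE to draw the histogram itself.
h <- hist(scores, breaks = seq(50, 100, by = 10), plot = FALSE)
bin_rel_freq <- h$counts / sum(h$counts)   # relative frequency per bin
print(h$counts)
print(bin_rel_freq)
```

Calling `barplot(freq)` or `pie(freq)` would draw the corresponding charts. Note that `hist(..., freq = FALSE)` plots densities, which coincide with relative frequencies only when the bin width is 1.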

Bivariate Relationships
• Cross Tabulations
– table()
– prop.table()
– CrossTable() function in the gmodels package
• Scatter Plots
– plot()
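
A short sketch of a cross tabulation with base R, assuming a made-up survey of two categorical variables (`gender` and `plan` are hypothetical names and values):

```r
# Hypothetical survey responses, for illustration only
gender <- c("F", "M", "F", "M", "F", "M", "F", "M")
plan   <- c("Basic", "Basic", "Premium", "Basic",
            "Premium", "Premium", "Premium", "Basic")

tab       <- table(gender, plan)          # joint counts
cell_prop <- prop.table(tab)              # share of all observations
row_prop  <- prop.table(tab, margin = 1)  # share within each gender
print(tab)
print(row_prop)
```

For two quantitative variables `x` and `y`, `plot(x, y)` draws the scatter plot. `CrossTable()` from gmodels prints a richer table but needs that package installed, so it is not called here.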

Puzzling Statistics: Example
• Recall the batting averages of two switch hitters in 1991, Eddie Murray (LA Dodgers) and Orlando Merced (Pitts. Pirates). Who was the more valuable player, with respect to batting statistics, in 1991? How could Murray beat Merced both as a left-handed and a right-handed batter and still have a lower batting average?
– Batting average = number of hits divided by the number of at bats.
– Eddie went 35/100 batting left-handed and 15/75 batting right-handed. Hence, his overall batting average is 50/175.
– Orlando went 34/100 batting left-handed and 7/40 batting right-handed. Thus, his overall batting average is 41/140.

Why Simpson’s Paradox?
• 0.35 > 0.34;
• 0.20 > 0.175;
• Is the following true for any $0 \le q_1 \le 1$ and $0 \le q_2 \le 1$ (note that $q_1$ and $q_2$ can be different)?
$$0.35\,q_1 + 0.2\,(1 - q_1) \ge 0.34\,q_2 + 0.175\,(1 - q_2)$$
– In this example, $q_1 = 100/175$ and $q_2 = 100/140$.
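
The inequality does not hold in general, because each overall average is a weighted average with its own weights. A small sketch using only the numbers from the example (the variable names are illustrative):

```r
# Batting averages by side, taken from the example
avg_left  <- c(Murray = 35 / 100, Merced = 34 / 100)
avg_right <- c(Murray = 15 / 75,  Merced = 7 / 40)

# q = share of each player's at bats taken left-handed
q <- c(Murray = 100 / 175, Merced = 100 / 140)

# Overall average = weighted average of the two sides
overall <- q * avg_left + (1 - q) * avg_right
print(round(overall, 4))
```

Merced bats left-handed (his stronger side) in a larger share of his at bats, so his weighted average ends up higher even though Murray leads on each side separately.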

Empirical Rule
• For normally distributed (bell-shaped) data, approximately
– 68% of the data values lie within ± 1 standard deviation from the mean,
– 95% of the data values lie within ± 2 standard deviations from the mean, and
– 99.7% of the data values lie within ± 3 standard deviations from the mean.
• Can identify outliers of normal distributions by standardizing data and flagging any z-scores less than −3 or greater than +3
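
The percentages above can be checked with `pnorm()`, and the z-score rule can be sketched on a small data set. The vector `x` is made up for illustration, with one deliberately extreme value.

```r
# Share of a normal distribution within 1, 2, 3 standard deviations
coverage <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
print(round(coverage, 4))

# Hypothetical data: 30 ordinary values plus one extreme value
x <- c(rep(c(9, 10, 11), times = 10), 30)
z <- (x - mean(x)) / sd(x)          # standardize
outliers <- which(abs(z) > 3)       # flag |z| > 3
print(x[outliers])
```

In very small samples no z-score can exceed 3 in absolute value (the maximum is (n − 1)/√n), so this rule is only informative for moderately large n.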

Correlation Matrices

Puzzling Statistics: Example
• One study published in a prominent medical journal (name of the article not cited) showed a strong positive correlation between per capita consumption of tobacco and the incidence of lung cancer over a number of countries. The author then concluded that
– “Smoking causes cancer”.
• Another researcher used the same data on per capita consumption of tobacco for the same countries but substituted the incidence rate of cholera. He obtained a negative correlation that was stronger than the positive correlation revealed in the first paper. This author then concluded that
– “Smoking prevents cholera”.
• He sent his paper to the journal that had published the first paper, and it was rejected.
Reference: How to Lie with Statistics, W. W. Norton & Company Inc., 1954.

Note on Interpreting Correlation Coefficient
• Here are a few things to remember when using the correlation coefficient to interpret the relation between two variables.
– The correlation coefficient only measures linear (straight-line) relationships. It cannot properly measure non-linear relationships.
– Correlation does not imply a causal relationship. High correlation between A and B does not mean “A causes B” or “B causes A”.
– Correlation does not necessarily mean “direct relation”. High correlation between A and B may be caused by a third unknown factor.
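
A tiny illustration of the first point, using made-up values: `y` below is completely determined by `x`, yet the correlation coefficient is zero because the relationship is not linear.

```r
x <- -5:5
y_linear    <- 2 * x + 1   # exact straight-line relationship
y_quadratic <- x^2         # exact, but non-linear, relationship

r_linear    <- cor(x, y_linear)
r_quadratic <- cor(x, y_quadratic)
print(c(linear = r_linear, quadratic = r_quadratic))
```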

Lectures 4 & 5
• Counting Rules
• Probabilities
• Probability Rules
• Conditional Probability
• Association Rule
– Support
– Confidence
• Bayes’ Theorem
– The Monty Hall Problem
– Naïve Bayes Classifier

Example 4.3: Probabilities of Winning Mark 6 Prizes
Mark Six (Chinese: 六合彩) is a lottery game organized by the Jockey Club. The player selects 6 out of 49 numbers from 1 to 49. The winning numbers, which include 6 numbers plus one extra number, are drawn automatically by a lottery machine. The winning probabilities for the prizes are given in the table below, where C(n, k) is the number of ways to choose k items from n and the total number of possible selections is C(49, 6) = 13,983,816.

Prize  Criteria                                             Probability
1st    All 6 drawn numbers                                  1 / C(49,6) = 1 / 13,983,816
2nd    5 out of 6 drawn numbers, plus the extra number      C(6,5) C(1,1) / C(49,6) = 1 / 2,330,636
3rd    5 out of 6 drawn numbers                             C(6,5) C(42,1) / C(49,6) = 1 / 55,491.33
4th    4 out of 6 drawn numbers, plus the extra number      C(6,4) C(1,1) C(42,1) / C(49,6) = 1 / 22,196.53
5th    4 out of 6 drawn numbers                             C(6,4) C(42,2) / C(49,6) = 1 / 1,082.76
6th    3 out of 6 drawn numbers, plus the extra number      C(6,3) C(1,1) C(42,2) / C(49,6) = 1 / 812.07
7th    3 out of 6 drawn numbers                             C(6,3) C(42,3) / C(49,6) = 1 / 60.91
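
The counting behind these probabilities can be sketched with `choose()`. The helper `ways_match()` is a hypothetical name introduced here; it assumes the extra number is one of the 43 numbers not among the 6 drawn, so 42 numbers are neither drawn nor extra.

```r
total <- choose(49, 6)   # number of possible 6-number selections

# Number of tickets matching j of the 6 drawn numbers,
# with or without also containing the extra number
ways_match <- function(j, extra) {
  if (extra) {
    choose(6, j) * 1 * choose(42, 6 - j - 1)   # one slot taken by the extra number
  } else {
    choose(6, j) * choose(42, 6 - j)           # remaining slots avoid the extra number
  }
}

ways <- c(
  "1st" = ways_match(6, FALSE),
  "2nd" = ways_match(5, TRUE),
  "3rd" = ways_match(5, FALSE),
  "4th" = ways_match(4, TRUE),
  "5th" = ways_match(4, FALSE),
  "6th" = ways_match(3, TRUE),
  "7th" = ways_match(3, FALSE)
)

one_in <- total / ways   # "1 in ..." odds for each prize
print(round(one_in, 2))
```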

Association Rules: Frequently Bought Together

Apriori Algorithm
For $p$ products…
1. User sets a minimum support criterion
2. Next, generate the list of one-itemsets that meet the support criterion
– Frequency of the transactions including the item
3. Use the list of one-itemsets to generate the list of two-itemsets that meet the support criterion
– If a certain one-itemset did not exceed the minimum support, any larger itemset that includes it will not exceed the minimum support either
4. Use the list of two-itemsets to generate the list of three-itemsets
5. Continue for $k = 1, 2, \cdots, p$
– In general, generating $k$-itemsets uses the frequent $(k - 1)$-itemsets that were generated in the preceding step
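
The first two passes of this procedure can be sketched in base R. The transactions and the 0.3 minimum support are made up for illustration, and this is a simplified sketch rather than a full Apriori implementation (dedicated packages exist for that).

```r
# Hypothetical transactions, for illustration only
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk", "eggs")
)
min_support <- 0.3
N <- length(transactions)

# Share of transactions containing every item of the itemset
support <- function(itemset) {
  sum(sapply(transactions, function(t) all(itemset %in% t))) / N
}

# Pass 1: frequent one-itemsets
items <- sort(unique(unlist(transactions)))
freq1 <- items[sapply(items, support) >= min_support]

# Pass 2: candidates are built only from frequent one-itemsets,
# since any itemset containing an infrequent item cannot be frequent
cand2 <- combn(freq1, 2, simplify = FALSE)
freq2 <- Filter(function(s) support(s) >= min_support, cand2)

print(freq1)
print(freq2)
```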

Mathematical Representations
• Support of an itemset:
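$$\text{support}(X) = \frac{\text{count}(X)}{N}$$
– Here $N$ is the number of transactions in the database and $\text{count}(X)$ is the number of transactions containing itemset $X$.
– The support of a rule ($X \rightarrow Y$) is the support of $\{X, Y\}$.
• Confidence of a rule:
$$\text{confidence}(X \rightarrow Y) = \frac{\text{support}(X, Y)}{\text{support}(X)}$$
• Lift ratio of a rule:
$$\text{lift}(X \rightarrow Y) = \frac{\text{confidence}(X \rightarrow Y)}{\text{support}(Y)}$$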

The Monty Hall Problem
• One of the three boxes contains a coin, and you need to guess which box it is
• Assume
– you choose Box 1;
– I can open one of the other two boxes (I cannot open Box 1);
– I open Box 2 and there is no coin;
– You have the chance to swap.
• Question: Do you want to stay with Box 1 or change to Box 3?
• Consider two scenarios:
– I do not know which box contains the coin
– I know which box contains the coin
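
The two scenarios can be compared with Bayes' theorem. The sketch below assumes a uniform prior over the three boxes and, in the "host knows" scenario, that the host never opens the box with the coin and picks at random when both other boxes are empty; these modelling assumptions are stated here and are not part of the slide.

```r
prior <- c(box1 = 1 / 3, box2 = 1 / 3, box3 = 1 / 3)

# Likelihood of the event "host opens Box 2 and it is empty",
# given where the coin actually is
lik_knows   <- c(box1 = 1 / 2, box2 = 0, box3 = 1)      # host avoids the coin
lik_unknown <- c(box1 = 1 / 2, box2 = 0, box3 = 1 / 2)  # host opens Box 2 or 3 at random

# Bayes' theorem: posterior is proportional to prior times likelihood
posterior <- function(lik) prior * lik / sum(prior * lik)

post_knows   <- posterior(lik_knows)
post_unknown <- posterior(lik_unknown)
print(rbind(host_knows = post_knows, host_does_not_know = post_unknown))
```

Under these assumptions, swapping wins with probability 2/3 when the host knows where the coin is, but only 1/2 when the host opened an empty box by chance.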

Motivating Example

Naïve Bayes Formula
• Again, suppose there are a total of $m$ classes for the response variable $Y$, and the prior probability for each class is $P(Y = k) = \pi_k$.
• Naïve Bayes classifier (recall $x = (x_1, x_2, \cdots, x_p)$):
$$\Pr(Y = k \mid X = x) = \frac{P(Y = k) \prod_{i=1}^{p} P(X_i = x_i \mid Y = k)}{\sum_{l=1}^{m} P(Y = l) \prod_{i=1}^{p} P(X_i = x_i \mid Y = l)}$$
• The above probabilities are easier to estimate from the training data.
• Then, for a new observation $X$, the Naïve Bayes classifier assigns it to the class that gives the largest probability from the above equation.
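
The formula can be evaluated by hand on a toy training set. The data frame `train`, its column names, and the new observation are all invented for illustration; probabilities are estimated as simple relative frequencies with no smoothing.

```r
# Hypothetical training data, for illustration only
train <- data.frame(
  weather = c("sunny", "sunny", "sunny", "rainy", "sunny", "rainy"),
  windy   = c("no", "yes", "no", "yes", "no", "yes"),
  play    = c("yes", "yes", "no", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)
new_obs <- c(weather = "sunny", windy = "yes")

classes <- unique(train$play)

# Numerator of the formula for each class k:
# P(Y = k) * product over predictors of P(X_i = x_i | Y = k)
score <- sapply(classes, function(k) {
  rows  <- train[train$play == k, ]
  prior <- nrow(rows) / nrow(train)
  lik   <- prod(sapply(names(new_obs),
                       function(v) mean(rows[[v]] == new_obs[[v]])))
  prior * lik
})

posterior <- score / sum(score)        # divide by the denominator (sum over classes)
predicted <- names(which.max(posterior))
print(posterior)
print(predicted)
```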

Lecture 6
• Discrete Distribution
– Binomial Distribution
• Continuous Distribution
– Normal Distribution
• Point Estimator
– Sample Mean
• Central Limit Theorem
– Sample Proportion

Pass the Exam!
• In an exam consisting of 10 true-or-false questions, if a student guesses the answer to each question at random, what is the probability that he will pass the exam if he needs a minimum of 6 correct answers to pass?
– R command: 1 - pbinom(5, 10, 0.5), we find that P(X ≥ 6) = 0.3769531.
• How would the probability change if the exam consists of 10 multiple-choice questions, each with 4 choices?
– In this case, the probability of guessing each question correctly is p = 0.25 (1 out of 4). All other parameter values remain the same.
– R command: 1 - pbinom(5, 10, 0.25), we find that P(X ≥ 6) = 0.01972771.

Family of Normal Distribution
• The normal distribution is actually a family of distributions.
• Each member of the family is characterized by $\mu$ and $\sigma$.
• If $\mu = 0$ and $\sigma = 1$, it is called the standard normal distribution.
• The Excel file Two Normals.xlsx is designed to show how changes in $\mu$ and $\sigma$ affect the central location and dispersion of a normal distribution.

Central Limit Theorem
Theorem: Regardless of the distribution of the original population (provided its variance is finite), for a large enough sample size (as a rule of thumb, $n \ge 30$), the distribution of the sample mean $\bar{x}$ will be approximately normal with mean equal to the population mean (i.e., $\mu_{\bar{x}} = \mu$) and standard error equal to the population standard deviation divided by the square root of the sample size (i.e., $\sigma_{\bar{x}} = \sigma / \sqrt{n}$).
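
A simulation sketch of the theorem, assuming a strongly skewed population (exponential with mean 1 and standard deviation 1, chosen here only for illustration):

```r
set.seed(1)
n    <- 30       # sample size
reps <- 10000    # number of repeated samples

# Population: exponential with rate 1, so mu = 1 and sigma = 1
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

sim_mean  <- mean(xbar)     # should be close to mu = 1
sim_se    <- sd(xbar)       # should be close to sigma / sqrt(n)
theory_se <- 1 / sqrt(n)
print(c(sim_mean = sim_mean, sim_se = sim_se, theory_se = theory_se))
```

Running `hist(xbar)` on the simulated means would show a roughly bell-shaped histogram even though the population itself is far from normal.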

Lectures 7 & 8
• Confidence Interval Estimation
– Confidence Interval for $\mu$
  • When $\sigma$ is known
  • When $\sigma$ is unknown
  • Sample Size Determination
– Confidence Interval for Proportion $p$
  • Sample Size Determination
• Introduction to Hypothesis Test
  • Two types of errors (significance level)
  • Three Forms of Hypothesis Statements
  • $p$-value
• Hypothesis Test for a Single Parameter
– Hypothesis Test about Mean $\mu$
  • $\sigma$ is known
  • $\sigma$ is unknown
– Hypothesis Test about Proportion $p$
• Comparisons of Means and Proportions
– Difference between Means ($\mu_1$ and $\mu_2$)
  • Independent samples
  • Paired samples
– Difference between Proportions ($p_1$ and $p_2$)
  • Independent samples

The 𝑝𝑝-value of the Test Statistic
• The sample statistic used to test a null hypothesis is called the “test statistic” (e.g., $\bar{x}$ in Example 7.5).
• In the language of hypothesis testing, the probability of getting a particular test statistic or a more extreme one when $H_0$ is true is referred to as the $p$-value.
• When the $p$-value is “small”, it is an indication that $H_0$ is unlikely to be true and hence can be rejected.
• But the question is: how small is considered small?
• This depends on how much risk of falsely rejecting a true $H_0$ the investigator is willing to accept (the significance level).
• For a left-tailed test, if θ* is the test statistic obtained, then its p-value is the area to its left, i.e., P(θ ≤ θ*).
• For a right-tailed test, if θ* is the test statistic obtained, then its p-value is the area to its right, i.e., P(θ ≥ θ*).
• For a two-tailed test, if θ* is the test statistic obtained, then its p-value is two times the smaller tail area; when θ* lies in the right tail, this is P(θ ≥ θ*) × 2.
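
For a test statistic that follows the standard normal distribution under the null hypothesis, the three p-values can be sketched with `pnorm()`. The observed value `z_obs = 1.8` is hypothetical.

```r
z_obs <- 1.8   # hypothetical observed z statistic

p_left  <- pnorm(z_obs)                       # left-tailed:  P(Z <= z_obs)
p_right <- pnorm(z_obs, lower.tail = FALSE)   # right-tailed: P(Z >= z_obs)
p_two   <- 2 * pnorm(-abs(z_obs))             # two-tailed: twice the smaller tail

print(round(c(left = p_left, right = p_right, two_sided = p_two), 4))
```

Using `-abs(z_obs)` makes the two-tailed calculation work whichever tail the statistic falls in.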

The R Functions
• t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
– Single sample:
• y = NULL: test of a single variable, unknown variance
• Can also be used for confidence interval estimation
– Two samples:
• paired = TRUE: paired samples
• var.equal = FALSE: unequal variances assumed (the default); var.equal = TRUE: equal variances assumed
• prop.test(x, n, conf.level = 0.95, correct = FALSE), where x holds the counts of successes and n the corresponding totals
– Single sample
– Two samples
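
A usage sketch of both functions. All numbers below (the measurement vectors, the counts, and the hypothesized values) are made up for illustration.

```r
# Hypothetical measurements, for illustration only
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2)
y <- c(4.8, 5.0, 5.2, 5.5, 5.9, 5.0, 5.1, 5.0)

one_sample <- t.test(x, mu = 5)                 # H0: mu = 5, sigma unknown
ci         <- one_sample$conf.int               # 95% confidence interval for mu
paired     <- t.test(x, y, paired = TRUE)       # paired samples
welch      <- t.test(x, y, var.equal = FALSE)   # independent samples, unequal variances
pooled     <- t.test(x, y, var.equal = TRUE)    # independent samples, equal variances

# Proportions: 45 successes out of 100 tested against p = 0.5,
# then two groups compared with each other
one_prop <- prop.test(x = 45, n = 100, p = 0.5, correct = FALSE)
two_prop <- prop.test(x = c(45, 60), n = c(100, 120), correct = FALSE)

print(one_sample$p.value)
print(ci)
print(two_prop$p.value)
```

Each call returns an object whose `statistic`, `p.value`, and `conf.int` components can be inspected directly, as done above.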

Lectures 9 & 10
• Linear Regression
– Simple Linear Regression
  • Regression Summary
  • Assumptions
  • Application
– Multiple Linear Regression
  • Regression Summary
  • Assumptions
  • Selecting Subsets of Predictors
  • Dummy Variables
• Extensions of the Linear Regressions

Example: Bordeaux Equation
• Large differences in price and quality between years, although wine is produced in a similar way
• Meant to be aged, so hard to tell if wine will be good when it is on the market
• Expert tasters predict which ones will be good
• Ashenfelter, a Princeton economics professor, claimed he could predict wine quality without tasting the wine (March 1990)
• Reaction from the world’s most influential wine expert:
– “Ashenfelter is an absolute total sham”
– “rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”

Adding More Predictors
Models compared (sets of predictors):
– Age, Kilometers
– Age, Kilometers, Horsepower
– Age, Kilometers, Horsepower, Weight
– All (~ .)
• Adding more predictors never decreases $R^2$
• Diminishing returns as more predictors are added
• Questions:
– Are these models good?
– Which one is the best? How to compare different models?
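
The behaviour of $R^2$ under nested models can be sketched on simulated data. The data frame `used_cars`, its coefficients, and the unrelated `Doors` column are all invented for illustration; only the predictor names echo the slide.

```r
set.seed(42)
n <- 200

# Simulated data: Price depends on Age, Kilometers and Horsepower only
used_cars <- data.frame(
  Age        = runif(n, 1, 10),
  Kilometers = runif(n, 10, 200),
  Horsepower = runif(n, 60, 150),
  Weight     = runif(n, 900, 1600),
  Doors      = sample(c(3, 5), n, replace = TRUE)
)
used_cars$Price <- 20000 - 1200 * used_cars$Age - 30 * used_cars$Kilometers +
  25 * used_cars$Horsepower + rnorm(n, sd = 1500)

# Nested models with progressively more predictors
formulas <- list(
  Price ~ Age + Kilometers,
  Price ~ Age + Kilometers + Horsepower,
  Price ~ Age + Kilometers + Horsepower + Weight,
  Price ~ .
)
fits <- lapply(formulas, lm, data = used_cars)

r2     <- sapply(fits, function(f) summary(f)$r.squared)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(rbind(r2, adj_r2), 4))
```

$R^2$ can only go up as predictors are added, whereas adjusted $R^2$ penalizes the extra terms, which is one reason it is preferred for comparing models of different sizes.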

Selecting Subsets of Predictors
• Not all available predictors should be used: more independent variables = more (possible) problems
– Each new predictor requires more data
– Causes overfitting: may include variables that are unrelated to the dependent variable
– Multicollinearity: two predictors are highly correlated
• If there is multicollinearity in the model, two predictors carry essentially the same information
• This may make the predictors seem insignificant when they should be significant
• By removing just one of them, it can be discovered that the other predictor is significant in the model
• Two preferred qualities of independent variable sets:
– The variables are linearly related to the dependent variable without being correlated with one another
– The set of independent variables is smaller rather than larger (assuming the model has good predictive & explanatory power)

Extensions of the Linear Regressions
• Standard linear regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
• In reality, Y and the X’s may not be related in the ways assumed previously. For example:
– The relationship between Y and X may be better described by a curve than by a linear equation.
– The value of Y may not be affected independently by the value of each Xi. The effect of a one-unit increase in Xi on Y may also depend on the values of other independent variables.
– Random errors may not have a constant variance. This causes the fluctuation of y around the regression line to depend on E(y).
• How do we build a regression model that can take the above situations into account?
– Interaction term
– Non-linear relationships, for example:
• quadratic model, polynomial model

Multiplicative Models
• The two most commonly used multiplicative models:
– $y = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon}$
– $y = \beta_0 x_1^{\beta_1} x_2^{\beta_2} \cdots x_k^{\beta_k} \epsilon$
• Both are non-linear models. For both models, the random error multiplies the systematic part (schematically, $y = E(y) \cdot \epsilon$). Thus, the variability of $y$ depends on $E(y)$. In other words, the random errors of both models do not have a constant variance.
• We can “linearize” the models by taking a log transformation:
– $\ln(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$
– $\ln(y) = \ln(\beta_0) + \beta_1 \ln(x_1) + \beta_2 \ln(x_2) + \cdots + \beta_k \ln(x_k) + \ln(\epsilon)$
• After log transformation, the variance of random errors can usually be stabilized to a constant.
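
A sketch of linearizing the second (power) model with one predictor, on simulated data. The true values $\beta_0 = 2$ and $\beta_1 = 1.5$ and the error distribution are assumptions made only for this illustration.

```r
set.seed(7)
n <- 200

x   <- runif(n, 1, 10)
eps <- exp(rnorm(n, sd = 0.1))   # multiplicative error, centred near 1
y   <- 2 * x^1.5 * eps           # y = beta0 * x^beta1 * eps

# Log transformation: ln(y) = ln(beta0) + beta1 * ln(x) + ln(eps)
fit <- lm(log(y) ~ log(x))

beta0_hat <- exp(coef(fit)[[1]])   # back-transform the intercept
beta1_hat <- coef(fit)[[2]]
print(c(beta0_hat = beta0_hat, beta1_hat = beta1_hat))
```

On the log scale the model is an ordinary linear regression, so `lm()` recovers estimates close to the values used in the simulation.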

Suggested Steps for Regression
1. Collect sample data to fit a selected model with the least squares method.
– Select the best set of predictors
2. Model Validity Checks:
– The adjusted $R^2$ takes the sample size and the number of predictors into consideration.
– The F statistic is used to test “the overall usefulness of the model”, i.e., whether the dependent variable $y$ is linearly related to at least one of the independent variables.
– An individual t test is used to test whether the dependent variable $y$ is linearly related to an independent variable $x_i$.
3. Diagnostic Checks of the Regression Model:
– Checking Model Assumptions: Check whether regression residuals are independently and normally distributed with a constant standard deviation.
– Checking Outliers: Check whether there are influential observations (outliers) in the data set that may have significant impact on the fitted regression model.
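
These steps can be sketched with R's built-in `mtcars` data set. The choice of data set, the predictors `wt` and `hp`, and the 4/n cutoff for Cook's distance are assumptions made for illustration, not part of the course material.

```r
# Step 1: fit a model by least squares
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

# Step 2: validity checks
adj_r2  <- s$adj.r.squared                     # adjusted R-squared
f       <- s$fstatistic                        # overall F test
f_pval  <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
t_table <- s$coefficients                      # individual t tests

# Step 3: diagnostic checks
res         <- residuals(fit)
normality_p <- shapiro.test(res)$p.value       # normality of residuals
cooks_d     <- cooks.distance(fit)             # influence of each observation
influential <- names(cooks_d)[cooks_d > 4 / nrow(mtcars)]   # rule-of-thumb cutoff

print(round(c(adj_r2 = adj_r2, f_pval = f_pval, normality_p = normality_p), 4))
print(t_table)
print(influential)
```

Calling `plot(fit)` would additionally draw the standard residual diagnostic plots (residuals vs fitted, normal Q-Q, and so on).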

Thank You and Wish You All the Best!
