Statistical Inference STAT 431
Lecture 1: Introduction
Course Information
• Instructor: Zongming Ma (zongming@wharton.upenn.edu)
• Teaching assistant: Ruijia “Rachel” Wu (ruijiawu@wharton.upenn.edu)
• Office hours:
– ZM: Mon 12 noon – 1 pm (or by appt);
– RW: Tue/Fri 2 – 3 pm.
• Textbook
– Statistics and Data Analysis: from Elementary to Intermediate
By Tamhane and Dunlop. Prentice Hall, 2000.
• Course website
– Canvas: http://canvas.upenn.edu
– All instruction (lectures & office hours) will be conducted via BlueJeans.
– Lecture notes will be posted prior to each lecture.
– Datasets, sample code, assignments, solutions, grades, etc.
Course Information
• Assignments
– Five assignments in total [Only four highest scores count toward your grade.]
– Collaboration is permitted, but individual write-up is required.
(with acknowledgment of your collaborators)
– No late homework will be accepted!
• Exams [Open-book]
– Midterm: Wednesday, October 7, 6-8 p.m.
– Final: TBA.
• Grading policy
– Homework assignments: 30%;
– Midterm: 30%;
– Final: 40%.
What Is This Course About?
• What is Statistics?
– “The art of making numerical conjectures about puzzling questions.”
— Freedman, Pisani, and Purves, Statistics, 3rd ed.
– “The science of collecting and analyzing data for the purpose of drawing conclusions and making decisions.”
— Tamhane and Dunlop, Statistics and Data Analysis.
• The goal of this course
– To examine a collection of important statistical concepts and methods.
(Methodology)
– To understand when and how to apply these methods, and why.
(Theory)
– To apply them to real-world data.
(Application & computing)
What You Should Already Know
• Probability at the level of STAT 430:
– Random variables;
– Probability distributions, probability density / mass function;
– Mean / variance / SD / quantile of a distribution;
– Jointly distributed random variables, conditional probability, independence;
– Covariance, correlation;
– Normal distribution, binomial distribution;
– Moment generating function;
– Law of large numbers, central limit theorem;
– Etc.
• Calculus will also be used.
• Linear algebra is not required.
• Previous exposure to statistical computing is not required.
Example: Revenue Neutral Tax Bill
• Problem setup:
– A senator proposes a new tax bill to simplify the tax code
– The senator claims that the changes are revenue neutral:
On balance, tax revenues will not change.
• Data collection:
– To evaluate the senator’s claim, a random sample of 100 tax returns is selected; they are recomputed using the proposed rule changes.
– Results: average change in sample = -$219, sample SD = $725.
Adapted from Freedman, Pisani, Purves, and Adhikari, Statistics, 2nd Ed.
Key Concept: Population vs. Sample
• Population / sample
– Unit: a single entity whose characteristics are of interest (tax return form);
– Population: the complete collection of units about which info is sought (all tax return forms of the U.S. in a year);
– Sample: a subset of a population that is actually observed.
• Variables / data
– Variables: measurable properties / attributes associated with each unit (change in tax paid);
– Data: a collection of measured values of variables.
• Parameter / statistic
– Parameter: a numerical characteristic of a population for a specific variable (average change in tax paid per tax return form);
– Statistic: a numerical function of the sample data, used to make inference about the unknown parameter.
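The distinction between a parameter and a statistic can be made concrete with a small simulation. The sketch below uses Python purely for illustration (the course software is R). It builds a hypothetical population of tax changes and compares the population mean (the parameter) with the mean of a random sample of 100 units (a statistic). The population size, its normal shape, the values -200 and 700, and the names `population` and `draw_sample` are all assumptions made for this sketch, not data from the example.

```python
import random
import statistics

# Hypothetical population: change in tax paid (in dollars) for 100,000 returns.
# The size and the normal shape are illustrative assumptions only.
rng = random.Random(431)
population = [rng.gauss(-200.0, 700.0) for _ in range(100_000)]

# Parameter: a fixed numerical characteristic of the whole population.
parameter = statistics.fmean(population)


def draw_sample(units, size, generator):
    """Return a simple random sample (without replacement) of the given size."""
    return generator.sample(units, size)


# Statistic: a function of the sample only; it varies from sample to sample.
sample_a = draw_sample(population, 100, rng)
sample_b = draw_sample(population, 100, rng)
statistic_a = statistics.fmean(sample_a)
statistic_b = statistics.fmean(sample_b)

print(f"parameter (population mean): {parameter:.1f}")
print(f"statistic from sample A:     {statistic_a:.1f}")
print(f"statistic from sample B:     {statistic_b:.1f}")
```

The parameter is computed once from all units and does not change, while the two sample means differ from each other and from the parameter. In practice only one sample is observed and the parameter is unknown.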
A Random Sample Paradigm
• Connection to probability
– The values of variables in the population are usually modeled by some probability distribution \(F\) with some unknown parameter \(\theta\).
– Each observation \(X_i\) is then a random variable following distribution \(F\).
• A random sample of size \(n\) refers to a set of RVs \(X_1, \ldots, X_n\) such that
1. The \(X_i\)’s are independent RVs; and
2. Every \(X_i\) has the same probability distribution.
We also say that the \(X_i\)’s are independent and identically distributed (IID).
• Then the statistic, i.e., a function of the \(X_i\)’s, follows a distribution (called the sampling distribution) that also depends on the unknown parameter \(\theta\). We make inference on \(\theta\) based on the sampling distribution.
• We typically write \(X_1, \ldots, X_n\) for a hypothetical random sample, and the lower case \(x_1, \ldots, x_n\) for their observed values.
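The idea of a sampling distribution can be explored by simulation: draw many IID samples, compute the statistic on each, and look at how the resulting values are spread. The following is a minimal sketch in Python (for illustration only; the course software is R). It assumes a normal population, and the values of `mu`, `sigma`, `n`, and `replications` are arbitrary choices made for the sketch. The statistic is the sample mean, whose standard deviation under this model is \(\sigma / \sqrt{n}\).

```python
import math
import random
import statistics

# Illustrative assumptions: a normal population N(mu, sigma^2) and these values.
mu, sigma = 5.0, 10.0
n = 25               # size of each random sample
replications = 5000  # number of independent samples drawn

rng = random.Random(2020)


def sample_mean(size):
    """Draw one IID sample X_1, ..., X_size from N(mu, sigma^2); return its mean."""
    return statistics.fmean(rng.gauss(mu, sigma) for _ in range(size))


# Each replication yields one realization of the statistic X-bar.
means = [sample_mean(n) for _ in range(replications)]

center = statistics.fmean(means)   # should be close to mu
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n)
theory_spread = sigma / math.sqrt(n)

print(f"mean of sample means: {center:.3f} (mu = {mu})")
print(f"SD of sample means:   {spread:.3f} (sigma/sqrt(n) = {theory_spread:.3f})")
```

The simulated values cluster around `mu` with a spread close to `sigma / sqrt(n)`, which shows how the sampling distribution depends on the unknown parameters.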
Example: Revenue Neutral Tax Bill (Cont’d)
• Parameter of interest: average change in tax paid over all tax return forms.
• Statistic: average change in tax paid over the sampled tax return forms.
• Sampling distribution of the statistic:
– Assuming a normal population \(X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)\).
– We are interested in the parameter \(\mu\).
– The statistic is \(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\).
– By STAT 430 knowledge, we know \(\bar{X} \sim N\left(\mu, \frac{1}{n}\sigma^2\right)\).
• From the observed data:
– Observed average change is -$219, with sample SD = $725 \(\approx \sigma\).
• What does the data say about the senator’s claim of “revenue neutral”?
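One way to start on this question is to measure how far the observed average lies from zero in units of the standard deviation of \(\bar{X}\). The sketch below (Python for illustration only; the course software is R) uses the figures from the example and rests on two assumptions: that "revenue neutral" is read as \(\mu = 0\) (the name `mu_0` is introduced here), and that the sample SD stands in for \(\sigma\) in the normal model above. It is a rough preview, not the formal procedure developed later in the course.

```python
import math

# Figures reported in the example.
n = 100            # number of sampled tax returns
x_bar = -219.0     # observed average change in tax paid (dollars)
sample_sd = 725.0  # sample SD, used here as an approximation to sigma

# Hypothesized value of mu under the "revenue neutral" claim (an assumption).
mu_0 = 0.0

# Standard deviation of X-bar under the normal model: sigma / sqrt(n).
standard_error = sample_sd / math.sqrt(n)

# How many standard errors the observed average lies from mu_0.
z = (x_bar - mu_0) / standard_error

# Two-sided normal tail probability P(|Z| >= |z|) for Z ~ N(0, 1).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"standard error = {standard_error:.1f}")
print(f"z = {z:.2f}")
print(f"two-sided tail probability = {p_two_sided:.4f}")
```

Under these assumptions the observed average is about three standard errors below zero, which would be unusual if \(\mu\) were really 0.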
Three Statistical Tasks
Given a question of interest …
1. Collecting data
– Which variables should be measured and how?
2. Summarizing and exploring data (descriptive statistics)
– It’s hard to think about a long list of numbers.
– Better to use summary statistics / tables / graphical displays.
3. Drawing conclusions and making decisions based on data (inferential statistics)
– Modeling: turning the question of interest into numerical conjectures about parameters in the model.
– Inference on the model parameters: estimation / testing / confidence interval.
– Connection to descriptive statistics: model assumptions & diagnostics.
An Overview of The Course
• Descriptive statistics
– Summary statistics / tables / graphical displays.
• Inferential statistics
– Sampling distributions of statistics;
– Inferences for one sample / two samples;
– Simple linear regression;
– Multiple linear regression;
– Likelihood – a general inference framework (time permitting).
• Application of statistical methodologies on data using R.
Statistical Computing
• Software: R
– Becoming a “lingua franca” for data analysts
— Vance, Data Analysts Captivated by R’s Power, January 6, 2009, NY Times.
– Freely available (with tutorial) at www.r-project.org.
– User-friendly interface: RStudio.
• In class
– Lectures will cover basic usage of the software.
(with demonstration on data examples)
• After class
– R sessions will be held based on students’ needs.
– Some tutorials and sample code will also be posted on course website.
Class Summary
• Key points of this class
– Population / sample, variable / data, parameter / statistic
– Random sample
• After today’s class
– Reading: Chapters 1 & 2 of the textbook
– Install R on your personal computer
• Next class: Summarizing Data – One Variable