Introduction
(Module 1)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
School of Mathematics and Statistics University of Melbourne
Semester 2, 2022
Aims of this module
• Brief information about this subject
• Brief revision of some prerequisite knowledge (probability)
• Introduce some basic elements of statistics, data analysis and visualisation
Subject information
Review of probability Descriptive statistics Basic data visualisations
What is statistics?
Let’s see some examples. . .
Climate change modelling
Discovery of the Higgs boson (the ‘God Particle’)
Smoking leads to lung cancer
“The Mortality of Doctors in Relation to Their Smoking Habits”, British Medical Journal (1954)
A/B testing for websites
41 shades of blue?!
Web analytics
Skin texture image analysis
Goals of statistics
• Answer questions using data
• Evaluate evidence
• Optimise study design
• Make decisions
And, importantly:
• Clarify assumptions
• Quantify uncertainty
Why study statistics?
“The best thing about being a statistician is that you get to play in everyone’s backyard.”
– John W. Tukey (1915–2000)
“I keep saying the sexy job in the next ten years will be statisticians. . . The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it’s going to be a hugely important skill in the next decades. . . ”
– Hal Varian, Google’s Chief Economist, Jan 2009
The best job
U.S. News Best Business Jobs in 2022:
1. Medical and Health Services Manager
2. Financial Manager
3. Statistician
CareerCast (recruitment website) best jobs of 2021:
1. Data Scientist
2. Genetic Counselor
3. Statistician
Subject overview
Statistics (MAST20005), Elements of Statistics (MAST90058)
These subjects introduce the basic elements of statistical modelling, statistical computation and data analysis. They demonstrate that many commonly used statistical procedures arise as applications of a common theory. They are an entry point to further study of both mathematical and applied statistics, as well as broader data science.
Students will develop the ability to fit statistical models to data, estimate parameters of interest and test hypotheses. Both classical and Bayesian approaches will be covered. The importance of the underlying mathematical theory of statistics and the use of modern statistical software will be emphasised.
Joint teaching
MAST20005 and MAST90058 share the same lectures but have separate tutorials and lab classes. The teaching and assessment material for both subjects will overlap significantly.
Subject website (LMS)
• Full information is on the subject website, available through the Learning Management System (LMS).
• Only a brief overview is covered in these notes. Please read all of the info on the LMS as well.
• New material (e.g. problem sets, assignments, solutions) and announcements will appear regularly on the LMS.
• This subject introduces basic statistical computing and programming skills.
• We make extensive use of the R statistical software environment.
• Knowledge of R will be essential for some of the tutorial problems and assignment questions, and will also be examined.
• We will use the RStudio program as a convenient interface with R.
Staff contacts
Subject coordinator / Lecturer
Dr Lecturer
Dr Tutorial coordinator
Dr Rekha
See the LMS for details of consultation times
Discussion forum
• Access via the LMS
• Post any general questions on the forum
• Do not send them by email to staff
• You can answer each other’s questions
• Staff will also help to answer questions
Student representatives
Student representatives assist the teaching staff to ensure good communication and feedback from students.
See the LMS to find the contact details of your representatives.
What is Data Science?
Data science is a ‘team sport’
How to succeed in statistics / data science?
• Get experience with real data
• Develop your computational skills, learn R
• Understand the mathematical theory
• Collaborate with others, use the discussion forum
This subject is challenging
• It is mathematical
◦ Manipulating equations
◦ Calculus
◦ Probability
◦ Proofs
• But the ‘real’ world also matters
◦ Context can ‘trump’ mathematics
◦ More than one correct answer
◦ Often uncertain about the answer
In 2017: 341 students
60% Bachelor of Commerce
24% Bachelor of Science
6% Master of Science (Bioinformatics)
10% 8 other degrees/categories
What are your strengths and weaknesses?
Get extra help
• Your classmates
• Discussion forum
• Consultation times
• Textbooks
1. Log in to the discussion forum
2. Install RStudio on your computer
3. Start reading the R introduction and reference guide before week 2
The best way to learn statistics is by solving problems and ‘getting your hands dirty’ with data.
We encourage you to attend all lectures, tutorials and computer labs to get as much practice and feedback as possible.
Good luck!
Review of probability
Why probability?
• It forms the mathematical foundation for statistical models and procedures
• Let’s review what we know already. . .
Random variables (notation)
• Random variables (rvs) are denoted by uppercase letters: X, Y , Z, etc.
• Outcomes, or realisations, of random variables are denoted by corresponding lowercase letters: x, y, z, etc.
Distribution functions
• The cumulative distribution function (cdf) of X is $F(x) = \Pr(X \le x)$, $-\infty < x < \infty$
• $F(x)$ increases to 1 as $x \to \infty$ and decreases to 0 as $x \to -\infty$
• If the rv has a certain distribution with pdf $f$ (or pmf $p$), we write $X \sim f$ (or $X \sim p$)
Example: Unemployment duration
A large group of individuals have recently lost their jobs. Let X denote the length of time (in months) that any particular individual will stay unemployed. It was found that this was well-described by the following pdf:
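\[ f(x) = \begin{cases} \tfrac{1}{2} e^{-x/2}, & x \ge 0, \\ 0, & \text{otherwise.} \end{cases} \]
[Figure: two plots for this distribution, with horizontal axes from −2 to 10 and from 0 to 15]
Clearly, $f(x) \ge 0$ for any $x$ and the total area under the pdf is:
\[ \Pr(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{\infty} \tfrac{1}{2} e^{-x/2}\,dx = \left[ -e^{-x/2} \right]_{0}^{\infty} = 1. \]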
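The two properties of this pdf can be checked numerically in R. The following is a minimal sketch, assuming the pdf is $f(x) = \tfrac{1}{2}e^{-x/2}$ for $x \ge 0$, which is the exponential density with rate 1/2. The grid of test points and the final probability question are illustrative choices, not part of the original example.

```r
# pdf from the example: f(x) = (1/2) exp(-x/2) for x >= 0, and 0 otherwise
f <- function(x) ifelse(x >= 0, 0.5 * exp(-x / 2), 0)

# Total area under the pdf, computed by numerical integration
total_area <- integrate(f, lower = 0, upper = Inf)$value
total_area

# The same density is built into R as dexp() with rate = 1/2
x_grid <- c(0, 0.5, 1, 2, 5, 10)   # illustrative test points
max_diff <- max(abs(f(x_grid) - dexp(x_grid, rate = 0.5)))
max_diff

# Illustrative question (an assumption, not from the slides):
# probability of being unemployed for at most 3 months, F(3) = 1 - exp(-3/2)
p_within_3 <- pexp(3, rate = 0.5)
p_within_3
```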
Normal distribution
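• $X \sim N(\mu, \sigma^2)$ with pdf
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in (-\infty, \infty),\ \mu \in (-\infty, \infty),\ \sigma > 0 \]
• It is important in applications because of the Central Limit Theorem (CLT)
• Properties:
\[ E(X) = \mu, \quad \mathrm{var}(X) = \sigma^2, \quad M_X(t) = e^{t\mu + t^2\sigma^2/2} \]
• When $\mu = 0$ and $\sigma = 1$ we have the standard normal distribution.
• If $X \sim N(\mu, \sigma^2)$, then
\[ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \]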
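Standardisation can be illustrated in R: a probability for $X \sim N(\mu, \sigma^2)$ equals the corresponding standard normal probability for $Z = (X - \mu)/\sigma$. The sketch below uses hypothetical values $\mu = 10$, $\sigma = 2$ and $x = 13$, chosen only for illustration. Note that R's `pnorm` and `dnorm` take the standard deviation, not the variance.

```r
mu <- 10      # hypothetical mean
sigma <- 2    # hypothetical standard deviation
x <- 13       # hypothetical point of interest

# Pr(X <= x) computed directly, and via the standardised value z
p_direct <- pnorm(x, mean = mu, sd = sigma)
z <- (x - mu) / sigma
p_standard <- pnorm(z)              # standard normal: mean 0, sd 1
c(p_direct, p_standard)

# The pdf formula written out by hand agrees with dnorm()
f_manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
f_builtin <- dnorm(x, mean = mu, sd = sigma)
c(f_manual, f_builtin)
```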
Quantiles
Let X be a continuous rv. The pth quantile of its distribution is a number $\pi_p$ such that $p = \Pr(X \le \pi_p) = F(\pi_p)$.
In other words, the area under $f(x)$ to the left of $\pi_p$ is $p$:
\[ p = \int_{-\infty}^{\pi_p} f(x)\,dx = F(\pi_p) \]
• $\pi_p$ is also called the (100p)th percentile
• The 50th percentile (0.5 quantile) is the median, denoted by $m = \pi_{0.5}$
• The 25th and 75th percentiles are the first and third quartiles, denoted by $q_1 = \pi_{0.25}$ and $q_3 = \pi_{0.75}$
Example: Weibull distribution
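The time X until failure of a certain product has the pdf
\[ f(x) = \frac{3x^2}{4^3}\, e^{-(x/4)^3}, \quad x \in (0, \infty). \]
The cdf is
\[ F(x) = 1 - e^{-(x/4)^3}, \quad x \in (0, \infty). \]
Then $\pi_{0.3}$ satisfies $0.3 = F(\pi_{0.3})$. Therefore,
\[ 1 - e^{-(\pi_{0.3}/4)^3} = 0.3 \]
\[ \Rightarrow \ln(0.7) = -(\pi_{0.3}/4)^3 \]
\[ \Rightarrow \pi_{0.3} = -4(\ln 0.7)^{1/3} = 2.84. \]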
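The quantile calculation can be reproduced in R in three ways: from the closed-form solution, from the built-in Weibull quantile function, and by solving $F(\pi_p) = p$ numerically. This is a minimal sketch, assuming the cdf is $F(x) = 1 - e^{-(x/4)^3}$, which corresponds to `shape = 3` and `scale = 4` in R's parameterisation. The closed form is written as $4(-\ln 0.7)^{1/3}$ because R returns `NaN` for a fractional power of a negative number.

```r
p <- 0.3

# cdf of the failure time from the example
cdf <- function(x) 1 - exp(-(x / 4)^3)

# Closed-form solution of cdf(x) = p
pi_closed <- 4 * (-log(1 - p))^(1 / 3)

# Built-in Weibull quantile function
pi_builtin <- qweibull(p, shape = 3, scale = 4)

# Numerical root of cdf(x) - p on an interval that brackets the answer
pi_root <- uniroot(function(x) cdf(x) - p, interval = c(0, 20), tol = 1e-9)$root

round(c(pi_closed, pi_builtin, pi_root), 2)
```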
Law of Large Numbers (LLN)
Consider a collection $X_1, \dots, X_n$ of independent and identically distributed (iid) random variables with $E(X) = \mu < \infty$. Then, with probability 1, we have:
\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu, \quad \text{as } n \to \infty. \]
The LLN ‘guarantees’ that long-run averages behave as we expect
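The LLN can be seen in a small simulation by tracking the running average of iid draws. The sketch below assumes exponential draws with mean 5 (rate 1/5); the distribution, seed and sample size are illustrative choices.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n <- 1e5
x <- rexp(n, rate = 1 / 5)       # iid draws with E(X) = 5

# Running average (1/k) * sum_{i <= k} x_i for k = 1, ..., n
running_mean <- cumsum(x) / seq_len(n)

# The running average settles near 5 as k grows
running_mean[c(10, 100, 1000, 10000, n)]

plot(running_mean, type = "l", log = "x",
     xlab = "number of observations", ylab = "running average")
abline(h = 5, lty = 2)           # the true mean
```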
Central Limit Theorem (CLT)
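Consider a collection $X_1, \dots, X_n$ of iid rvs with $E(X) = \mu < \infty$ and $\mathrm{var}(X) = \sigma^2 < \infty$. Let
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Then
\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
follows a $N(0, 1)$ distribution as $n \to \infty$.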
This is an extremely important theorem!
It provides the ‘magic’ that will make statistical analysis work.
Let X1,...,X25 be iid rvs where Xi ∼ Exp(λ = 1/5). Recall that E(X) = 5.
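Thus, the LLN implies
\[ \bar{X} \to E(X) = 5. \]
Moreover, since $\mathrm{var}(X) = 1/\lambda^2 = 25$, we have
\[ \bar{X} \approx N\!\left( \frac{1}{\lambda}, \frac{1}{n\lambda^2} \right) = N\!\left( 5, \frac{5^2}{25} \right) \]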
Is n = 25 large enough?
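One way to explore this question without simulation is to use the fact that the mean of $n$ iid $\mathrm{Exp}(\lambda)$ variables has an exact Gamma distribution with shape $n$ and rate $n\lambda$. The sketch below compares that exact cdf with the CLT approximation $N(5, 1)$ on a grid of points; the grid itself is an illustrative choice.

```r
n <- 25
lambda <- 1 / 5

q <- seq(2, 8, by = 0.01)        # illustrative grid of values for the sample mean

# Exact cdf of the sample mean: Gamma with shape n and rate n * lambda
exact <- pgamma(q, shape = n, rate = n * lambda)

# CLT approximation: normal with mean 1/lambda and variance 1/(n * lambda^2)
clt <- pnorm(q, mean = 1 / lambda, sd = sqrt(1 / (n * lambda^2)))

# Largest absolute disagreement between the two cdfs on the grid
max_abs_diff <- max(abs(exact - clt))
max_abs_diff
```

A small maximum difference suggests the approximation is reasonable at this sample size, while the remaining gap reflects the skewness of the exponential distribution.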
A simulation exercise
Generate B = 1000 samples of size n. For each sample compute $\bar{x}$. The continuous curve is the normal $N(5, 5^2/n)$ distribution prescribed by the CLT.
Sample 1: $x_1^{(1)}, \dots, x_n^{(1)} \to \bar{x}^{(1)}$
Sample 2: $x_1^{(2)}, \dots, x_n^{(2)} \to \bar{x}^{(2)}$
⋮
Sample B: $x_1^{(B)}, \dots, x_n^{(B)} \to \bar{x}^{(B)}$
Then represent the distribution of $\{\bar{x}^{(b)}, b = 1, \dots, B\}$ by a histogram.
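The steps of this exercise can be sketched in R as follows, assuming the samples come from the $\mathrm{Exp}(\lambda = 1/5)$ distribution of the preceding example. The seed, the number of histogram bins and the variable names are illustrative choices.

```r
set.seed(2022)                   # arbitrary seed, for reproducibility
B <- 1000                        # number of simulated samples
n <- 25                          # size of each sample
lambda <- 1 / 5

# For each of the B samples, draw n observations and keep the sample mean
xbar <- replicate(B, mean(rexp(n, rate = lambda)))

# Histogram of the B sample means on the density scale
hist(xbar, freq = FALSE, breaks = 30,
     main = paste("n =", n), xlab = "sample mean")

# Overlay the N(5, 5^2 / n) density prescribed by the CLT
curve(dnorm(x, mean = 1 / lambda, sd = (1 / lambda) / sqrt(n)), add = TRUE)

# Numerical summaries to compare with the CLT values 5 and 5 / sqrt(n)
c(mean = mean(xbar), sd = sd(xbar))
```

Changing `n` (for example to 100) and rerunning shows the histogram tightening around 5.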
A simulation exercise
The distribution of $\bar{X}$ approaches the theoretical distribution (CLT). Moreover, it will be more and more concentrated around $\mu$ (LLN). To see this, note that $\mathrm{var}(\bar{X}) = \sigma^2/n \to 0$ as $n \to \infty$.
[Figure: histograms of the simulated sample means with the CLT normal curve overlaid; panels include n = 25 and n = 100]
Challenge problem
Let $X_1, X_2, \dots, X_{25}$ be iid rvs with pdf $f(x) = a x^3$ where $0 < x < 2$.
1. What is the value of $a$?
2. Calculate $E(X_1)$ and $\mathrm{var}(X_1)$.
3. What is an approximate value of $\Pr(\bar{X} < 1.5)$?
Descriptive statistics
Statistics: the big picture
Example: Stress and cancer
• An experiment gives independent measurements on 10 mice
• Mice are divided into control and stress groups
• The biologist considers two different proteins:
◦ Vascular endothelial growth factor C (VEGFC)
◦ Prostaglandin-endoperoxide synthase 2 (COX2)
Mouse Group VEGFC COX2
1 Control 0.96718 14.05901
2 Control 0.51940 6.92926
3 Control 0.73276 0.02799
4 Control 0.96008 6.16924
5 Control 1.25964 7.32697
6 Stress 4.05745 6.45443
7 Stress 2.41335 12.95572
8 Stress 1.52595 13.26786
9 Stress 6.07073 55.03024
10 Stress 5.07592 29.92790
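To work with these measurements in R, the table can be entered as a data frame. This is a minimal sketch; the object name `mice` is an illustrative choice, and the group means at the end are shown only as a quick sanity check on the data entry.

```r
# The 10 measurements from the table, entered by hand
mice <- data.frame(
  Mouse = 1:10,
  Group = factor(rep(c("Control", "Stress"), each = 5)),
  VEGFC = c(0.96718, 0.51940, 0.73276, 0.96008, 1.25964,
            4.05745, 2.41335, 1.52595, 6.07073, 5.07592),
  COX2  = c(14.05901, 6.92926, 0.02799, 6.16924, 7.32697,
            6.45443, 12.95572, 13.26786, 55.03024, 29.92790)
)

str(mice)

# Mean of each protein within each group
group_means <- aggregate(cbind(VEGFC, COX2) ~ Group, data = mice, FUN = mean)
group_means
```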
Data & sampling
• The data are numbers: $x_1, \dots, x_n$
• The model for the data is a random sample, that is, a sequence of iid rvs: $X_1, X_2, \dots, X_n$. This model is equivalent to random selection from a hypothetical infinite population.
• The goal is to use the data to learn about the distribution of the random variables (and, therefore, the population).
• A statistic T = φ(X1,...,Xn) is a function of the sample and its realisation is denoted by t = φ(x1, . . . , xn).
• Note: the word “statistic” can refer to either the realisation, t, or the random variable, T. Sometimes we need to be more specific about which one is meant.
• A statistic has two purposes:
◦ Describe or summarise the sample — descriptive statistics
◦ Estimate the distribution generating the sample — inferential statistics
• A statistic can be both descriptive and inferential; it depends on how you wish to use/interpret it (see later)
• We now introduce some commonly used descriptive statistics. . .
Moment statistics
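\[ \text{Sample mean} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{23.59}{10} = 2.359 \]
\[ \text{Sample variance} = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 3.98761 \]
\[ \text{Sample standard deviation} = s = \sqrt{3.98761} = 1.9969 \]
These are ‘sample’ or ‘empirical’ versions of moments of a random variable.
Empirical means ‘derived from the data’.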
Order statistics
Arrange the sample x1, . . . , xn in order of increasing magnitude and define:
\[ x_{(1)} \le x_{(2)} \le \dots \le x_{(n)} \]
Then $x_{(k)}$ is the kth order statistic.
Special cases:
• x(1) is the sample minimum
• x(n) is the sample maximum
For the example data,
x(1) = 0.52, x(2) = 0.73, ..., x(10) = 6.07
What is x(3.25)?
Let it be 0.25 of the way from x(3) to x(4),
x(3.25) = x(3) + 0.25 · (x(4) − x(3)) = 0.96 + 0.25 · (0.97 − 0.96) = 0.9625
In other words, define it via linear interpolation.
Exercise: verify that x(7.75) = 3.6475
Why do this? It allows us to define. . .
Sample quantiles
General definition (‘Type 7’ quantiles):
\[ \hat{\pi}_p = x_{(k)}, \quad \text{where } k = 1 + (n - 1)p \]
Special cases:
\[ \text{Sample median} = \hat{\pi}_{0.5} = x_{(5.5)} = \frac{1.26 + 1.53}{2} = 1.395 \]
\[ \text{Sample 1st quartile} = \hat{\pi}_{0.25} = x_{(3.25)} = 0.9625 \]
\[ \text{Sample 3rd quartile} = \hat{\pi}_{0.75} = x_{(7.75)} = 3.6475 \]
\[ \text{Interquartile range} = \hat{\pi}_{0.75} - \hat{\pi}_{0.25} = 2.685 \]
$\hat{\pi}_{0.25}$ and $\hat{\pi}_{0.75}$ contain about 50% of the sample between them
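The Type 7 definition can be implemented directly from the interpolated order statistics and compared with R's `quantile()`, whose default is `type = 7`. This is a minimal sketch using the rounded VEGFC values; the helper names `order_stat` and `type7_quantile` are hypothetical, introduced only for illustration.

```r
# VEGFC values rounded to 2 decimal places, as in the example
x <- c(0.97, 0.52, 0.73, 0.96, 1.26, 4.06, 2.41, 1.53, 6.07, 5.08)

# Order statistic x_(k) for possibly fractional k, by linear interpolation
order_stat <- function(x, k) {
  xs <- sort(x)
  lo <- floor(k)                 # index of the order statistic just below k
  hi <- ceiling(k)               # index of the order statistic just above k
  xs[lo] + (k - lo) * (xs[hi] - xs[lo])
}

# Type 7 sample quantile: x_(k) with k = 1 + (n - 1) * p
type7_quantile <- function(x, p) order_stat(x, 1 + (length(x) - 1) * p)

probs <- c(0.25, 0.5, 0.75)
q_manual <- sapply(probs, function(p) type7_quantile(x, p))
q_builtin <- quantile(x, probs = probs, type = 7, names = FALSE)

rbind(q_manual, q_builtin)

# Interquartile range from the manual quantiles
iqr_manual <- q_manual[3] - q_manual[1]
iqr_manual
```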
Some descriptive statistics in R
> x <- round(VEGFC, digits = 2)
> x
[1] 0.97 0.52 0.73 0.96 1.26 4.06 2.41 1.53 6.07 5.08
> sort(x) # order statistics
[1] 0.52 0.73 0.96 0.97 1.26 1.53 2.41 4.06 5.08 6.07
> summary(x) # sample mean & sample quantiles
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5200 0.9625 1.3950 2.3590 3.6475 6.0700
> var(x) # sample variance
[1] 3.98761
> sd(x) # sample standard deviation
[1] 1.9969
> IQR(x) # interquartile range
[1] 2.685
Frequency statistics
We can also define empirical versions of the pdf, pmf and cdf.
We will see these in the next section. . .
Basic data visualisations
Box plot
Graphical summary of data from a single variable
Main box: $\hat{\pi}_{0.25}$, $\hat{\pi}_{0.5}$, $\hat{\pi}_{0.75}$
‘Whiskers’: x(1), x(n)
(but R does something more complicated, see tutorial problems)
Convenient way of comparing data from different groups
Example: VEGFC (Stress vs Control)
[Figure: side-by-side box plots of VEGFC for the Control and Stress groups]
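A comparison of this kind can be drawn with R's `boxplot()` and a formula of the form `response ~ group`. The sketch below re-enters the VEGFC values from the example table; the axis label and title are illustrative choices.

```r
VEGFC <- c(0.96718, 0.51940, 0.73276, 0.96008, 1.25964,
           4.05745, 2.41335, 1.52595, 6.07073, 5.07592)
Group <- factor(rep(c("Control", "Stress"), each = 5))

# One box per group; the returned list holds the numbers that were drawn
bp <- boxplot(VEGFC ~ Group, ylab = "VEGFC",
              main = "VEGFC (Stress vs Control)")

# One column per group; rows are: lower whisker, lower hinge, median,
# upper hinge, upper whisker
bp$stats
```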
Scatter plot
For comparing data from two variables (usually continuous)
[Figure: scatter plot of the example data; one axis runs from 0 to 50]
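A scatter plot of the two proteins can be produced with `plot()`. In the sketch below, which variable goes on which axis and the use of different plotting symbols for the two groups are assumptions made for illustration, since the original figure is not available.

```r
VEGFC <- c(0.96718, 0.51940, 0.73276, 0.96008, 1.25964,
           4.05745, 2.41335, 1.52595, 6.07073, 5.07592)
COX2  <- c(14.05901, 6.92926, 0.02799, 6.16924, 7.32697,
           6.45443, 12.95572, 13.26786, 55.03024, 29.92790)
Group <- factor(rep(c("Control", "Stress"), each = 5))

# Open circles for Control, filled circles for Stress
symbols_by_group <- ifelse(Group == "Stress", 19, 1)

plot(COX2, VEGFC, pch = symbols_by_group, xlab = "COX2", ylab = "VEGFC")
legend("topleft", legend = levels(Group), pch = c(1, 19))
```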
Empirical cdf
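The sample cdf, or empirical cdf, is defined as
\[ \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x), \]
where $I(\cdot)$ is the indicator function ($I(x_i \le x)$ has value 1 if $x_i \le x$ and value 0 if $x_i > x$).
For example, for the previous data,
\[ \hat{F}(2) = \frac{1}{10} \sum_{i=1}^{10} I(x_i \le 2) = \frac{6}{10} = 0.6 \]
[Figure: plot of ecdf(VEGFC), with vertical axis from 0 to 1]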
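In R, `ecdf()` returns the empirical cdf as a function that can be evaluated and plotted. The sketch below uses the rounded VEGFC values and also computes the same quantity by hand as the proportion of observations that are at most 2.

```r
# VEGFC values rounded to 2 decimal places, as in the example
x <- c(0.97, 0.52, 0.73, 0.96, 1.26, 4.06, 2.41, 1.53, 6.07, 5.08)

Fhat <- ecdf(x)                  # a step function

# Empirical cdf at 2, via ecdf() and directly from the definition
Fhat_at_2 <- Fhat(2)
by_hand <- mean(x <= 2)          # (1/n) * sum of the indicators I(x_i <= 2)
c(Fhat_at_2, by_hand)

plot(Fhat, main = "ecdf(VEGFC)")
```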
It has the form of a discrete cdf. However, it will approximate the cdf of a continuous variable if the sample size is large. The following diagram shows cdfs based on n = 50 and n = 200 observations sampled from a standard normal distribution, N(0, 1).
[Figure: empirical cdfs of the two samples, plotted for x from -3 to 3 with vertical axis from 0 to 1]
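The approximation can be explored by simulation: draw standard normal samples of size 50 and 200 and measure the largest vertical gap between each empirical cdf and the true cdf `pnorm`. This is a sketch under illustrative assumptions (the seed and the helper name `sup_dist` are not from the slides). The gap is typically smaller for the larger sample, although this is not guaranteed for any single pair of samples.

```r
set.seed(42)                     # arbitrary seed, for reproducibility

# Largest vertical distance between the empirical cdf of a N(0, 1) sample
# of size n and the true cdf. The empirical cdf jumps from (i - 1)/n to i/n
# at the i-th sorted value, so both sides of each jump are checked.
sup_dist <- function(n) {
  z <- sort(rnorm(n))
  i <- seq_len(n)
  max(abs(i / n - pnorm(z)), abs((i - 1) / n - pnorm(z)))
}

d50 <- sup_dist(50)
d200 <- sup_dist(200)
c(n50 = d50, n200 = d200)

# Visual comparison for one sample of size 200
z <- rnorm(200)
plot(ecdf(z), main = "n = 200", xlim = c(-3, 3))
curve(pnorm(x), add = TRUE, lty = 2)
```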
Empirical pmf
If the underlying variable is discrete we use the pmf corresponding to the sample cdf $\hat{F}$:
\[ \hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i = x) \]
For example, the following shows $\hat{p}(x)$ from a sample of size n = 15 from the Poisson distribution Pn(5) (left) and the true pmf p(x) of Pn(5) (right)
[Figure: empirical pmf $\hat{p}(x)$ (left) and true pmf p(x) (right), each plotted against x]
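A comparison like this can be reproduced in R with `table()` for the empirical pmf and `dpois()` for the true pmf. The sketch below draws its own sample, so the values will differ from those in the original figure; the seed and the plotted range 0 to 13 are illustrative choices.

```r
set.seed(7)                      # arbitrary seed, for reproducibility
n <- 15
x <- rpois(n, lambda = 5)        # sample of size 15 from Pn(5)

# Empirical pmf: proportion of observations equal to each observed value
p_hat <- table(x) / n
p_hat

# True pmf over an illustrative range of values
support <- 0:13
p_true <- dpois(support, lambda = 5)

# Side-by-side plots: empirical pmf (left) and true pmf (right)
old_par <- par(mfrow = c(1, 2))
plot(as.integer(names(p_hat)), as.numeric(p_hat), type = "h",
     xlab = "x", ylab = "empirical pmf")
plot(support, p_true, type = "h", xlab = "x", ylab = "p(x)")
par(old_par)
```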