Introduction to information system

Inference

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Objectives

• Population and Sample
– Basic Concepts
– Sample Mean, Median, Mode
• Sampling Distribution
– Central Limit Theorem (CLT)
– Z-Score
• Unbiased Estimator
• Several Topics For Your Direct Study
– Chi-Square Distribution
– t Distribution
– F Distribution
• Appendix: Sample Quartile

Quick Recap on Continuous Distributions!

Continuous Uniform Distribution

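• Notation

$X \sim U(a, b)$

• PDF

$$f(x; a, b) = \begin{cases} \dfrac{1}{b - a}, & \text{if } x \in [a, b], \\ 0, & \text{otherwise.} \end{cases}$$

• Expectation and variance

$$\mathbb{E}(X) = \frac{a + b}{2}, \qquad \mathbb{V}(X) = \frac{(b - a)^2}{12}.$$

$$\text{Area} = \int_a^b f(x; a, b)\, dx = 1$$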
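The area, expectation and variance formulas above can be checked numerically. The following is a minimal sketch in Python, assuming illustrative endpoints a = 0 and b = 24; the helper names `uniform_pdf` and `midpoint_integral` are hypothetical and not part of the lecture.

```python
def uniform_pdf(x, a, b):
    """PDF of U(a, b): 1 / (b - a) on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def midpoint_integral(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


a, b = 0.0, 24.0  # illustrative endpoints (assumption)

# Area under the PDF, first moment, and second moment by numerical integration.
area = midpoint_integral(lambda x: uniform_pdf(x, a, b), a, b)
mean = midpoint_integral(lambda x: x * uniform_pdf(x, a, b), a, b)
second_moment = midpoint_integral(lambda x: x * x * uniform_pdf(x, a, b), a, b)
variance = second_moment - mean ** 2

print(area, mean, variance)             # numerical values
print((a + b) / 2, (b - a) ** 2 / 12)   # closed forms from the slide
```

The numerical values should agree with the closed forms up to the small error of the midpoint rule.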

Example

If we want to look at the same size of the observations but project them only onto [0, 24], the continuous uniform distribution can be used!

… …

Exponential Distribution

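• Notation

$X \sim Exp(\lambda)$

• PDF

$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise.} \end{cases}$$

• Expectation and variance

$$\mathbb{E}(X) = \frac{1}{\lambda}, \qquad \mathbb{V}(X) = \frac{1}{\lambda^2}.$$

$$\text{Area} = \int_0^{\infty} f(x; \lambda)\, dx = 1$$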
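The expectation and variance above can also be checked by simulation. This is a rough sketch, assuming an illustrative rate of 2 and a fixed random seed; it uses `random.expovariate` from the Python standard library, and the variable names are only for illustration.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

lam = 2.0        # illustrative rate parameter (assumption)
n_draws = 100_000

draws = [random.expovariate(lam) for _ in range(n_draws)]

# Sample mean and (divide-by-n) sample variance of the simulated draws.
sample_mean = statistics.fmean(draws)
sample_var = statistics.fmean((x - sample_mean) ** 2 for x in draws)

print(sample_mean, 1 / lam)       # should be close to 1 / lambda
print(sample_var, 1 / lam ** 2)   # should be close to 1 / lambda^2
```

With this many draws the simulated values should sit close to 1/λ and 1/λ², but they will not match exactly.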

Memoryless Property

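Let $X$ be exponentially distributed with parameter $\lambda$. Suppose we know $X > x_1$. What is the probability that $X$ is also greater than some value $x_1 + x_2$, where $x_2 \ge 0$?

$$\mathbb{P}(X > x_1 + x_2 \mid X > x_1) = \frac{\mathbb{P}(X > x_1 + x_2 \text{ and } X > x_1)}{\mathbb{P}(X > x_1)}$$

If $X > x_1 + x_2$, then $X > x_1$. Therefore

$$\mathbb{P}(X > x_1 + x_2 \mid X > x_1) = \frac{\mathbb{P}(X > x_1 + x_2)}{\mathbb{P}(X > x_1)} = \frac{e^{-\lambda (x_1 + x_2)}}{e^{-\lambda x_1}} = e^{-\lambda x_2} = \mathbb{P}(X > x_2)$$

The memoryless property means that the future is independent of the past.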
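The derivation above can be checked both with the closed-form survival function and by simulation. This is a minimal sketch, assuming illustrative values for λ, x1 and x2; the function name `survival` is hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable


def survival(x, lam):
    """P(X > x) for X ~ Exp(lam), valid for x >= 0."""
    return math.exp(-lam * x)


lam, x1, x2 = 0.5, 2.0, 3.0  # illustrative values (assumption)

# Closed form: P(X > x1 + x2 | X > x1) versus P(X > x2).
conditional = survival(x1 + x2, lam) / survival(x1, lam)
unconditional = survival(x2, lam)

# Simulation: among draws that exceed x1, how many also exceed x1 + x2?
draws = [random.expovariate(lam) for _ in range(200_000)]
survivors = [x for x in draws if x > x1]
simulated = sum(x > x1 + x2 for x in survivors) / len(survivors)

print(conditional, unconditional, simulated)
```

The first two numbers agree up to floating-point rounding; the simulated fraction should be close to them.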

Normal/Gaussian Distribution

• Notation

$X \sim \mathcal{N}(\mu, \sigma^2)$

• PDF

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

where $-\infty < x < \infty$.

• Expectation and variance

$$\mathbb{E}(X) = \mu, \qquad \mathbb{V}(X) = \sigma^2.$$

$$\text{Area} = \int_{-\infty}^{\infty} f(x; \mu, \sigma^2)\, dx = 1$$

Standard Normal Distribution

The special case $\mu = 0$, $\sigma^2 = 1$, i.e. $Z \sim \mathcal{N}(0, 1)$.

Population and Sample

Population is the larger set of objects that we wish to study.

Sample is a set of "representative" objects that we choose in order to estimate the characteristics of the population.

Example

The population is 5 students, indexed by 1, 2, 3, 4, 5. You want to randomly sample 3 students to analyse. How many different samples can you have?

$$\binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$

With $N = 5$ and $n = 3$:

$$\binom{5}{3} = \frac{5!}{3!\,(5 - 3)!} = \frac{120}{6 \times 2} = 10$$

[Table: the 10 possible samples; each column is a possible sample]

[Diagram: Population (parameter $\theta$, model $f(x; \theta)$) → Sampling → Sample (statistic/estimator $\hat{\theta}$) → Inference → Population]

Random Sample

The random variables $x_1, \cdots, x_n$ are called a random sample of size $n$ from the population $f(x)$ if $x_1, \cdots, x_n$ are mutually independent random variables and the marginal PDF or PMF of each $x_i$ is the same function $f(x)$.

In other words, $x_1, \cdots, x_n$ are called independent and identically distributed (i.i.d. or IID) random variables with PDF or PMF $f(x)$.

Joint PDF/PMF

Expression without parameter:

$$f(x_1, \cdots, x_n) = f(x_1) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i)$$

Expression with parameter:

$$f(x_1, \cdots, x_n \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \mathcal{L}(\theta)$$

This is called the likelihood function; it will be studied in later lectures. Please Google it if you are interested!

Sample Mean

The average of a sample is called the sample mean. Given some quantitative data $x_1, x_2, \cdots, x_n$, the sample mean, denoted by $\bar{x}$, is defined as

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

Note: the above definition is the arithmetic mean.

Example: Sample Mean

In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

The sample mean is then

$$\bar{x} = \frac{1 + 2 + 1 + \cdots + 0}{20} = \frac{33}{20} = 1.65$$

Sample Median

The sample median is the middle value of a distribution of numbers, denoted by $m$.
Given some quantitative data $x_1, x_2, \cdots, x_n$, the sample median is

$$m = \begin{cases} x_{(k+1)}, & \text{if } n = 2k + 1 \text{ (odd number)}, \\ \dfrac{1}{2}\left(x_{(k)} + x_{(k+1)}\right), & \text{if } n = 2k \text{ (even number)}, \end{cases}$$

where $x_{(1)}, x_{(2)}, \cdots, x_{(n)}$ are the data sorted from smallest to largest, called order statistics.

Example: Sample Median

In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

The order statistics are then

0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4

so, with $n = 20$ and $k = 10$, $m = \frac{1}{2}\left(x_{(10)} + x_{(11)}\right) = \frac{1}{2}(1 + 2) = 1.5$.

Sample Mode

The sample mode is the value in the sample data that is repeated more often than any other.

In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

Number of cars    Frequency
0                 4
1                 6
2                 5
3                 3
4                 2

The sample mode is 1.

Measures of central location

• Mean
– Advantage: all the data is used to find the answer
– Disadvantage: very large or very small numbers can distort the answer
• Median
– Advantage: very big and very small values don't affect it
– Disadvantage: takes a long time to calculate for a very large set of data
• Mode
– Advantage: the only average we can use when the data is not numerical
– Disadvantages: 1) there may be more than one mode; 2) there may be no mode at all if no value is repeated; 3) it may not accurately represent the data

Sampling Distribution

Since an estimator $\hat{\theta}$ is a function of random variables, it follows that $\hat{\theta}$ is also a random variable. The probability distribution of an estimator is called a sampling distribution.

Example:

• Sample mean
• Sample variance

[Diagram: statistic/estimator $\hat{\theta}$ → Inference → parameter $\theta$ of $f(x; \theta)$]

Central Limit Theorem (CLT)

Let $x_1, \cdots, x_n$ be a set of i.i.d. random variables, each with population mean $\mu$ and finite variance $\sigma^2$. Then, as $n \to \infty$,

$$\sqrt{n}\left(\frac{\sum_{i=1}^{n} x_i}{n} - \mu\right) \to \mathcal{N}(0, \sigma^2)$$

in distribution.

Hint: use the moment generating function $m_X(t) = \mathbb{E}(e^{Xt})$; please Google it!
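The CLT statement above can be explored empirically. The sketch below is only an illustration under stated assumptions: it draws repeated samples from an Exp(1) population (a skewed distribution with μ = 1 and σ² = 1), uses an arbitrary sample size and number of repetitions, and checks that the scaled sample means behave roughly like draws from N(0, σ²).

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

lam = 1.0                                 # Exp(1) population (assumption)
mu, sigma2 = 1.0 / lam, 1.0 / lam ** 2    # population mean and variance
n = 100                                   # size of each sample
reps = 4_000                              # number of repeated samples

# For each repetition, compute sqrt(n) * (sample mean - mu).
scaled = []
for _ in range(reps):
    xbar = statistics.fmean(random.expovariate(lam) for _ in range(n))
    scaled.append(math.sqrt(n) * (xbar - mu))

scaled_mean = statistics.fmean(scaled)
scaled_var = statistics.fmean((s - scaled_mean) ** 2 for s in scaled)

# Under N(0, sigma^2), about 95% of values fall within 1.96 * sigma of 0.
within = sum(abs(s) <= 1.96 * math.sqrt(sigma2) for s in scaled) / reps

print(scaled_mean, scaled_var, within)
```

The printed mean should be near 0, the variance near σ² = 1, and the coverage near 0.95, even though the population itself is far from normal.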
Z-Score

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)$$

When you obtain a z-score, check its corresponding probability from the standard normal table.

[Table: standard normal table. Example reading: 0.5 + 0.3944 = 0.8944]

Unbiased Estimator

Let $X$ be a random variable with PDF $f(x; \theta)$. Let $x_1, \cdots, x_n$ be a random sample from the distribution of $X$. Let $\hat{\theta}$ denote a statistic; then it is an unbiased estimator of $\theta$ if

$$\mathbb{E}(\hat{\theta}) = \theta$$

If $\hat{\theta}$ is not unbiased, we say that $\hat{\theta}$ is a biased estimator of $\theta$, with

$$Bias(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$$

Example

Let $x_1, \cdots, x_n$ be a random sample from the distribution of $X$, where $X$ has an unknown mean $\mu$ and variance $\sigma^2$. Then

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

is an unbiased estimator of $\mu$, since $\mathbb{E}(\bar{x}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(x_i) = \mu$.

Question

We create a sample x in R [code not shown]. The standard deviation of x in R is [output not shown]. However, using the definition of the (population) standard deviation, we have

$$\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} = 2.312345$$

Hint: unbiased estimator for the population variance (R's sd() divides by $n - 1$, not $n$).

Several Topics For Your Direct Study

Please read Casella's book if you are interested!

• Chi-Square Distribution
– It is a special case of the Gamma distribution
– It explains the sample variance calculation
• t Distribution
– It is useful for making inferences about the mean of a normal distribution when the variance is also unknown
• F Distribution
– It is useful for making inferences about the ratio of two variances (when we assume that the variances of two populations are equal)

References

• G. Casella and R. Berger (2002) Statistical Inference. Chapter 5

Thank You!

bchen@Lincoln.ac.uk

Appendix: Quartile

Quantile

Given some quantitative data $x_1, x_2, \cdots, x_n$, the sample quantile level $p_{[k]}$ attached to the $k$th order statistic is defined by

$$p_{[k]} = \frac{k - 1}{n - 1}, \qquad k \le n, \quad 0 \le p_{[k]} \le 1$$

Then the $p$th quantile is the $\left(p_{[k]}(n - 1) + 1\right)$st order statistic. When $p_{[k]}(n - 1) + 1$ is not an integer, linear interpolation is used between adjacent order statistics to arrive at the $p$th quantile.

Note: this is just one popular way to calculate quantiles, used by R and S-Plus!

Linear Interpolation (1/2)

Recall the definition of the sample median when $n = 2k$ (even number): $m = \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right)$.
This is linear interpolation because

$$\frac{m - x_{(k)}}{x_{(k+1)} - x_{(k)}} = \frac{1}{2} \;\Rightarrow\; m = \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right) = x_{(k)} + \frac{1}{2}\left(x_{(k+1)} - x_{(k)}\right).$$

Also, by this definition, the 0.5 quantile (or 50th percentile) is the median, since

$$0.5 = \frac{k - 1}{n - 1} \;\Rightarrow\; k = \frac{n + 1}{2}.$$

Linear Interpolation (2/2)

Write $k = p_{[k]}(n - 1) + 1$. When $k$ is not an integer, the sample $p$th quantile is

$$x_{(\lfloor k \rfloor)} + \left(k - \lfloor k \rfloor\right)\left(x_{(\lfloor k \rfloor + 1)} - x_{(\lfloor k \rfloor)}\right),$$

where $\lfloor a \rfloor$ is the floor function of a number $a \in \mathbb{R}$, representing the largest integer not greater than $a$.

a       floor of a
2       2
2.4     2
2.9     2
−2.7    −3
−2      −2

Quartiles

The first (25%), second (50%), and third (75%) sample quartiles are denoted by $Q_1$, $Q_2$, $Q_3$, respectively.

Given the order statistics $x_{(1)} = 1$, $x_{(2)} = 4$, $x_{(3)} = 7$, $x_{(4)} = 9$, $x_{(5)} = 10$, $x_{(6)} = 14$, $x_{(7)} = 15$, $x_{(8)} = 16$, $x_{(9)} = 20$, $x_{(10)} = 21$, calculate $Q_1$, $Q_2$, $Q_3$.

• $0.25 = \frac{k - 1}{10 - 1} \Rightarrow k = 3.25$
• $0.5 = \frac{k - 1}{10 - 1} \Rightarrow k = 5.5$
• $0.75 = \frac{k - 1}{10 - 1} \Rightarrow k = 7.75$
• $Q_1 = x_{(3)} + 0.25\left(x_{(4)} - x_{(3)}\right) = 7.5$
• $Q_2 = x_{(5)} + 0.5\left(x_{(6)} - x_{(5)}\right) = 12$
• $Q_3 = x_{(7)} + 0.75\left(x_{(8)} - x_{(7)}\right) = 15.75$
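The quartile calculation above can be sketched in Python. The function name `sample_quantile` is hypothetical; it follows the interpolation rule of this appendix (position p(n − 1) + 1, then linear interpolation between the neighbouring order statistics). As a cross-check, the sketch also calls `statistics.quantiles` with `method="inclusive"`, which is assumed here to implement the same rule.

```python
import math
import statistics


def sample_quantile(sorted_data, p):
    """p-th sample quantile by linear interpolation between order statistics."""
    n = len(sorted_data)
    position = p * (n - 1) + 1      # 1-based, possibly fractional, index k
    k = math.floor(position)
    weight = position - k           # fractional part used for interpolation
    if k >= n:                      # p = 1 maps to the largest observation
        return sorted_data[-1]
    lower, upper = sorted_data[k - 1], sorted_data[k]
    return lower + weight * (upper - lower)


data = [1, 4, 7, 9, 10, 14, 15, 16, 20, 21]   # order statistics from the slide

q1 = sample_quantile(data, 0.25)
q2 = sample_quantile(data, 0.50)
q3 = sample_quantile(data, 0.75)
print(q1, q2, q3)   # 7.5 12.0 15.75

# Cross-check against the standard library.
print(statistics.quantiles(data, n=4, method="inclusive"))   # [7.5, 12.0, 15.75]
```

Other quantile conventions exist and can give different values on the same data, so results from other tools should be compared with care.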