CS570 Biomedical Science & Health IT
CS544 D1
Foundations of Analytics
Module 5
Guanglan Zhang
Inference about a population
A parameter is a number that describes a population.
A statistic is a number that is computed from the sample data.
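Typical notation for population parameters and their corresponding sample statistics:
Mean: μ (population parameter), x̄ (sample statistic)
Standard deviation: σ (population parameter), s (sample statistic)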
In practice, the value of the parameter is unknown since we often are not able to examine the entire population.
We therefore use the information contained in sample statistics to estimate the unknown population parameter.
The sample mean, x̄, is an estimate of the population mean, μ.
A parameter vs. a statistic
A researcher was interested in estimating the mean income level for college graduates aged 25–30.
In order to do so, a random sample of 2000 college graduates was taken.
The mean income of the sample was x̄ = $45,455, which is a statistic.
The unknown population parameter, μ, would be the mean income of all college graduates aged 25–30.
It is unknown and is estimated by the sample mean.
The actual value of μ could only be obtained if you took the arithmetic average of all approximately 12 million college graduates aged 25–30.
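The distinction can be illustrated with a small simulation. This is only a sketch: the population below is synthetic (normally distributed incomes with an assumed mean and spread), and the function name is hypothetical. In a real study μ would not be computable.

```python
import random
import statistics

def sample_mean_vs_population_mean(population, n, seed=0):
    """Return (mu, x_bar): the population mean and the mean of one
    simple random sample of size n drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return statistics.fmean(population), statistics.fmean(sample)

# Synthetic population of incomes (assumed values, for illustration only).
pop_rng = random.Random(42)
population = [pop_rng.gauss(45_000, 12_000) for _ in range(100_000)]

mu, x_bar = sample_mean_vs_population_mean(population, n=2000)
print(f"parameter mu = {mu:,.0f}   statistic x_bar = {x_bar:,.0f}")
```

Changing the seed gives a different x̄ each time, while μ stays fixed.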
The Central Limit Theorem
Say you take a random sample of size n from a population with mean μ and standard deviation σ.
When the sample size n is sufficiently large, the sampling distribution of the sample mean, x̄, will be approximately normally distributed with an expected value of μ and a standard deviation of σ/√n, where μ and σ are the mean and the standard deviation of the population.
The larger the sample size, the closer the sampling distribution of the sample mean will be to the normal distribution (and the smaller the variance of the sample mean will be).
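A quick simulation can make the theorem concrete. The sketch below assumes an exponential population with rate 1 (so μ = 1 and σ = 1), chosen only because it is clearly skewed. The function name and the number of repetitions are illustrative choices.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps=5000, seed=1):
    """Draw `reps` samples of size n from an exponential(1) population
    and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

results = {}
for n in (5, 30, 100):
    means = simulate_sample_means(n)
    # Compare the observed spread of the sample means with sigma / sqrt(n).
    results[n] = (statistics.fmean(means), statistics.stdev(means), 1 / math.sqrt(n))
    print(f"n={n:3d}  mean of means={results[n][0]:.3f}  "
          f"sd of means={results[n][1]:.3f}  sigma/sqrt(n)={results[n][2]:.3f}")
```

The mean of the sample means should stay close to μ = 1, and their standard deviation should track σ/√n as n grows.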
Sampling Methods
A sample is a portion of the population that is selected for doing the data analysis.
The results from this sample are then used to estimate the characteristics of the population.
A frame is defined as a listing of items that define a population. Samples are drawn from the frames.
Probability and nonprobability sampling
Broadly, samples can be classified into two categories – probability samples and nonprobability samples.
In probability sampling, the items for the sample are selected based on known probabilities.
The common methods used in probability sampling are simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
Common nonprobability sampling techniques include convenience sampling and judgment sampling.
Simple random sampling
The frame is a list of all the units, items, people, etc., that define the population to be studied.
In simple random sampling, every item from a frame has the same chance for selection as every other item.
Suppose N represents the size of the frame or the number of items in the frame. For selecting a sample of size n, the probability of selecting the first member of the sample is 1/N.
Samples can be chosen with replacement or without replacement.
If sampling with replacement is used, the probability of selecting any given member on each draw is 1/N.
If sampling without replacement is used, the probability of selecting any remaining member on the second draw is 1/(N−1), and so on.
The process is repeated until the desired sample size is selected.
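Both variants can be sketched with Python's standard `random` module. The helper name `simple_random_sample` and the frame of 1000 numbered items are assumptions made for illustration.

```python
import random

def simple_random_sample(frame, n, replace=False, seed=None):
    """Select n items from the frame, each item equally likely."""
    rng = random.Random(seed)
    if replace:
        # With replacement: every draw picks any item with probability 1/N,
        # so the same item may appear more than once.
        return rng.choices(frame, k=n)
    # Without replacement: an item that was drawn cannot be drawn again.
    return rng.sample(frame, n)

frame = list(range(1, 1001))  # hypothetical frame of N = 1000 items
without = simple_random_sample(frame, 50, seed=7)
with_repl = simple_random_sample(frame, 50, replace=True, seed=7)
print(len(set(without)), "distinct items without replacement")
print(len(set(with_repl)), "distinct items with replacement")
```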
Systematic Sampling
In systematic sampling, for selecting a sample of size n, the N items from the frame are partitioned into n groups. Each group has k items, where k=N/n, rounded to the nearest integer.
The first item for the sample is randomly selected from the first set of k items in the frame. After the first selection, the remaining n−1 items are selected by taking every kth item from the frame.
From a population of 1000 students, if a sample of size 50 is to be selected, then k=1000/50=20. The first item is selected at random from the first 20 students. Suppose student 13 is selected as the first item.
The subsequent selections will be every 20th student after the first selection, i.e., 33, 53, 73, ..., 953, 973, and 993.
Selection bias may occur as a result of systematic sampling if there is a pattern in the input frame.
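The student example can be reproduced with a short sketch. The function name `systematic_sample` and its `start` argument are hypothetical. Passing `start=13` fixes the first selection to match the example instead of drawing it at random.

```python
import random

def systematic_sample(frame, n, start=None, seed=None):
    """Pick one item at random from the first k items, then every k-th item."""
    N = len(frame)
    k = round(N / n)  # sampling interval
    if start is None:
        # 1-based position chosen at random within the first k items
        start = random.Random(seed).randint(1, k)
    positions = range(start, N + 1, k)
    # If N/n was rounded, fewer or more than n positions may result;
    # the slice keeps at most n of them.
    return [frame[p - 1] for p in positions][:n]

students = list(range(1, 1001))            # students numbered 1..1000
picked = systematic_sample(students, 50, start=13)
print(picked[:4], "...", picked[-3:])      # [13, 33, 53, 73] ... [953, 973, 993]
```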
Stratified Sampling
In stratified sampling, the N items from the frame are subdivided into separate subgroups based on some common characteristic, e.g., gender, race, year of school, etc.
The subgroups are known as strata.
Simple random samples are selected from each stratum and combined for the desired sample of size n. The number of items selected from each stratum is proportional to the size of that stratum relative to the entire frame.
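Proportional allocation can be sketched as follows. The frame of 1000 students and its four strata sizes are made up for illustration, and `stratified_sample` is a hypothetical helper. Rounding each stratum's share can make the total differ slightly from n in general.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(frame, key, n, seed=None):
    """Proportionally allocated stratified sample; key(item) gives the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in frame:
        strata[key(item)].append(item)
    N = len(frame)
    sample = []
    for members in strata.values():
        n_h = round(n * len(members) / N)        # this stratum's share of n
        sample.extend(rng.sample(members, n_h))  # simple random sample within stratum
    return sample

# Hypothetical frame: (student id, year of school)
sizes = {"freshman": 400, "sophomore": 300, "junior": 200, "senior": 100}
frame = []
for year, size in sizes.items():
    frame.extend((len(frame) + i, year) for i in range(size))

sample = stratified_sample(frame, key=lambda s: s[1], n=50, seed=3)
counts = Counter(year for _, year in sample)
print(counts)
```

With these sizes the strata contribute 20, 15, 10, and 5 students, matching their 40%, 30%, 20%, and 10% shares of the frame.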
https://en.wikipedia.org/wiki/Stratified_sampling
Systematic sampling is to be applied only if the given population is logically homogeneous, because systematic sample units are uniformly distributed over the population. The researcher must ensure that the chosen sampling interval does not hide a pattern. Any pattern would threaten randomness.
Cluster Sampling
In cluster sampling, the population is divided into groups called clusters. Each cluster should mirror the entire population. The clusters should be mutually exclusive and collectively exhaustive.
A random sample of these groups (or clusters) is then selected.
The sample is made up of all the members of the selected clusters, a process termed one-stage cluster sampling.
In two-stage cluster sampling, members are sampled from each selected cluster using simple random sampling or systematic random sampling.
https://en.wikipedia.org/wiki/Cluster_sampling
Cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population.
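One-stage and two-stage cluster sampling differ only in what happens inside the selected clusters. A minimal sketch, assuming the clusters are given as a dictionary (the school and student names are invented, and `cluster_sample` is a hypothetical helper):

```python
import random

def cluster_sample(clusters, m, per_cluster=None, seed=None):
    """Randomly select m clusters.
    One-stage (per_cluster is None): keep every member of each selected cluster.
    Two-stage: draw a simple random sample of per_cluster members from each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)
    sample = []
    for cid in chosen:
        members = clusters[cid]
        if per_cluster is None:
            sample.extend(members)
        else:
            sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen, sample

# Hypothetical population: 20 schools (clusters) of 30 students each.
clusters = {
    f"school_{c}": [f"school_{c}_student_{i}" for i in range(30)]
    for c in range(20)
}
chosen, one_stage = cluster_sample(clusters, m=4, seed=5)
_, two_stage = cluster_sample(clusters, m=4, per_cluster=10, seed=5)
print(chosen, len(one_stage), len(two_stage))
```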
Errors
Even with probability sampling methods, surveys based on samples of the population are subject to errors.
The errors are classified under the following types:
coverage errors
occur when certain groups of items are excluded from the frame
nonresponse errors
occur when data is not collected from all items in the sample
sampling errors
In statistics, sampling error is the error caused by observing a sample instead of the whole population. When samples are used for estimating the characteristic of the population, the estimate changes from sample to sample and these variations are reflected through the sampling error.
measurement errors
Weaknesses in the survey questions lead to measurement error. A second source of measurement error is the Hawthorne effect, in which respondents change their behavior or answers because they know they are being observed, for example to please the interviewer.
Noise
Also called random error or statistical uncertainty.
Data Dredging
Also referred to as data fishing or data snooping. It is the use of data mining or statistical methods to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
https://en.wikipedia.org/wiki/Hawthorne_effect
https://en.wikipedia.org/wiki/Data_dredging
Statistical Resampling
A data sample is used to estimate the population parameter. The problem is that we only have a single estimate of the population parameter, with little idea of the variability or uncertainty in the estimate.
One way to address this is by estimating the population parameter multiple times from our data sample. Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. The result can be both a more accurate estimate of the parameter (such as taking the mean of the estimates) and a quantification of the uncertainty of the estimate (such as adding a confidence interval).
Two commonly used resampling methods:
Bootstrap. Samples are drawn from the dataset with replacement (allowing the same observation to appear more than once in a resample), where those instances not drawn into the resample may be used for the test set.
k-fold Cross-Validation. A dataset is partitioned into k groups, where each group is given the opportunity of being used as a held-out test set, leaving the remaining groups as the training set.
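Both methods can be sketched with the standard library alone. The bootstrap part below uses the percentile method for a confidence interval of the mean, which is one common choice and not the only one. The data are synthetic, and the helper names `bootstrap_ci` and `k_fold_indices` are hypothetical.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.fmean, reps=2000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has size n and is drawn with replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(reps))
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return lower, upper

def k_fold_indices(n, k, seed=None):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

# Synthetic data sample (assumed values, for illustration only).
data_rng = random.Random(0)
data = [data_rng.gauss(50, 10) for _ in range(100)]

lower, upper = bootstrap_ci(data, seed=1)
print(f"sample mean = {statistics.fmean(data):.2f}, "
      f"95% bootstrap CI = ({lower:.2f}, {upper:.2f})")

for train, test in k_fold_indices(n=10, k=5, seed=2):
    print("train:", sorted(train), "test:", sorted(test))
```

The interval width quantifies the uncertainty in the estimate, and every index appears in exactly one test fold.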
https://machinelearningmastery.com/statistical-sampling-and-resampling/