HD EDUCATION
TUTOR: Natsu
Population: all the data
Sample: a subset of the population
Modules 1: Exploring Data
Main content:
– Introduction to R
– Design of experiment
– Controlled experiment
– Observational experiment
– Qualitative data
– Quantitative data
– Graphical summary: which type of graph suits which type of data; visualisation.
– Numerical summary: main features of the data
– Basic R commands:
– For 1 variable (vector)
– For 2+ variables
Design of experiment:
Domain knowledge – background context that aids in understanding and interpreting the data
Controlled experiments:
– Treatment group
– Control group
– Contemporaneous control group: occur at the same time as treatment group
– Historical control group: occur earlier than treatment group
Data is collected by comparing the responses of a control group and treatment group
Potential errors can be produced by:
Confounding: factors which affect the responses of subjects and data collected, compromising the
ability to make conclusions from the experiment -> cause confusion in interpretation
Bias: affects the ability of the data to accurately measure the treatment effect
Bias -> Solution:

Selection bias: the 2 groups are not randomly allocated, so a confounder can create differences between the 2 groups.
-> Solution: randomised controlled trial / experiment (RCT): allocate subjects randomly.

Observer bias: subjects or investigators are aware of the identity of the 2 groups, which biases either the responses or the evaluations.
-> Solution: perform a randomised controlled double-blind trial, where both the subjects ("single blind") and the investigators ("double blind") are unaware of the identity of the 2 groups.
Placebo: pretend treatment.
Placebo effect: an effect which occurs from the subject thinking they have had the treatment.

Consent bias: subjects choose whether or not they take part in the experiment.
-> Solution: withhold treatment for those in the control group or enforce treatment for those in the treatment group -> ethical issue.
Best practice for a controlled experiment:
– Random allocation
– Double-blind design
– Placebo given
Limitation:
– The use of retrospective databases with insufficient information
– The lack of sufficient sample size to determine whether an effect on depression exists
– Lack of placebo controls and randomisation
– The lack of specific standardized assessments of depression and other behaviors
– Improvement:
– Large placebo controlled trials assessing the effects of isotretinoin on depression would be a scientific advance.
Observational study:
When the investigator cannot use randomisation to allocate subjects to groups (and cannot intervene on the subjects)
1. Observational studies cannot establish causation:
– Association may point to causation
– But association does not prove causation
– Example: higher rate of liver cancer among smokers
2. Observational studies can have misleading hidden confounders
Confounding occurs when the seeming effect of the treatment is also caused by other factors.
– Selection bias occurs when some subjects are more likely to be chosen for the study than others.
– Survivor bias occurs when only certain types of subjects survive to finish the study.
– Adherer bias occurs when certain types of subjects keep taking the treatment (or placebo).
– Non-adherer bias: certain types of subjects stop taking the treatment (the opposite of adherer bias).
Strategy for dealing with confounders:
Make groups more comparable by dividing them into subgroups with respect to confounder.
3. Simpson's paradox:
Sometimes there is a clear trend in individual groups of data that disappears when the groups are pooled
Occurs when relationships between percentages in subgroups are reversed when the subgroups are combined
4. Observational studies can result from using a historical control.
In this case, time is a confounding variable.
– Controlled experiments need to be performed in the same time period (contemporaneously controlled group)
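Simpson's paradox can be reproduced with a small hypothetical example in R (the admission numbers below are invented purely for illustration):

```r
# Hypothetical admissions for two faculties (invented numbers):
# within each faculty women have the higher admission rate,
# but pooling the faculties reverses the comparison.
men_admit   <- c(80, 1);  men_total   <- c(100, 20)
women_admit <- c(9, 10);  women_total <- c(10, 100)

men_rate   <- men_admit / men_total      # 0.80, 0.05 per faculty
women_rate <- women_admit / women_total  # 0.90, 0.10 per faculty

men_pooled   <- sum(men_admit) / sum(men_total)      # 81/120 = 0.675
women_pooled <- sum(women_admit) / sum(women_total)  # 19/110 ~ 0.173
```

Within each subgroup women do better, yet men do better overall, because the two groups apply to the two faculties in very different proportions.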
Initial data analysis (IDA):
First general look at the data, without formally answering the research questions.
-> purpose: ensure that the later statistical analysis can be performed efficiently, and minimise the risk of incorrect or misleading results
-> involve:
– Data background: checking the quality and integrity of the data
– Data structure: what information has been collected?
– Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
– Data summary: graphical & numerical
Structure:
– Dimension: data with p variables has dimension p (based on the number of variables)
– In R: dim(data) returns [rows, columns]
– str(data) returns each variable's type
Qualitative Data & quantitative data
Graphical summary:
Data type -> Graph:
– 1 qualitative -> simple bar plot
– 2 qualitative -> double bar plot
– 1 quantitative -> histogram
– 2 quantitative -> scatter plot
– 1 quantitative + 1 qualitative -> comparative box plot
Histogram:
highlight the percentage of data in one class interval compared to another
– Total area = 100% = 1
– Horizontal scale = class interval
– Area of each block = percentage in that class interval
– Height of block = crowding
Usually use density scale:
Height of each block = percentage in the block / class interval= percentage per horizontal unit
For continuous data, use endpoint convention:
– Left-closed or Left-open with Right-closed or Right-open
[18, 21) => left-closed & right-open
In hist(): freq=F produces the histogram on the density scale;
right=F makes the intervals right-open, i.e. left-closed [a, b).
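A minimal sketch of these options in R; the ages vector is simulated data assumed purely for illustration:

```r
set.seed(1)
ages <- runif(200, 18, 30)            # simulated data for illustration
# freq=F gives the density scale; right=F makes intervals left-closed, right-open [a, b)
h <- hist(ages, breaks = seq(18, 30, by = 3), freq = FALSE, right = FALSE)
# On the density scale the total area of the blocks is 1 (i.e. 100%)
sum(h$density * diff(h$breaks))       # 1
```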
Wrong statements:
– Treating block height as the percentage: block area, not height, equals the percentage
– Using too many class intervals: usually 10-15 subintervals work well
Box plot: useful for comparing multiple datasets [more on later]
Scatter plot: plots the possible relationship between 2 quantitative variables [commonly using x and y]
– If scatter plot looks random, there does not appear to be a relationship between x and y
Numerical summary:
Advantage of numerical summaries:
They reduce all the data to 1 simple number, allowing easy communication and comparison,
but they lose a lot of information.
Main features:
– Centre (mean, median)
– Spread (standard deviation, range, IQR)
Mean: the average of all data points.
In R: mean(data)
Median: the middle data point.
– Odd-sized data: median = middle data point
– Even-sized data: median = average of the 2 middle points
In R: median(data)
Interquartile range (IQR) = Q3 - Q1
Outlier thresholds: lower = Q1 - 1.5*IQR, upper = Q3 + 1.5*IQR
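These summaries and the outlier thresholds can be computed in R; the toy vector below is an assumption for illustration:

```r
data <- c(2, 4, 4, 5, 7, 9, 30)          # toy data with one large outlier
mean(data)                                # centre: mean (pulled up by the outlier)
median(data)                              # centre: median (robust), here 5
q <- quantile(data, c(0.25, 0.75))        # Q1 = 4, Q3 = 8
iqr <- IQR(data)                          # Q3 - Q1 = 4
lower <- q[1] - 1.5 * iqr                 # lower threshold: -2
upper <- q[2] + 1.5 * iqr                 # upper threshold: 14
data[data < lower | data > upper]         # flagged as an outlier: 30
```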
Skewed and symmetric distribution:
Robustness and Comparisons:
– Median is robust and a good summary for skewed data as it is not affected by outliers
Which is optimal for describing the centre?
– Mean and median have strengths and weaknesses depending on the data.
– As the median is robust, it is preferable for data which is skewed or has many outliers.
– The mean is helpful for data which is basically symmetric, with not too many outliers, and for theoretical analysis.
Root mean square (RMS) = sqrt(mean of (numbers^2))
Standard deviation in terms of RMS:
Reason for difference between popsd and sd:
Compared with the population, a sample has less data and is more likely to exclude extreme values, so sample values tend to cluster around the mean, and the standard deviation computed from the sample comes out smaller than the population's. To better estimate the population SD from a sample, the formula divides by n-1 instead of n; this enlarges the sample SD and compensates for the difference between the sample and the population.
Rule of thumb (normal model):
IQR is another measure of spread.
IQR = range of the middle 50% of data = Q3-Q1,
– Q1 is the 25th percentile and Q3 is the 75th percentile
– Q2 = 50th percentile = median
In R: IQR(data), quantile(data), summary(data)
Report in pairs: (mean, SD) or (median, IQR)
Coefficient of variation (CV):
Combines the mean and standard deviation into 1 summary:
CV = standard deviation / mean
The higher the CV, the more spread out the data.
Modules 2: Modelling Data
– Normal model
Calculate values in R
When to use Normal model
– Linear Regression
Correlation coefficient
Prediction
– Non-linear models
Normal Model:
Normal curve: can be seen from histogram.
Two features: – Fairly symmetric – Bell-shaped
Two types:
1. Standard normal curve (Z): mean = 0, SD = 1
2. General normal curve (X):
Mean = any number
SD = any number
The area under normal curve ≈ area under histogram ≈ the percentage compared with total
≈ the probability of < value or > value
Standard normal curve, e.g. for the value 0.8:
– Lower tail: pnorm(0.8)
– Upper tail: pnorm(0.8, lower.tail=F)
– Interval (0.3, 0.8): pnorm(0.8) - pnorm(0.3)
General normal curve:
pnorm(value, mean, sd)
Special property:
1. The 68-95-99.7 rule
2. Convert general normal to standard normal:
Z-score = (x-mean)/SD
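Both routes give the same probability; the mean 170 and SD 8 below are assumed values for illustration:

```r
# Suppose X ~ Normal(mean = 170, sd = 8). Find P(X < 180).
p_direct <- pnorm(180, mean = 170, sd = 8)   # general normal curve directly
z <- (180 - 170) / 8                          # z-score = 1.25
p_z <- pnorm(z)                               # standard normal curve
c(p_direct, p_z)                              # identical, ~0.894
```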
WHEN to use the normal model:
1. A normal curve can be seen from the histogram
2. The data satisfies the rule of thumb (68% / 95% / 99.7%)
3. The QQ plot looks like a straight line
4. Shapiro-Wilk test:
– If p-value < 0.05, the normal curve does not fit
– If p-value > 0.05, the normal curve can be used
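A sketch of checks 3 and 4 in R, on simulated data (the data-generating values are an assumption for illustration):

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 5)  # simulated data, drawn from a normal model
qqnorm(x); qqline(x)                # check 3: points should lie near the line
sh <- shapiro.test(x)               # check 4: Shapiro-Wilk test
sh$p.value                          # compare against 0.05
```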
Measurement error:
1. Chance error:
– Replicate the measurement under the same conditions -> estimate: calculate the sample SD
=> smaller SD means the measurements are more consistent, with less chance error
Individual measurement = exact value + chance error
2. Outliers [extreme measurements]: x < mean - 3SD or x > mean + 3SD. Mean and SD may be strongly impacted by
outliers => the histogram will not follow the normal curve well.
[In a box plot, outliers appear as empty circles beyond the lower/upper thresholds]
3. Bias: systematic error (a constant value added to each measurement; can be deliberate or accidental)
– Cannot be estimated by replicating the measurements
Individual measurement = exact value + chance error + bias
Increase precision -> reduce chance error
Increase trueness -> reduce bias
Linear model:
Scatter plot: plot the bivariate variable
Bivariate data: involves a pair of variables. We are interested in the relationship between x and y.
-x: independent variable / explanatory variable / predictor / regressor
-y: dependent variable / response variable
Linear association: describe how tightly the points cluster around a line.
-If points are tightly clustered around a line -> strong association
– May have positive association / negative association
Summaries of a scatter plot:
Correlation coefficient:
Correlation coefficient r is a numerical summary which measures the clustering around the line.
It indicates both the sign and strength of the linear association.
The correlation coefficient is between -1 and 1.
If r is positive: the cloud slopes up.
If r is negative: the cloud slopes down.
As r gets closer to ±1: the points cluster more tightly around the line.
Wrong statements
r= 0.8 means that 80% of the points are tightly clustered around the line => FALSE
r=0.8 means that the points are twice as tightly clustered as r =0.4 => FALSE
Sample vs population correlation: in R, cor(x, y) uses sample SDs, but computing r with population SDs gives exactly the same result, because the n-1 factors in the covariance and the SDs cancel.
Properties of correlation coefficient:
– When r = ±1, all the points lie on a line (no cloud, perfect correlation)
-The correlation coefficient is not affected by interchanging the x and y variables
=> cor(x,y)=cor(y,x)
-the correlation coefficient is shift and scale invariant.
1. Outliers can overly influence the correlation coefficient
2. Nonlinear association cannot be detected by correlation coefficient (may be a non-linear model)
3. The same correlation coefficient can arise from very different data
4. Correlations based on rates or averages tend to inflate the correlation coefficient
5. Association is not causation
6. Small SDs can make the correlation look bigger
SD line:
Connects the two points (mean x, mean y) and (mean x + SD x, mean y + SD y).
-The SD line goes through the point of averages.
-A father-son pair where both are 0.5 SDs above the mean would lie on the SD line.
As the SD line does not use r:
-It is insensitive to how tightly the points are clustered -> a better line is needed for prediction.
-At the extremes, with positive/negative correlation, the SD line will over-estimate on the RHS/LHS and under-estimate
on the LHS/RHS.
Regression Line:
Connect: (mean x, mean y) & (mean x + SD x, mean y + r*SD y)
For prediction, the Regression line is better than the SD line as it uses all 5
numerical summaries for the scatter plot. It is a smoothed version of the graph of averages.
The regression effect is observed in test- retest situations, leading to the regression fallacy.
Prediction:
1. Baseline prediction: use the mean y over all x values
2. Vertical strip: use the mean_y on that x value
3. Use Regression line: use line formula from R
4. Normal approximation
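Prediction methods 1 and 3 above can be sketched in R with lm() and predict(); the data below are assumed purely for illustration:

```r
x <- c(1, 2, 3, 4, 5, 6)                       # toy predictor values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)         # toy responses
fit <- lm(y ~ x)                               # fit the regression line
coef(fit)                                      # intercept and slope
predict(fit, newdata = data.frame(x = 4.5))    # 3. regression-line prediction
mean(y)                                        # 1. baseline prediction (ignores x)
```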
1. Extrapolating: a prediction from an x value outside the range of the dataset is completely unreliable.
2. Not checking the scatter plot: always check the scatter plot first to see if a linear association exists.
Reason: we may have a high correlation coefficient and then fit a regression line, but the data may not be linear.
Residuals:
The vertical distance (or 'gap') of a point above or below the regression line.
– Represents the prediction error of the regression line.
In R: res = y - fit$fitted.values (equivalently residuals(fit))
RMS error:
– The root mean square of the residuals => the typical gap between the points and the regression line
– Like a "standard deviation for the line" => measures how accurately the line predicts
In R: RMS error = sqrt(mean(res^2))
For baseline prediction: RMS error = SDy
– Since in baseline prediction, we use mean y for every value of x
– The SD of the line = SD of y
For regression-line prediction: RMS error = sqrt(1 - r^2) * SDy
Special cases:
r=±1, RMS error= 0 =>perfect correlation
r = 0, RMS error= SDy => the regression line is no help in predicting y
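A numerical check of these facts on assumed toy data: the RMS of the residuals equals sqrt(1 - r^2) times the (population) SD of y.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)                  # toy data for illustration
fit <- lm(y ~ x)
res <- y - fit$fitted.values           # residuals (same as residuals(fit))
rms <- sqrt(mean(res^2))               # RMS error of the regression line
r <- cor(x, y)
sdy_pop <- sqrt(mean((y - mean(y))^2)) # population SD of y
sqrt(1 - r^2) * sdy_pop                # equals rms
```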
Residual plot:
Plot residuals vs x
The residual plot is a diagnostic for seeing whether a linear model was appropriate.
If it is random (no pattern), then linear model seems appropriate for the data.
Summary of Linear and Non- Linear models:
1. Simple linear model (straight line)
2. Quadratic model
3. Cubic model
4. Exponential model
5. Von richter
6. Other models
-Exponential growth
-Exponential decay
– Logistic growth
Module 3 – Sampling Data
-Understand chance
Probability
-Chance variability
Model chance variability by box model
-Sample survey
Model chance variability in sample surveys
Basic properties of chance:
Chances are between 0% (impossible) and 100% (certain)
P(A)=1 : certain
P(A)=0 :impossible
Complement event: the chance that a certain outcome does not occur
P(event) = 1 – P(complement event)
Drawing at random: each object in a collection has the same chance of being picked
Prosecutor’s fallacy: A mistake in statistical thinking, whereby it is assumed that the probability of a random match is
equal to the probability that the defendant is innocent.
Conditional Probability:
The probability of two events both occurring is multiplicative:
P(Event 1 and Event 2) = P(Event 1) * P(Event 2|Event 1)
Independence and Dependence (Multiplication Rule):
Independent events:
P(Event 2 | Event 1) = P(Event 2)
– Occurs in probability experiments with replacement (unconditional probability)
– Drawing randomly with replacement ensures independence
The probability of both events happening:
P(Event 1 and Event 2) = P(Event 1) * P(Event 2)
Dependent events:
P(Event 2 | Event 1) != P(Event 2)
– Occurs in probability experiments without replacement (conditional probability)
– Drawing without replacement ensures dependence
The probability of both events happening:
P(Event 1 and Event 2) = P(Event 1) * P(Event 2 | Event 1)
Mutually Exclusive and Non-mutually Exclusive events (Addition Rule):
Mutually Exclusive events: P(Event 1 or Event 2) = P(Event 1) + P(Event 2)
Non-mutually Exclusive events: P(Event 1 or Event 2) = P(Event 1) + P(Event 2) – P(Event 1 & 2)
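The rules above, sketched with a standard deck of cards (the card example is an assumption, not from the notes):

```r
# Multiplication rule: P(two aces in two draws)
p_with    <- (4/52) * (4/52)   # with replacement: independent draws
p_without <- (4/52) * (3/51)   # without replacement: dependent draws
# Addition rule: P(ace or heart) in one draw (not mutually exclusive)
p_ace_or_heart <- 4/52 + 13/52 - 1/52   # subtract P(ace of hearts) once
c(p_with, p_without, p_ace_or_heart)
```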
Methods of Calculating Probability:
1. Making Lists:
Summarise the outcomes or calculate probabilities for equally likely events
a. Write a list of all outcomes
b. Count favorable outcomes
2. Tree Diagram
3. Simulate in R
Provides experimental probability
Experimental probability will approach theoretical probability as number of repetitions approaches infinity
Code: sample(1:6, 1000, rep=T)
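Running the simulation and comparing the experimental probability with the theoretical one:

```r
set.seed(1)                                 # for reproducibility
rolls <- sample(1:6, 1000, replace = TRUE)  # 1000 rolls of a fair die
mean(rolls == 6)                            # experimental probability of a six
1/6                                         # theoretical probability it approaches
```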
Binomial model:
Permutations (order matters):
With replacement:
Choosing r of something that has n different types: n^r ways
Without replacement:
Choosing r of something that has n different types:
P(n, r) = nPr = n! / (n - r)!
Combinations (order doesn't matter):
Without replacement:
Choosing r of something that has n different types:
C(n, r) = nCr = n! / (r! (n - r)!)
Binomial coefficients:
C(n, r) = nCr = n! / (r! (n - r)!)
Binomial model:
Binary trials: only 2 outcomes can occur:
P(event) = p
P(not event) = 1 - p
Binomial theorem: n independent binary trials, with P(event) = p at every trial, n fixed
In R: dbinom(k, n, p) => probability that the event occurs k times in n trials, each with probability p
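For example, the chance of exactly 2 sixes in 5 rolls of a fair die, checked against the binomial formula:

```r
# dbinom(k, n, p): P(exactly k events in n binary trials)
dbinom(2, size = 5, prob = 1/6)
# agrees with the formula C(n, k) * p^k * (1 - p)^(n - k)
choose(5, 2) * (1/6)^2 * (5/6)^3
# cumulative version: P(at most 2 sixes)
pbinom(2, size = 5, prob = 1/6)
```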
Chance variability:
Chance process: observed = expected + chance error
As the number of trials increases:
-> the absolute size of the chance error increases
-> the relative (percentage) size of the chance error decreases
The Law of Large Numbers (Law of Averages):
The proportion of heads becomes more stable as the length of the simulation increases and approaches a fixed number
called the relative frequency.
Assumes that in each trial there is the same chance of each outcome occurring (with replacement)
As the number of trials increases, the proportion of the event will converge to the theoretical or expected proportion
The Box (Population) Model:
Describes the chance of generating a number
Provides information on:
Number and type of all outcomes
Proportion of each outcome
Number of trials
For the sum of random draws from a box model with replacement:
Expected value of sum (EV sum) = number of draws × mean of box
Square Root Law: Standard error of the sum (SE sum) = √(number of draws) × SD of box
As the box represents a population, SD = SDpop
SD of the box:
1. Use the formula: SD = RMS of the gaps from the mean = sqrt(mean of squared gaps)
2. In R: popsd()
3. For a binary box (only two quantitative outcomes a and b): SD = |a - b| * sqrt(p(1 - p)), where p is the proportion of a in the box.
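The box-model formulas can be sketched in R without extra packages (the box contents and number of draws below are assumptions; popsd() is replaced by an explicit population-SD computation):

```r
box <- c(0, 1, 1, 4)                        # contents of the box (assumed)
n_draws <- 25                               # draws with replacement
mean_box <- mean(box)                       # 1.5
sd_box <- sqrt(mean((box - mean(box))^2))   # population SD of the box = 1.5
ev_sum <- n_draws * mean_box                # EV of sum = 25 * 1.5 = 37.5
se_sum <- sqrt(n_draws) * sd_box            # square root law: 5 * 1.5 = 7.5
c(ev_sum, se_sum)
```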