HD EDUCATION
TUTOR: Natsu
Population: all the data
Sample: a subset of the population
Modules 1: Exploring Data
Main content:
– Introduction to R
– Design of experiment
– Controlled experiment
– Observational experiment
– Qualitative data
– Quantitative data
– Graphical summary: which type of graph suits which type of data; visualisation.
– Numerical summary: main features of the data
– Basic R commands:
– For 1 variable (vector)
– For 2+ variables
Design of experiment:
Domain knowledge – background context that aids in understanding and interpreting the data
Controlled experiments:
– Treatment group
– Control group
– Contemporaneous control group: occur at the same time as treatment group
– Historical control group: occur earlier than treatment group
Data is collected by comparing the responses of a control group and treatment group
Potential errors can be produced by:
Confounding: factors which affect the responses of subjects and data collected, compromising the
ability to make conclusions from the experiment -> cause confusion in interpretation
Bias: affects the ability of the data to accurately measure the treatment effect
Bias -> Solution:

Selection bias: the 2 groups are not randomly allocated, so a confounder can create differences between the 2 groups.
-> Solution: randomised controlled trial / experiment (RCT): allocate subjects randomly.

Observer bias: subjects or investigators are aware of the identity of the 2 groups, which biases either the responses or the evaluations.
-> Solution: perform a randomised controlled double-blind trial, where both the subjects ("single blind") and the investigators ("double blind") are unaware of the identity of the 2 groups.
Placebo: pretend treatment.
Placebo effect: an effect which occurs from the subject thinking they have had the treatment.

Consent bias: subjects choose whether or not they take part in the experiment.
-> Solution: withhold treatment for those in the control group or enforce treatment for those in the treatment group -> ethical issue.
Best practice for a controlled experiment:
– Random allocation
– Double-blind design
– Placebo given
Limitation:
– The use of retrospective databases with insufficient information
– The lack of sufficient sample size to determine whether an effect on depression exists
– Lack of placebo controls and randomisation
– The lack of specific standardized assessments of depression and other behaviors
– Improvement:
– Large placebo controlled trials assessing the effects of isotretinoin on depression would be a scientific advance.
Observational study:
When the investigator cannot use randomisation to allocate subjects to groups (and cannot intervene on the subjects)
1. Observational studies cannot establish causation:
– Association may point to causation
– But association does not prove causation
– Example: higher rate of liver cancer among smokers
2. Observational studies can have misleading hidden confounders
Confounding occurs when the seeming effect of the treatment is also caused by other factors.
– Selection bias occurs when some subjects are more likely to be chosen for the study than others.
– Survivor bias occurs when only certain types of subjects survive to finish the study.
– Adherer bias occurs when certain types of subjects keep taking the treatment (or placebo).
– Non-adherer bias: certain types of subjects stop taking the treatment (the opposite of adherer bias).
Strategy for dealing with confounders:
Make groups more comparable by dividing them into subgroups with respect to confounder.
3. Simpson's paradox:
Sometimes there is a clear trend in individual groups of data that disappears when the groups are pooled
Occurs when relationships between percentages in subgroups are reversed when the subgroups are combined
4. Observational studies can result from using a historical control.
In this case, time is a confounding variable.
– Controlled experiments need to be performed in the same time period (contemporaneously controlled group)
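Simpson's paradox can be reproduced with a small hypothetical example in R (the admission numbers below are invented purely for illustration):

```r
# Hypothetical admissions for two faculties (invented numbers):
# within each faculty women have the higher admission rate,
# but pooling the faculties reverses the comparison.
men_admit   <- c(80, 1);  men_total   <- c(100, 20)
women_admit <- c(9, 10);  women_total <- c(10, 100)

men_rate   <- men_admit / men_total      # 0.80, 0.05 per faculty
women_rate <- women_admit / women_total  # 0.90, 0.10 per faculty

men_pooled   <- sum(men_admit) / sum(men_total)      # 81/120 = 0.675
women_pooled <- sum(women_admit) / sum(women_total)  # 19/110 ~ 0.173
```

Within each subgroup women do better, yet men do better overall, because the two groups apply to the two faculties in very different proportions.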
Initial data analysis (IDA):
First general look at the data, without formally answering the research questions.
-> purpose: ensure that the later statistical analysis can be performed efficiently, and minimise the risk of incorrect or misleading results
-> involve:
– Data background: checking the quality and integrity of the data
– Data structure: what information has been collected?
– Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
– Data summary: graphical & numerical
Structure:
– Dimension: data with p variables has dimension p (based on the number of variables)
– In R: dim(data) returns [rows, columns]
– str(data) returns each variable's type
Qualitative Data & quantitative data
Graphical summary:
Data type -> Graph:
– 1 qualitative -> simple bar plot
– 2 qualitative -> double bar plot
– 1 quantitative -> histogram
– 2 quantitative -> scatter plot
– 1 quantitative + 1 qualitative -> comparative box plot
Histogram:
highlight the percentage of data in one class interval compared to another
– Total area = 100% = 1
– Horizontal scale = class interval
– Area of each block = percentage in that class interval
– Height of block = crowding
Usually use density scale:
Height of each block = percentage in the block / class interval= percentage per horizontal unit
For continuous data, use endpoint convention:
– Left-closed or Left-open with Right-closed or Right-open
[18, 21) => left-closed & right-open
In hist(): freq=F produces the histogram on the density scale;
right=F makes the intervals right-open, i.e. left-closed [a, b).
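A minimal sketch of these options in R; the ages vector is simulated data assumed purely for illustration:

```r
set.seed(1)
ages <- runif(200, 18, 30)            # simulated data for illustration
# freq=F gives the density scale; right=F makes intervals left-closed, right-open [a, b)
h <- hist(ages, breaks = seq(18, 30, by = 3), freq = FALSE, right = FALSE)
# On the density scale the total area of the blocks is 1 (i.e. 100%)
sum(h$density * diff(h$breaks))       # 1
```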
Wrong statements:
– Treating block height as the percentage: block area, not height, equals the percentage
– Using too many class intervals: usually 10-15 subintervals work well
Box plot: useful for comparing multiple datasets [more on later]
Scatter plot: plots the possible relationship between 2 quantitative variables [commonly using x and y]
– If scatter plot looks random, there does not appear to be a relationship between x and y
Numerical summary:
Advantage of numerical summaries:
They reduce all the data to 1 simple number, allowing easy communication and comparison,
but they lose a lot of information.
Main features:
– Centre (mean, median)
– Spread (standard deviation, range, IQR)
Mean: the average of all data points.
In R: mean(data)
Median: the middle data point.
– Odd-sized data: median = middle data point
– Even-sized data: median = average of the 2 middle points
In R: median(data)
Interquartile range (IQR) = Q3 - Q1
Outlier thresholds: lower = Q1 - 1.5*IQR, upper = Q3 + 1.5*IQR
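These summaries and the outlier thresholds can be computed in R; the toy vector below is an assumption for illustration:

```r
data <- c(2, 4, 4, 5, 7, 9, 30)          # toy data with one large outlier
mean(data)                                # centre: mean (pulled up by the outlier)
median(data)                              # centre: median (robust), here 5
q <- quantile(data, c(0.25, 0.75))        # Q1 = 4, Q3 = 8
iqr <- IQR(data)                          # Q3 - Q1 = 4
lower <- q[1] - 1.5 * iqr                 # lower threshold: -2
upper <- q[2] + 1.5 * iqr                 # upper threshold: 14
data[data < lower | data > upper]         # flagged as an outlier: 30
```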
Skewed and symmetric distribution:
Robustness and Comparisons:
– Median is robust and a good summary for skewed data as it is not affected by outliers
Which is optimal for describing the centre?
– Mean and median have strengths and weaknesses depending on the data.
– As the median is robust, it is preferable for data which is skewed or has many outliers.
– The mean is helpful for data which is basically symmetric, with not too many outliers, and for theoretical analysis.
Root mean square (RMS) = sqrt(mean of (numbers^2))
Standard deviation in terms of RMS:
Reason for difference between popsd and sd:
Compared with the population, a sample has less data and is more likely to exclude extreme values, so sample values tend to cluster around the mean, and the standard deviation computed from the sample comes out smaller than the population's. To better estimate the population SD from a sample, the formula divides by n-1 instead of n; this enlarges the sample SD and compensates for the difference between the sample and the population.
Rule of thumb (normal model):
IQR is another measure of spread.
IQR = range of the middle 50% of data = Q3-Q1,
– Q1 is the 25th percentile and Q3 is the 75th percentile
– Q2 = 50th percentile = median
In R: IQR(data), quantile(data), summary(data)
Report in pairs: (mean, SD) or (median, IQR)
Coefficient of variation (CV):
Combines the mean and standard deviation into 1 summary:
CV = standard deviation / mean
The higher the CV, the more spread out the data.
Modules 2: Modelling Data
– Normal model
Calculate values in R
When to use Normal model
– Linear Regression
Correlation coefficient
Prediction
– Non-linear models
Normal Model:
Normal curve: can be seen from histogram.
Two features: – Fairly symmetric – Bell-shaped
Two types:
1. Standard normal curve (Z): mean = 0, SD = 1
2. General normal curve (X):
Mean = any number
SD = any number
The area under normal curve ≈ area under histogram ≈ the percentage compared with total
≈ the probability of < value or > value
Standard normal curve, e.g. for the value 0.8:
– Lower tail: pnorm(0.8)
– Upper tail: pnorm(0.8, lower.tail=F)
– Interval (0.3, 0.8): pnorm(0.8) - pnorm(0.3)
General normal curve:
pnorm(value, mean, sd)
Special property:
1. The 68-95-99.7 rule
2. Convert general normal to standard normal:
Z-score = (x-mean)/SD
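Both routes give the same probability; the mean 170 and SD 8 below are assumed values for illustration:

```r
# Suppose X ~ Normal(mean = 170, sd = 8). Find P(X < 180).
p_direct <- pnorm(180, mean = 170, sd = 8)   # general normal curve directly
z <- (180 - 170) / 8                          # z-score = 1.25
p_z <- pnorm(z)                               # standard normal curve
c(p_direct, p_z)                              # identical, ~0.894
```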
WHEN to use the normal model:
1. A normal curve can be seen from the histogram
2. The data satisfies the rule of thumb (68% / 95% / 99.7%)
3. The QQ plot looks like a straight line
4. Shapiro-Wilk test:
– If p-value < 0.05, the normal curve does not fit
– If p-value > 0.05, the normal curve can be used
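A sketch of checks 3 and 4 in R, on simulated data (the data-generating values are an assumption for illustration):

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 5)  # simulated data, drawn from a normal model
qqnorm(x); qqline(x)                # check 3: points should lie near the line
sh <- shapiro.test(x)               # check 4: Shapiro-Wilk test
sh$p.value                          # compare against 0.05
```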
Measurement error:
1. Chance error:
– Replicate the measurement under the same conditions -> estimate: calculate the sample SD
=> smaller SD means the measurements are more consistent, with less chance error
Individual measurement = exact value + chance error
2. Outliers [extreme measurements]: x < mean - 3SD or x > mean + 3SD. Mean and SD may be strongly impacted by
outliers => the histogram will not follow the normal curve well.
[In a box plot, outliers appear as empty circles beyond the lower/upper thresholds]
3. Bias: systematic error (a constant value added to each measurement; can be deliberate or accidental)
– Cannot be estimated by replicating the measurements
Individual measurement = exact value + chance error + bias
Increase precision -> reduce chance error
Increase trueness -> reduce bias
Linear model:
Scatter plot: plot the bivariate variable
Bivariate data: involves a pair of variables. We are interested in the relationship between x and y.
-x: independent variable / explanatory variable / predictor / regressor
-y: dependent variable / response variable
Linear association: describe how tightly the points cluster around a line.
-If points are tightly clustered around a line -> strong association
– May have positive association / negative association
Summaries of a scatter plot:
Correlation coefficient:
Correlation coefficient r is a numerical summary which measures the clustering around the line.
It indicates both the sign and strength of the linear association.
The correlation coefficient is between -1 and 1.
If r is positive: the cloud slopes up.
If r is negative: the cloud slopes down.
As r gets closer to ±1: the points cluster more tightly around the line.
Wrong statements
r= 0.8 means that 80% of the points are tightly clustered around the line => FALSE
r=0.8 means that the points are twice as tightly clustered as r =0.4 => FALSE
Sample vs population correlation: in R, cor(x, y) uses sample SDs, but computing r with population SDs gives exactly the same result, because the n-1 factors in the covariance and the SDs cancel.
Properties of correlation coefficient:
– When r = ±1, all the points lie on a line (no cloud, perfect correlation)
-The correlation coefficient is not affected by interchanging the x and y variables
=> cor(x,y)=cor(y,x)
-the correlation coefficient is shift and scale invariant.
1. Outliers can overly influence the correlation coefficient
2. Nonlinear association cannot be detected by correlation coefficient (may be a non-linear model)
3. The same correlation coefficient can arise from very different data
4. Correlations based on rates or averages tend to inflate the correlation coefficient
5. Association is not causation
6. Small SDs can make the correlation look bigger
SD line:
Connects the two points (mean x, mean y) and (mean x + SD x, mean y + SD y).
-The SD line goes through the point of averages.
-A father-son pair where both are 0.5 SDs above the mean would lie on the SD line.
As the SD line does not use r:
-It is insensitive to how tightly the points are clustered -> a better line is needed for prediction.
-At the extremes, with positive/negative correlation, the SD line will over-estimate on the RHS/LHS and under-estimate
on the LHS/RHS.
Regression Line:
Connect: (mean x, mean y) & (mean x + SD x, mean y + r*SD y)
For prediction, the Regression line is better than the SD line as it uses all 5
numerical summaries for the scatter plot. It is a smoothed version of the graph of averages.
The regression effect is observed in test- retest situations, leading to the regression fallacy.
Prediction:
1. Baseline prediction: use the mean y over all x values
2. Vertical strip: use the mean_y on that x value
3. Use Regression line: use line formula from R
4. Normal approximation
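Prediction methods 1 and 3 above can be sketched in R with lm() and predict(); the data below are assumed purely for illustration:

```r
x <- c(1, 2, 3, 4, 5, 6)                       # toy predictor values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)         # toy responses
fit <- lm(y ~ x)                               # fit the regression line
coef(fit)                                      # intercept and slope
predict(fit, newdata = data.frame(x = 4.5))    # 3. regression-line prediction
mean(y)                                        # 1. baseline prediction (ignores x)
```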
1. Extrapolating: a prediction from an x value outside the range of the dataset is completely unreliable.
2. Not checking the scatter plot: always check the scatter plot first to see if a linear association exists.
Reason: we may have a high correlation coefficient and then fit a regression line, but the data may not be linear.
Residuals:
The vertical distance (or 'gap') of a point above or below the regression line.
– Represents the prediction error of the regression line.
In R: res = y - fit$fitted.values (equivalently residuals(fit))
RMS error:
– The root mean square of the residuals => the typical gap between the points and the regression line
– Like a "standard deviation for the line" => measures how accurately the line predicts
In R: RMS error = sqrt(mean(res^2))
For baseline prediction: RMS error = SDy
– Since in baseline prediction, we use mean y for every value of x
– The SD of the line = SD of y
For regression-line prediction: RMS error = sqrt(1 - r^2) * SDy
Special cases:
r=±1, RMS error= 0 =>perfect correlation
r = 0, RMS error= SDy => the regression line is no help in predicting y
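A numerical check of these facts on assumed toy data: the RMS of the residuals equals sqrt(1 - r^2) times the (population) SD of y.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)                  # toy data for illustration
fit <- lm(y ~ x)
res <- y - fit$fitted.values           # residuals (same as residuals(fit))
rms <- sqrt(mean(res^2))               # RMS error of the regression line
r <- cor(x, y)
sdy_pop <- sqrt(mean((y - mean(y))^2)) # population SD of y
sqrt(1 - r^2) * sdy_pop                # equals rms
```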
Residual plot:
Plot residuals vs x
The residual plot is a diagnostic for seeing whether a linear model was appropriate.
If it is random (no pattern), then linear model seems appropriate for the data.
Summary of Linear and Non- Linear models:
1. Simple linear model (straight line)
2. Quadratic model
3. Cubic model
4. Exponential model
5. Von richter
6. Other models
-Exponential growth
-Exponential decay
– Logistic growth
Module 3 – Sampling Data
-Understand chance
Probability
-Chance variability
Model chance variability by box model
-Sample survey
Model chance variability in sample surveys
Basic properties of chance:
Chances are between 0% (impossible) and 100% (certain)
P(A)=1 : certain
P(A)=0 :impossible
Complement event: the chance that a certain outcome does not occur
P(event) = 1 – P(complement event)
Drawing at random: each object in a collection has the same chance of being picked
Prosecutor’s fallacy: A mistake in statistical thinking, whereby it is assumed that the probability of a random match is
equal to the probability that the defendant is innocent.
Conditional Probability:
The probability of two events both occurring is multiplicative:
P(Event 1 and Event 2) = P(Event 1) * P(Event 2|Event 1)
Independence and Dependence (Multiplication Rule):
Independent events:
P(Event 2 | Event 1) = P(Event 2)
– Occurs in probability experiments with replacement (unconditional probability)
– Drawing randomly with replacement ensures independence
The probability of both events happening:
P(Event 1 and Event 2) = P(Event 1) * P(Event 2)
Dependent events:
P(Event 2 | Event 1) != P(Event 2)
– Occurs in probability experiments without replacement (conditional probability)
– Drawing without replacement ensures dependence
The probability of both events happening:
P(Event 1 and Event 2) = P(Event 1) * P(Event 2 | Event 1)
Mutually Exclusive and Non-mutually Exclusive events (Addition Rule):
Mutually Exclusive events: P(Event 1 or Event 2) = P(Event 1) + P(Event 2)
Non-mutually Exclusive events: P(Event 1 or Event 2) = P(Event 1) + P(Event 2) – P(Event 1 & 2)
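The rules above, sketched with a standard deck of cards (the card example is an assumption, not from the notes):

```r
# Multiplication rule: P(two aces in two draws)
p_with    <- (4/52) * (4/52)   # with replacement: independent draws
p_without <- (4/52) * (3/51)   # without replacement: dependent draws
# Addition rule: P(ace or heart) in one draw (not mutually exclusive)
p_ace_or_heart <- 4/52 + 13/52 - 1/52   # subtract P(ace of hearts) once
c(p_with, p_without, p_ace_or_heart)
```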
Methods of Calculating Probability:
1. Making Lists:
Summarise the outcomes or calculate probabilities for equally likely events
a. Write a list of all outcomes
b. Count favorable outcomes
2. Tree Diagram
3. Simulate in R
Provides experimental probability
Experimental probability will approach theoretical probability as number of repetitions approaches infinity
Code: sample(1:6, 1000, rep=T)
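Running the simulation and comparing the experimental probability with the theoretical one:

```r
set.seed(1)                                 # for reproducibility
rolls <- sample(1:6, 1000, replace = TRUE)  # 1000 rolls of a fair die
mean(rolls == 6)                            # experimental probability of a six
1/6                                         # theoretical probability it approaches
```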
Binomial model:
Permutations (order matters):
With replacement:
Choosing r of something that has n different types: n^r ways
Without replacement:
Choosing r of something that has n different types:
P(n, r) = nPr = n! / (n - r)!
Combinations (order doesn't matter):
Without replacement:
Choosing r of something that has n different types:
C(n, r) = nCr = n! / (r! (n - r)!)
Binomial coefficients:
C(n, r) = nCr = n! / (r! (n - r)!)
Binomial model:
Binary trials: only 2 outcomes can occur:
P(event) = p
P(not event) = 1 - p
Binomial theorem: n independent binary trials, with P(event) = p at every trial, n fixed
In R: dbinom(k, n, p) => probability that the event occurs k times in n trials, each with probability p
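For example, the chance of exactly 2 sixes in 5 rolls of a fair die, checked against the binomial formula:

```r
# dbinom(k, n, p): P(exactly k events in n binary trials)
dbinom(2, size = 5, prob = 1/6)
# agrees with the formula C(n, k) * p^k * (1 - p)^(n - k)
choose(5, 2) * (1/6)^2 * (5/6)^3
# cumulative version: P(at most 2 sixes)
pbinom(2, size = 5, prob = 1/6)
```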
Chance variability:
Chance process: observed = expected + chance error
As the number of trials increases:
-> the absolute size of the chance error increases
-> the relative (percentage) size of the chance error decreases
The Law of Large Numbers (Law of Averages):
The proportion of heads becomes more stable as the length of the simulation increases and approaches a fixed number
called the relative frequency.
Assumes that in each trial there is the same chance of each outcome occurring (with replacement)
As the number of trials increases, the proportion of the event will converge to the theoretical or expected proportion
The Box (Population) Model:
Describes the chance of generating a number
Provides information on:
Number and type of all outcomes
Proportion of each outcome
Number of trials
For the sum of random draws from a box model with replacement:
Expected value of sum (EV sum) = number of draws × mean of box
Square Root Law: Standard error of the sum (SE sum) = √(number of draws) × SD of box
As the box represents a population, SD = SDpop
SD of the box:
1. Use the formula: SD = RMS of the gaps from the mean = sqrt(mean of squared gaps)
2. In R: popsd()
3. For a binary box (only two quantitative outcomes a and b): SD = |a - b| * sqrt(p(1 - p)), where p is the proportion of a in the box.
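The box-model formulas can be sketched in R without extra packages (the box contents and number of draws below are assumptions; popsd() is replaced by an explicit population-SD computation):

```r
box <- c(0, 1, 1, 4)                        # contents of the box (assumed)
n_draws <- 25                               # draws with replacement
mean_box <- mean(box)                       # 1.5
sd_box <- sqrt(mean((box - mean(box))^2))   # population SD of the box = 1.5
ev_sum <- n_draws * mean_box                # EV of sum = 25 * 1.5 = 37.5
se_sum <- sqrt(n_draws) * sd_box            # square root law: 5 * 1.5 = 7.5
c(ev_sum, se_sum)
```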