EPH505 TF Review
Lecture 1: Getting Started in R
Masters will …
● Conduct basic operations at the R command prompt
● Create an R-file for portable coding
● Interpret basic elements of the R interface
Version should be >=4.0.0
You can use R for anything, and it’s well-accepted … but it’s your responsibility to know how to apply appropriate methodologies
Command Prompt
Here, R has not yet registered the input.
Here, R is simply echoing what you inputted. A new command prompt is now ready.
Variable Assignment
When you assign a variable in R, R suppresses the output, so an empty command prompt is a good sign!
Now, we can perform basic arithmetic operations on this variable.
Typing in the variable name and hitting Enter prints the value stored in it
ls() tells you the variables stored within R’s memory
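A minimal sketch of the steps above, typed at the command prompt. The variable names x and y are made up for illustration.

```r
# Assignment: R stores the value and prints nothing
x <- 5

# Basic arithmetic on the stored variable
y <- x * 2 + 1   # 5 * 2 + 1 = 11

# Typing the name alone (or calling print) echoes the stored value
print(y)

# ls() lists the names of the objects currently stored in memory
ls()
```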
R-Files: why do we use them?
● They function like a text editor/word document where large amounts of code can be stored
● You can make changes to your code without receiving output each time you hit Enter
○ You can execute your code from an R-file: just hit Edit>Execute
● Very useful for when you want to collaborate
○ If all collaborators have the dataset on their end (and each sets the working directory for their own computer), then all they have to do is run the R-file code
Example code from R-file → not live
R Miscellany: numbering
● R uses [square brackets] to index variables
● Reads the rows left to right, starting at 1
● Inputting claire[10] identified the 10th element
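A quick sketch of square-bracket indexing. The contents of claire are hypothetical, since the original vector is not shown here.

```r
# Hypothetical contents: 10, 20, ..., 100
claire <- seq(10, 100, by = 10)

first  <- claire[1]    # R starts counting at 1, so this is the first element
tenth  <- claire[10]   # the 10th element
middle <- claire[2:4]  # a range of positions returns several elements

print(tenth)
```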
R Miscellany: over-writing & re-annotating
In R, you can overwrite a variable so that it contains new/different elements
But make sure it was intentional! You don’t want to accidentally lose information
If you accidentally annotate your plot incorrectly, you can…
● Manually exit out of plot
● Type dev.off()
● Or issue new plot() command and
overwrite your old one
R Miscellany: data frames & strings
Data frame: a special 2-dimensional data structure (specific to R) that allows us to mix data-types (e.g. strings & numbers) → different from a matrix because a matrix will only allow same data-types
Strings: any information that is enclosed in quotation marks (“”)
Can be characters or text, as long as it is in quotes
Class-type is character
What class-type would class(data$Class) yield? What about class(data$Height)?
Does not support arithmetic! Even if it is a number, wrapped in “”
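A small sketch of a data frame that mixes strings and numbers. The values are invented; only the column names Class and Height come from the question above.

```r
# Hypothetical data: one string column, one number column
data <- data.frame(
  Class  = c("Freshman", "Senior", "Junior"),
  Height = c(64.5, 70, 68.2),
  stringsAsFactors = FALSE  # already the default in R >= 4.0.0, written out for clarity
)

class_of_Class  <- class(data$Class)   # strings are stored as character
class_of_Height <- class(data$Height)  # numbers with decimals are numeric

# A number wrapped in quotes is still a string: "5" + 1 stops with an error.
# Convert it first if arithmetic is really intended.
converted <- as.numeric("5") + 1

print(c(class_of_Class, class_of_Height))
```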
R Miscellany: nesting
Nesting is embedding one functional expression as the argument of another functional expression
Not nesting
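The slide examples for nesting are not reproduced in this text, so here is a sketch with a made-up vector called heights. Both versions give the same answer.

```r
heights <- c(64.5, 70, 68.2)  # hypothetical values

# Not nesting: save the intermediate result, then use it in the next call
m <- mean(heights)
not_nested <- round(m, 1)

# Nesting: mean(heights) is evaluated first and becomes the argument of round()
nested <- round(mean(heights), 1)

print(nested)
```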
Some helpful basic R commands
● ls()
○ No arguments; tells you the variables stored in R’s memory
●
○ Pressing this key will “interrupt” processes in R and return user to command prompt
○ No arguments: tells you the contents of current working directory
○ With file path argument: tells you contents of that particular folder
● setwd("file/folder path")
○ Tells R which location on your device to use as your working directory (i.e. the folder where you are working)
● getwd( )
○ No arguments ; tells you what your working directory currently is
● read.csv("file path")
○ Reads data in from a csv file and creates a data-frame from it
● head(name of dataframe, # of rows you want to see)
○ Shows the first rows of the data-frame, as many as you ask for
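A self-contained sketch that strings these commands together. It writes a tiny made-up csv to a temporary folder first, so the file name example.csv and its contents are assumptions, and dir() is one base-R way to list a folder.

```r
# Create a small hypothetical csv in a temporary folder
folder <- tempdir()
write.csv(data.frame(name = c("a", "b", "c"), height = c(1.5, 2, 2.5)),
          file.path(folder, "example.csv"), row.names = FALSE)

old_wd <- getwd()              # remember where we started
setwd(folder)                  # make the temporary folder the working directory
dir()                          # list the files in the working directory
df <- read.csv("example.csv")  # read the csv into a data frame
first_two <- head(df, 2)       # first 2 rows only
setwd(old_wd)                  # go back to the original working directory

print(first_two)
```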
Lecture 2 Masters Will
● Differentiate data among the 5 main (statistical) data-types
● Predict the data-type in R (without even using R!)
● Illustrate data using the proper chart type
Predict the Data Types!
Nominal Nominal Continuous Continuous Ordinal Discrete Nominal
Dataset from Kaggle: https://www.kaggle.com/datasets/rounakbanik/pokemon
Categorical Data Types
Definition
Arithmetic supported?
Nominal
Categories have no order
Zipcodes, Names, IDs, Colors, Flavors, Sports teams, Gender
Ordinal
Categories have an order
Grade levels, T-shirt sizes, Difficulty level, Amazon ratings, Disease stages, Age range
Rank
Elements ordered in a single category, ties allowed
Billboard hits, Top 10 leading causes of death
Ordinal vs. Rank?
● Ordinal: assigns hierarchy to an overall category
● Rank: assigns hierarchy to each observation within a category
○ The rank that is assigned to every element of the dataset is sensitive
to the composition of the dataset
○ Ranks are PRESERVED only if the composition of the dataset does not change
Example on next slide!
Rank the pokemon 1-6 with rank=1 being the lightest
● What pokemon is rank=1?
● What generation does the pokemon belong to?
NEW CATCH! HOPPIP
GRASS TYPE
Generation
● Now, with Hoppip included in our dataset, what is Bulbasaur’s rank by weight?
Bulbasaur’s rank changed from #1 to #2,
His generation did NOT change
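The same idea as a sketch in R. The weights and the names Other1 and Other2 are illustrative placeholders, not values taken from the dataset.

```r
# Illustrative weights in kg (placeholders)
weight_kg <- c(Bulbasaur = 6.9, Other1 = 8.5, Other2 = 9.0)
rank_before <- unname(rank(weight_kg)["Bulbasaur"])  # lightest, so rank 1

# A lighter pokemon joins the dataset
weight_kg <- c(weight_kg, Hoppip = 0.5)
rank_after <- unname(rank(weight_kg)["Bulbasaur"])   # pushed down to rank 2

print(c(before = rank_before, after = rank_after))
```

The rank depends on who else is in the dataset; an ordinal label such as generation would be unaffected by the new row.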
Quantitative Data Types
Definition
Arithmetic supported?
Discrete
Counting process
Typically integer* (not always!) “How many?”
Number of pokemon, gyms, coffee shops
Continuous
Measurement process
Can be integer, fraction/decimal “How much?”
Weight, Height, Temperature
Warning! Data Presentation vs. Reality
● Data that is shown as a whole number can refer to categorical data
○ Zipcodes, IDs, Gender(1=”Male”, 2=”Female”)
● Data shown as whole numbers can also be continuous
○ Height, Weight often rounded to the nearest whole number but since they are measured they are continuous
● Data shown with decimals can be discrete
○ Seeing 5 pokemon in 2 hours, shown as a rate of 2.5 per hour
● Time is continuous but can be visualized as ordinal (see right figure)
What’s the big deal with data types?
● Data visualization and statistical analysis (more in Module 2)
Should have used a scatterplot; a line plot does NOT make sense here!
How to save a plot?
Analogy: Plot Sandwich
png("x.png")
Plot code HERE
dev.off()
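A sketch of the plot sandwich. The file is written to a temporary folder and the plotted numbers are made up; the important part is that the plot code sits between png() and dev.off().

```r
out_file <- file.path(tempdir(), "x.png")  # hypothetical output location

png(out_file)          # top slice: open the png file
plot(1:10, (1:10)^2)   # filling: any plot code goes here
dev.off()              # bottom slice: close the file so it is actually saved

saved <- file.exists(out_file)
print(saved)
```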
Chart Types
Note: plot(df$x,df$y) is the same as plot(df$y ~ df$x)
Independent Variable (X)
Dependent variable (Y)
Nominal/Ordinal
barplot(table())
Histogram Density
Discrete/Continuous
hist(prob=TRUE)
lines(density())
Scatterplot
Continuous
Continuous
Continuous
plot(type="l")
Continuous
Boxplot vocabulary (quick review from Primer)
● Whiskers = Central lines extending up to 1.5 IQR from either quartile
● Outliers = Beyond whiskers
● Box = 25th and 75th percentile
● Horizontal line = Median
Predict the class in R!
What did I forget to do?
Double-check
Default class in R:
integer: vector contains all numbers, no decimals AND vector is read from file
R does NOT have a “rank” class; ranks are just a NUMERIC type
numeric: vector contains decimals, or vector is created manually and contains strictly numbers
character: vector with string data
factor: vector with string data that has levels with no particular order
ordered factor: vector with string data that has ordered levels
Continuous
Need to convert it to factor or ordered factor to do analysis
Categorical
● I create a smaller dataframe with just the first generation Pokemon, which column(s) have changed their class?
None. Subsetting a dataframe does not change the class of its columns.
● How can I change “type1” from character to factor?
● If I created a column called pokemon strength with 3 levels which one would R see as the first level?
Need to include a levels argument for it to order correctly, rather than in the default alphanumerical order
● I decided to round my height_m to whole number, has it changed its class?
No! It’s still numeric because it was read in from file AND contained a decimal
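A sketch that works through these questions. The vectors type1 and strength are small made-up stand-ins for the real columns.

```r
# Character to factor
type1 <- c("grass", "fire", "water", "fire")  # hypothetical values
type1_f <- as.factor(type1)
class(type1_f)    # factor
levels(type1_f)   # alphabetical by default: fire, grass, water

# Without a levels argument, the first level is simply first in the alphabet
strength <- c("medium", "weak", "strong", "weak")
default_levels <- levels(factor(strength))

# With a levels argument, R uses the order we state
strength_o <- factor(strength,
                     levels = c("weak", "medium", "strong"),
                     ordered = TRUE)
stated_levels <- levels(strength_o)

# Rounding a numeric vector does not change its class
rounded_class <- class(round(c(0.7, 1.6)))

print(default_levels)
print(stated_levels)
```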
Warning! Note these distinctions
● String vs. character?
○ Character is a class, strings are CONTAINED in the
character class
● Character vs. Factor?
○ Characters are just simple strings
○ Factors are seen as categories in R that have
different levels
● rank() vs. order() vs. sort()
○ rank() returns the RANK of each element: a whole number, or a decimal if there are ties
○ order() returns the INDICES that would yield a sorted set; these are always whole numbers
○ sort() returns the sorted VALUES themselves
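A side-by-side sketch on one small made-up vector with a tie, to show how the three outputs differ.

```r
x <- c(30, 10, 20, 10)

r <- rank(x)   # 4.0 1.5 3.0 1.5  (the two 10s share ranks 1 and 2, so each gets 1.5)
o <- order(x)  # 2 4 3 1          (positions to visit to read x from smallest to largest)
s <- sort(x)   # 10 10 20 30      (the values themselves, sorted)

# Indexing by order() reproduces sort()
same <- all(x[o] == s)
print(list(rank = r, order = o, sort = s))
```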
Lecture 3 Masters Will
● Apply measures of centrality in their appropriate contexts
● Calculate group-wise summaries from a raw dataset
● Explain importance of accounting for survey weights and groupings
Measures of Central Tendency
Conceptual Definition
Function in R
Mean
VALUE that lies in the middle of the dataset’s VALUES
mean() sum()/length()
Symmetrical and Single-peaked
Median
DATAPOINT that lies in the middle of the dataset’s SORTED DATAPOINTS (if odd)
median() quantile(x, 0.5)
When NOT symmetrical or NOT singly-peaked
Mode
Most populated element
sort(table())
No designated function in R
Categorical data
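A sketch of the three measures on made-up data. Since R has no designated mode function, the last lines show one way to pull the mode out of sort(table()).

```r
x <- c(2, 3, 3, 5, 12)  # hypothetical numeric data

m1 <- mean(x)                       # 5
m2 <- sum(x) / length(x)            # same thing by hand
med1 <- median(x)                   # 3
med2 <- unname(quantile(x, 0.5))    # the 50th percentile is the median

colors <- c("red", "blue", "red", "green")  # hypothetical categorical data
counts <- sort(table(colors), decreasing = TRUE)
mode_value <- names(counts)[1]      # most populated element

print(c(mean = m1, median = med1))
print(mode_value)
```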
Report mean +/- standard deviation
What kind of distribution do you expect income will have?
More on this in Lecture 6
Best practice is to check whether there are subgroups here; reporting a single summary would be misleading
Report median and IQR
Sometimes reported as Median (25th-75th)
Mean vs. (Median and Mode)
● Mean is an abstraction; it does not have to be an actual data element
● Median (if the number of datapoints is odd) and Mode are actual elements in the dataset
Measures of Variation
Definition
Function in R
Variance
Average of the squared deviations from the mean
var()
For statisticians; units are squared
Standard Deviation
Square root of the variance
sd()
Same units as the mean
Range
Maximum value minus the minimum value
diff(range(x))
Interquartile Range
Difference between the upper and lower quartiles
IQR() quantile(x,0.75)- quantile(x,0.25)
Report with the median
Coefficient of Variation
Standard deviation divided by the mean, unitless
sd()/mean()
No designated function in R
Compare variation between two things that have different units
Sigma Notation – Mean
Sigma Notation – Variance
● Average of the squared deviations from the mean
*“Mean” term
Deviation term
Why do we square the deviations?
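The slide formulas are not reproduced in this text, so here is a sketch that builds the variance piece by piece on a made-up vector. One assumption to flag: R's var() divides by n - 1 (the sample variance), so the hand calculation does the same.

```r
x <- c(2, 4, 4, 6)   # hypothetical data
n <- length(x)

x_bar <- sum(x) / n        # the mean: add everything up, divide by n
deviations <- x - x_bar    # deviation term: distance of each value from the mean

# Raw deviations cancel out, which is one reason we square them
sum_raw <- sum(deviations)

# "Mean" term: sum of squared deviations divided by n - 1
s2 <- sum(deviations^2) / (n - 1)

matches_var <- isTRUE(all.equal(s2, var(x)))
matches_sd  <- isTRUE(all.equal(sqrt(s2), sd(x)))
print(c(variance = s2, sd = sqrt(s2)))
```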
Does height or weight have more variation?
Use the Coefficient of Variation to compare two measures with different units. Even though we got the same answer both ways here, this won’t always be the case; see Problem Set #3
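A sketch of that comparison with invented heights and weights. The helper cv() is not a built-in; it is defined here as sd divided by mean.

```r
height_cm <- c(150, 160, 170, 180)  # hypothetical values
weight_kg <- c(50, 60, 80, 90)      # hypothetical values

# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

cv_height <- cv(height_cm)
cv_weight <- cv(weight_kg)

# Unitless: switching from cm to m leaves the CV unchanged
unit_free <- isTRUE(all.equal(cv(height_cm), cv(height_cm / 100)))

print(c(height = cv_height, weight = cv_weight))
```

In this made-up example weight varies more than height relative to its own mean, which is a comparison the raw standard deviations (cm vs. kg) cannot make.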
Weighted Mean/Global Mean vs. Mean
● Weighted Mean/Group Mean/Global average = each person in the group represents a proportion of the population
● Mean = each person represents themselves
weighted.mean() takes 3 arguments
1. The values you want to take the mean of
2. Weights
3. na.rm=TRUE
Why is weighted mean important?
It saves time, $$$, resources (we don’t have to go out and find 300 million people)
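A sketch of the three arguments in action. The incomes and survey weights are invented; each weight stands for how many people in the population that respondent represents.

```r
income  <- c(30000, 50000, NA, 90000)  # hypothetical survey answers, one missing
weights <- c(100, 300, 50, 100)        # hypothetical survey weights

# Plain mean: each respondent represents only themselves
plain <- mean(income, na.rm = TRUE)

# Weighted mean: values, weights, and na.rm = TRUE to drop the missing answer
weighted <- weighted.mean(income, weights, na.rm = TRUE)

print(c(plain = plain, weighted = weighted))
```

Here the weighted mean is (30000*100 + 50000*300 + 90000*100) / 500 = 54000, which differs from the plain mean because the middle respondent stands in for more people.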
The doBy Package
Functional notation “as a function of”
Function(s) you want to apply.
If multiple functions, use a vector to list them
keep.names=FALSE by default
What happens if you add keep.names=TRUE?
Long format
● Long-format:
○ SINGLE observations as rows, 1 column per variable
Long-format
NOT Long-format
Lecture 4: Probability
Masters Will…
● Detect probability structures in word problems
● Calculate raw and conditional probabilities
● Recognize sampling scenarios that lead to erroneous interpretations
Frequentist vs. Bayesian Statistics
Frequentist
● Definition: If an experiment is repeated n times under essentially identical conditions, and if the event A occurs m times, then as n grows large, the ratio m/n approaches a fixed limit that is the probability of A (Pagano & Gauvreau, pg 127)
○ P(A)=m/n
● Can also be described as the proportion
of times an event occurs
● It is the bedrock of nearly all biostatistics, which is based on this notion of frequentism
Bayes Theorem
P(A|B) = P(B|A) × P(A) ÷ [P(B|A) × P(A) + P(B|not A) × P(not A)]
● P(B|A) = probability of B occurring, given A has occurred
● P(A) = probability of A occurring
● P(B|not A) = probability of B occurring, given A has not occurred
● P(not A) = probability of A not occurring
● P(A|B) = probability of A occurring, given B has occurred
Frequentist Statistics
● If we conduct more and more experiments, the collection of results will converge to the true population result.
● The notion of the sampling distribution is grounded in this concept!
● 100 people go out → each find 10 people → the averages collected by each
of the 100 people, should be closer to the true population mean than each individual average. If plotted, the averages should be approximately normally distributed around the true population mean.
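A simulation sketch of the 100-people scenario. The population (normal, mean 50, sd 10) is an assumption made up for illustration, and set.seed() only makes the run repeatable.

```r
set.seed(1)

# Hypothetical population
population <- rnorm(100000, mean = 50, sd = 10)

# 100 people each sample 10 individuals and record the average
sample_means <- replicate(100, mean(sample(population, 10)))

# The averages cluster around the population mean...
center <- mean(sample_means)

# ...and vary much less than individual observations do
spread_means <- sd(sample_means)
spread_pop   <- sd(population)

print(c(center = center, sd_of_means = spread_means, sd_of_population = spread_pop))
```

Calling hist(sample_means) afterwards would show the roughly bell-shaped pile of averages.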
★ Probability = “proportion”
○ Usually indicated by upper-case letters
○ Either occurs, or does not occur
Set Theory Vocabulary
Definition
“A intersect B”
Both events occurred
“A union B”
Either A occurred, or B occurred, or A & B both occurred
“A complement”
Event A not occurring
Set Operations in R
Create a new vector called b
Create a new vector called a
Finds all of the values that are the same in both vectors… in other words “intersect”. Note: you can also use intersect(b,a) to yield the same values
Finds the values that are in a, in b, or in both a and b. Note: you can also use union(b,a) to yield the same values (possibly in a different order)
Finds the set difference of b relative to a (output is the values unique to the “b” vector)
Finds the set difference of a relative to b (output is the values unique to the “a” vector)
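The slide code is not reproduced in this text, so this is a sketch of what the set operations might look like. The contents of a and b are invented.

```r
a <- c(1, 2, 3, 4, 5)  # hypothetical values
b <- c(4, 5, 6, 7)     # hypothetical values

both    <- intersect(a, b)  # values in a AND b: 4 5
either  <- union(a, b)      # values in a, b, or both: 1 through 7
only_b  <- setdiff(b, a)    # values in b that are not in a: 6 7
only_a  <- setdiff(a, b)    # values in a that are not in b: 1 2 3

print(list(intersect = both, union = either, only_b = only_b, only_a = only_a))
```

Note that intersect() and union() give the same set whichever vector comes first, while setdiff() depends on the order of its arguments.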
Raw vs. Conditional Probability
● Definition: P(X)
● What are the chances of X?
● P(X) = m/n
○ Dividing a portion (m) from the total (n)
● Example:
○ What are the chances that an athlete is taller than 72 inches?
Conditional
● Definition: P(X|Y). What are the chances of X, given Y? Note that this is generally not the same as P(Y|X), the probability of Y given X
● P(X|Y) = P(X⋂Y) ÷ P (Y)
● The chances that an event occurred, given
that another event already occurred
● Since every probability has some context,
every probability is conditional
● Example:
○ What is the probability of being taller than 72 inches, given that you are a football player?
○ P(h>72 | football)
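A sketch of the raw and conditional probabilities on a tiny invented roster of athletes. Taking mean() of a TRUE/FALSE vector gives the proportion of TRUEs, which is the m/n from above.

```r
# Hypothetical roster
athletes <- data.frame(
  sport  = c("football", "football", "football", "soccer", "soccer", "tennis"),
  height = c(75, 73, 70, 68, 74, 71)
)

tall     <- athletes$height > 72
football <- athletes$sport == "football"

# Raw probability: P(h > 72) = 3 tall athletes out of 6
p_tall <- mean(tall)

# Conditional probability via the formula: P(tall and football) / P(football)
p_tall_given_fb <- mean(tall & football) / mean(football)

# Same answer by restricting attention to the football players first
p_direct <- mean(athletes$height[football] > 72)

print(c(raw = p_tall, conditional = p_tall_given_fb))
```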
Berkson’s Paradox / Collider Bias
● Inappropriate sampling causes events to appear correlated when they are actually not
● Example: Are smokers less likely to be
hospitalized?
○ Smokers are more likely to be present at hospitals (perhaps due to respiratory illness)
○ However, COVID leads more non-smokers to be present at hospitals
○ Now, if you took a sample from the hospital during
COVID, you are more likely to find nonsmokers in your sample → and erroneously conclude that smokers are less likely to be hospitalized than non-smokers.
Hospitalization
Lecture 5: Probabilistics
Masters Will…
● Identify conditional probabilities from word problems
● Recognize True/False Positive/Negative from word problems
● Calculate odds ratio and relative risk
Odds & Risk
● Risk (Probability) = m/n
○ The probability of an event (i.e. an outcome) is its frequency of occurrence (m) over a large number of trials (n)
○ Always between 0 and 1
● Odds = P / (1-P)
○ The probability that an event will occur, divided by the probability that the event will not occur
○ Not quite a probability, but derived from probability
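A two-line sketch of the conversion, using a made-up risk of 0.2.

```r
p <- 0.2                     # hypothetical risk (probability)

odds <- p / (1 - p)          # 0.2 / 0.8 = 0.25, i.e. "1 to 4"
p_back <- odds / (1 + odds)  # converting odds back recovers the probability

print(c(risk = p, odds = odds))
```

Unlike risk, odds are not capped at 1: a risk of 0.8 corresponds to odds of 4.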
Diagnostic Tests
Test Result     Disease (D)                No Disease – Healthy (H)
Positive (+)    TP – Sensitivity P(+|D)    FP – P(+|H)
Negative (-)    FN – P(-|D)                TN – Specificity P(-|H)
● True-Negative: Test correctly identifies healthy person (Specificity)
● True-Positive: Test correctly identifies sick person (Sensitivity)
● False-Negative: Test mistakes sick person as healthy
● False-Positive: Test mistakes healthy person as sick
Another Helpful Illustration…
Lecture 6, slide 28
Back to Bayes Theorem!
Prevalence = # of existing cases/total population
Note: 1-Specificity = False Positive rate, or P(+|H)
Beauty of Bayes
Allows us to revise existing predictions or theories (update probabilities) given new or additional evidence
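A sketch of that update for a diagnostic test: starting from the prevalence, a positive result revises the probability of disease. The sensitivity, specificity, and prevalence values are invented.

```r
sensitivity <- 0.90  # hypothetical P(+|D)
specificity <- 0.95  # hypothetical P(-|H)
prevalence  <- 0.01  # hypothetical P(D) before testing

false_positive_rate <- 1 - specificity  # P(+|H)

# Bayes Theorem: P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|H)P(H)]
numerator   <- sensitivity * prevalence
denominator <- numerator + false_positive_rate * (1 - prevalence)
p_disease_given_positive <- numerator / denominator

print(p_disease_given_positive)
```

With these made-up numbers the probability of disease rises from 1% to roughly 15% after a positive test: higher than before, but far from certain because the disease is rare.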
Relative Risk
● Definition: Probability that a member of a group receiving some exposure will develop the disease, relative to the probability that a member of a group that is unexposed will develop the disease
➢ RR > 1: increased risk among those w/ exposure (compared to those unexposed)
➢ RR = 1: probabilities of disease in the two groups are identical
➢ RR < 1: decreased risk among those w/ exposure (compared to those unexposed)
Odds Ratio
● Definition: Odds of disease among members of the exposed group, relative to the odds of disease among members of the unexposed group
How to Calculate Odds Ratio from a 2x2 Table
THREE WAYS
Cross-product
Vertically
Horizontally
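The slide formulas are not reproduced in this text, so this sketch assumes a particular layout: rows are exposed/unexposed, columns are disease/no disease, and the counts a, b, c, d are made up. Which calculation counts as "vertical" or "horizontal" depends on how your table is laid out; all three routes land on the same odds ratio.

```r
# Hypothetical 2x2 table
#              Disease   No disease
# Exposed        a = 30     b = 70
# Unexposed      c = 10     d = 90
a <- 30; b <- 70; c_ <- 10; d <- 90  # c_ avoids masking R's c() function

# Relative risk: risk in the exposed over risk in the unexposed
rr <- (a / (a + b)) / (c_ / (c_ + d))

# Odds ratio, three ways
or_cross   <- (a * d) / (b * c_)     # cross-product
or_by_row  <- (a / b) / (c_ / d)     # odds of disease within each row
or_by_col  <- (a / c_) / (b / d)     # odds of exposure within each column

print(c(RR = rr, OR = or_cross))
```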
Simpson’s Paradox
A trend or result that is present when data is put into groups, but reverses or disappears when the data is combined.
Count/proportion data can lead to different conclusions depending on whether a co-factor is introduced.
At first glance, Option B seems superior (shows efficacy across two studies)
However, if we account for sample size, Option A is actually superior overall!
Lecture 6 - Probability (canonical) Distributions
Masters Will...
• Infer distribution shape or type from knowledge of underlying process
• Convert raw data to standardized and normalized form
• Translate raw data into percentiles, and vice-versa
What is a probability (canonical) distribution
Frequency Distribution
Listing of all the frequencies of outcomes in an experiment that you actually observed during the experiment
Ask 1000 children in : “What is your birth order, if you have any siblings?” and compute your results in the form of a histogram
Probability distribution
Listing of all the probabilities of possible outcomes that could occur if the experiment was done
In our scenario, a probability distribution lists each possible outcome and its corresponding probability. The probabilities would represent the relative frequency of each occurrence, if the sample size was infinite. All possibilities are taken into account, so the sum of all their probabilities would be 1
Binomial Distribution
● How do we compute these distributions? R does most of the work!
dbinom(x, size, prob) - returns the probability of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed
● Serge flips a fair coin 20 times. What is the probability that the coin lands on heads exactly 12 times? dbinom(12, 20, 0.5)
● What does dbinom(17,35,0.7) translate to in plain English?
The probability of obtaining exactly 17 events in 35 trials, where the probability of the event occurring is 0.7 (or 70%)
1. Dichotomous = two outcomes (smoker vs. non-smoker)
2. Each trial’s outcome is independent
3. The probability of an outcome is constant in each trial
4. Fixed number of trials
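A sketch that checks the coin-flip example above against the binomial formula written out by hand, to show what dbinom() is doing under the hood.

```r
# Probability of exactly 12 heads in 20 flips of a fair coin
p_r <- dbinom(12, 20, 0.5)

# Same thing by hand: (ways to choose 12 flips) * P(heads)^12 * P(tails)^8
p_hand <- choose(20, 12) * 0.5^12 * 0.5^8

# All possible outcomes (0 through 20 heads) together have probability 1
total <- sum(dbinom(0:20, 20, 0.5))

print(p_r)  # about 0.12
```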
What does the binomial distribution look like
SYMMETRIC LEFT SKEW RIGHT SKEW
Poisson Distribution
Rare events
● The dpois(x,lambda) function finds the probability that a certain number of successes (x) occur based on an average rate of success (lambda)
● It is known that, on average, 20 cars drive in front of my house per hour. In a given hour, what is the probability that exactly 16 cars drive by? dpois(16,20)
● Important: Slide 45/65 Lecture 6
1. The average number of successes (μ) that occurs in a specified region is known
2. The probability that a success will occur is
proportional to the size of the region
3. The probability that a success will occur in
an extremely small region is virtually zero
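A sketch that checks the car example above against the Poisson formula written out by hand, with lambda = 20 cars per hour and x = 16.

```r
lambda <- 20  # average number of cars per hour (from the example above)
x <- 16       # number of cars we are asking about

p_r <- dpois(x, lambda)

# Same thing by hand: e^(-lambda) * lambda^x / x!
p_hand <- exp(-lambda) * lambda^x / factorial(x)

print(p_r)
```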