COMP2420/6420
Data Management, Analysis and Security
Data analysis I, II and Experimental Thinking – 1
Health and Biosecurity CSIRO Feb 2022
What is "Statistics"?
The science of collecting, describing and analyzing data.
Effective ways to collect data, methods to describe data and ways of analyzing data
Data collection
The primary and most important step for research, irrespective of the field of research.
The approach of data collection is different for different fields of study, depending on the required information
Data collection
Data structure
ID  Gender  Age  ···  GPA
1   M       21   ···  3.1
2   F       20   ···  2.5
.   .       .         .
10  M       25   ···  3.7
Case(unit): the subject that we obtain information about
Variable(s) : any characteristic that is recorded for each case
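In R, a dataset of this shape is usually held in a data frame, with one row per case and one column per variable. A minimal sketch using only the three rows visible in the table above; the object name `students` is illustrative and not from the slides:

```r
# Hypothetical data frame holding the three rows visible in the slide's table
students <- data.frame(
  ID     = c(1, 2, 10),
  Gender = c("M", "F", "M"),
  Age    = c(21, 20, 25),
  GPA    = c(3.1, 2.5, 3.7)
)

nrow(students)    # number of cases (rows) in this excerpt
names(students)   # the variables recorded for each case
```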
Data collection
Types of data
There are two types of data: categorical and numerical
Categorical data – nominal and ordinal data
Numerical data – continuous and discrete data
A variable is categorical (qualitative) or numerical (quantitative) according to the type of data it records
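One way these four types might be represented in R is sketched below. All values and variable names are invented for illustration; none come from the slides.

```r
# Illustrative values only
browser    <- factor(c("Firefox", "Chrome", "Firefox"))   # nominal: no order
grade      <- factor(c("B", "A", "C"),
                     levels = c("C", "B", "A"),
                     ordered = TRUE)                      # ordinal: C < B < A
n_siblings <- c(0L, 2L, 1L)                               # discrete (integer counts)
height_cm  <- c(171.2, 165.8, 180.4)                      # continuous (real-valued)

is.ordered(browser)   # FALSE: the categories have no natural order
is.ordered(grade)     # TRUE
grade[1] < grade[2]   # comparisons are meaningful for ordinal data
```

Storing ordinal data as an ordered factor keeps the level order available to tables and plots, which a plain character vector would lose.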
Data collection
Explanatory and response variables
If we are using a variable to help us to understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable.
Explanatory variable: the independent or predictor variable
Response variable: dependent or outcome variable
Data collection
A researcher believes that the origin of the beans used to make a cup of coffee affects hyperactivity. He wants to compare coffee from three different regions: Africa, South America, and Mexico.
Explanatory variable: the origin of coffee bean
Response variable: hyperactivity level
Data collection
Sampling from Populations
Often have questions concerning large populations
Usually, it is not feasible to gather data for an entire population.
Often gather information from a smaller subset of the population: sample
Data collection
Statistic and Parameter
A measure concerning a sample (e.g., sample mean): statistic
A measure concerning a population (e.g., population mean): parameter
Using data from a sample to gain information about the population: Statistical inference
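The distinction can be illustrated with a small simulation. This is only a sketch: the "population" here is simulated, and its size, mean, standard deviation and the seed are arbitrary assumptions.

```r
set.seed(1)                                       # arbitrary seed, for reproducibility
population <- rnorm(10000, mean = 50, sd = 10)    # simulated population (assumption)

mu   <- mean(population)                 # parameter: the population mean
samp <- sample(population, size = 100)   # a simple random sample of 100 cases
xbar <- mean(samp)                       # statistic: the sample mean

c(parameter = mu, statistic = xbar)      # the statistic estimates the parameter
```

In practice only `xbar` is observable; statistical inference is about what `xbar` lets us say about `mu`.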
Data collection
Educational policy researchers randomly selected 400 teachers from the National Science Teachers Association (NSTA) database of members and asked them whether or not they believed that evolution should be taught in public schools. They received responses from 252 teachers.
Population: all NSTA members, Sample: the 252 respondents
Describing data
Categorical variables: univariate
Twenty-five students are surveyed about their web browser preferences. The categories to choose from are coded as (1) Internet Explorer (2) Firefox (3) Google Chrome (4) Others. The raw data are
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
How can we summarize these data?
Describing data
Categorical variables: univariate
A categorical variable can be summarized by tables or graphically with barplots, dot charts and pie charts.
Web browser   1  2  3  4
Frequencies  10  4  8  3
Describing data
Categorical data – Example
> webbrowser = c(3,4,1,1,3,4,3,3,1,3,2,1,2,
1,2,3,2,3,1,1,1,1,4,3,1)
> webbrowser
[1] 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
> is.numeric(webbrowser)
> webbrowser = as.factor(webbrowser)
> table(webbrowser)
webbrowser
 1  2  3  4
10  4  8  3
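The counts become easier to read if the numeric codes carry the category names from the survey description, and relative frequencies follow directly from the table. A short sketch using the same data:

```r
webbrowser <- c(3,4,1,1,3,4,3,3,1,3,2,1,2,1,2,3,2,3,1,1,1,1,4,3,1)

# Attach the category names given in the survey description to codes 1..4
webbrowser <- factor(webbrowser, levels = 1:4,
                     labels = c("Internet Explorer", "Firefox",
                                "Google Chrome", "Others"))

counts <- table(webbrowser)
counts               # frequencies: 10, 4, 8, 3
prop.table(counts)   # relative frequencies: 0.40, 0.16, 0.32, 0.12
```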
Describing data
Categorical data – Example
> barplot(table(webbrowser), xlab = "Web Browser",
ylab = "Frequency")
[Figure: barplot of the web browser frequencies; x-axis: Web Browser (1 to 4), y-axis: Frequency (0 to 10)]
Describing data
Numerical variable
What is the general shape of the data?
Where are the data values centered? (what is the central tendency?)
How do the data vary? (how spread out are the values?)
Describing data
Numerical variable: Shape
Using graphical displays such as dotplots and histograms, we can look at the shape of the data.
A distribution can be described in terms of symmetry and skewness.
Describing data
Numerical variable: central tendency
A numerical descriptive measure that locates the centre of a distribution of measurements or describes a 'typical' value
Most common measures of centre: 3M (Mode, Median, Mean)
They not only simplify the description of the data but also allow different datasets to be compared quantitatively
Describing data
Numerical variable: central tendency
Mode: the observation in the dataset that occurs most often (i.e., has the highest frequency of occurrence.)
Median: the middle number in an ordered dataset.
Mean: the arithmetic average of all the measurements in the dataset.
Describing data
Numerical variable: central tendency
Example. Find the 3M of the following sample:
X = {18, 19, 18, 20, 18, 18, 20, 21, 37, 18}
Mode: 18
Median: 18.5
Mean: 20.7
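These three values can be checked in R. Note that base R's `mode()` returns the storage mode of an object, not the statistical mode, so the sketch below derives the mode from a frequency table; the name `mode_X` is illustrative.

```r
X <- c(18, 19, 18, 20, 18, 18, 20, 21, 37, 18)

freq   <- table(X)                                      # frequency of each value
mode_X <- as.numeric(names(freq)[freq == max(freq)])    # most frequent value(s)

mode_X      # 18
median(X)   # 18.5
mean(X)     # 20.7
```

The single large value 37 pulls the mean above the median, which is why the median is often preferred for skewed data.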
Describing data
Numerical variable: variability
A single value that measures the internal variation of the data: the extent to which data items vary from one another or from a central point.
Three of the more commonly used measures of variability: Range, Variance, Standard deviation
Describing data
Numerical variable: variability
Range: the difference between the largest and the smallest values in the data (the simplest one)
Variance: a single value obtained by summing the squares of the deviations from the mean and dividing this sum by (n − 1), where n is the sample size
Standard deviation: the square root of the variance
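Written out for a sample x_1, ..., x_n with sample mean x̄, the three measures above read as follows. The symbols R, s² and s are notation assumed here for illustration; the slides do not name them.

```latex
\begin{align*}
R   &= \max_i x_i - \min_i x_i, \\
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \\
s   &= \sqrt{s^2}.
\end{align*}
```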
Describing data
Numerical variable: variability
Example. Consider the following sample of data: x = {10, 12, 15, 17, 21}
The sample mean is x̄ = (10 + 12 + ··· + 21)/5 = 15
x          10  12  15  17  21
(x − x̄)    -5  -3   0   2   6
(x − x̄)²   25   9   0   4  36
The variance of x is 74/(5 − 1) = 18.5 and the standard deviation is √18.5 ≈ 4.30
Example – cont.
How to get the mean, variance and standard deviation in R
> x = c(10,12,15,17,21) # input data as a vector
> xbar = mean(x) # calculate the mean directly
> x - xbar
[1] -5 -3  0  2  6
> (x-xbar)^2
[1] 25  9  0  4 36
> sum((x-xbar)^2)
[1] 74
> sum((x-xbar)^2) / (length(x) - 1)
[1] 18.5
> var(x) # calculate the variance directly
[1] 18.5
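The range and the standard deviation of the same vector follow in the same way. A short sketch using base R functions:

```r
x <- c(10, 12, 15, 17, 21)

diff(range(x))   # range as a single number: 21 - 10 = 11
sd(x)            # standard deviation
sqrt(var(x))     # the same value, by definition of the standard deviation
```

`range(x)` itself returns the pair (min, max), so `diff()` is needed to get the single-number range defined above.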
Describing data
Five number summary
The five number summary gives you a rough idea about what the dataset looks like and includes 5 values (items):
The minimum (min) and the maximum (max)
The first quartile (25%), the median (50%), and the third quartile (75%)
Example – cont.
Five number summary
> x = c(10,12,15,17,21)
> length(x)
> sum(sort(x)[3]) # the median the hard way
> median(x) # easy way
> quantile(x, c(0.25,0.5,0.75))
25% 50% 75%
 12  15  17
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     10      12      15      15      17      21
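Base R also provides `fivenum()`, which returns the five values directly, and `IQR()` for the distance between the quartiles. A brief sketch on the same vector:

```r
x <- c(10, 12, 15, 17, 21)

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
IQR(x)       # third quartile minus first quartile: 17 - 12 = 5
```

For this sample the hinges coincide with the quartiles reported by `quantile()`. For other data the two can differ slightly, because `fivenum()` uses Tukey's hinges while `quantile()` interpolates.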
Describing data
Graphical summaries of the data
Categorical (or Qualitative) data: pie charts, bar charts and dotplots, which make the distribution of the data easy to grasp quickly
Numerical (or Quantitative) data: boxplot, histogram and density curve
Describing data
Graphical summaries of the data
Numerical data are summarized graphically with histograms, boxplots, and density curves.
Histogram: constructed by binning the data and counting the number of observations in each bin
Density plot: thought of as plots of smoothed histogram
Boxplot: visualization of five number summary (shown above)
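The binning behind a histogram can be inspected without drawing anything. A minimal sketch using the built-in `faithful` data; the names `w` and `h` are illustrative:

```r
w <- faithful$waiting

h <- hist(w, plot = FALSE)   # compute the bins without drawing the plot
h$breaks                     # bin boundaries chosen by R
h$counts                     # number of observations in each bin

sum(h$counts) == length(w)   # every observation falls in exactly one bin
```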
Numerical data – Histogram and density plot in R
The dataset faithful – recording the waiting time between eruptions of Old Faithful.
> data("faithful")
> faithful[1:5,]
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
> hist(faithful$waiting)
Example – cont.
Histogram and density plot in R
> hist(faithful$waiting)
> hist(faithful$waiting, breaks = 5) # 5 breaks
> hist(faithful$waiting, probability = TRUE)
> lines(density(faithful$waiting), col = "red")
[Figure: three panels titled "Histogram of faithful$waiting"; x-axis: faithful$waiting (40 to 100); y-axes: frequency (0 to 50 and 0 to 80) and density (0.00 to 0.04)]
Example – cont.
Histogram and density plot in R
A histogram and density plot show:
Overall pattern: where is the centre of the data, what is its spread, what is the shape of the spread
Unusual differences: a point lying away from the main part of the pattern is an outlier
Example – cont.
Boxplot in R
A visual representation of the five number summary is a box plot.
> boxplot(faithful$waiting)
> boxplot(faithful$waiting, notch = TRUE)
> boxplot(faithful$waiting, horizontal = TRUE)
[Figure: three boxplots of faithful$waiting (default, notched, horizontal); axis range 50 to 90]
Example – cont.
The inter-quartile range (IQR): the box spans the quartiles (Q1 to Q3), so its length is IQR = Q3 − Q1
If the median is near the centre of the box, the distribution is reasonably symmetric.
The whiskers extend to the most extreme observations lying within Q3 + 1.5 × IQR (upper) and Q1 − 1.5 × IQR (lower): an observation outside this range -> outlier
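The 1.5 × IQR rule can be applied by hand and compared with what R's boxplot machinery flags. This is a sketch: the names `lower_fence` and `upper_fence` are illustrative, and `boxplot.stats()` works from hinges rather than interpolated quantiles, so the two approaches can disagree slightly on some datasets.

```r
w <- faithful$waiting

q   <- quantile(w, c(0.25, 0.75))   # Q1 and Q3
iqr <- q[2] - q[1]                  # inter-quartile range

lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

w[w < lower_fence | w > upper_fence]   # observations flagged by the rule
boxplot.stats(w)$out                   # points boxplot() would draw as outliers
```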
Describing data
Pros and cons of histograms and boxplots
Advantage
Can handle large data
Shape and outliers
Disadvantage
Histogram: information lost due to grouping; effect of bin width
Boxplot: can't see multimodality; don't know the sample size
Describing data
Two categorical variables
Bivariate, categorical data is often presented in the form of a contingency table
Counting the occurrences of each possible pair of levels and placing the frequencies in each cell
Can focus on the relationships by comparing the rows or columns.
=⇒ Later, statistical tests will be developed to determine whether there is any association between variables.
Describing data
Two categorical variables – Example
Will students who performed well last semester perform well this semester? That is, is past performance an indicator of future performance?
The dataset grades contains the grades students received in a math class and their grades in a previous math class.
Describing data
Two categorical variables – Example
> library(UsingR) # data grades from UsingR package
> data(grades)
> grades[1:5,]
  prev grade
1   B+    B+
2   A-    A-
3   B+    A-
4    F     F
5    F     F
….
Describing data
Two categorical variables – Example
> table(grades$prev, grades$grade)
> table(grades)
      grade
prev    A  A-  B+   B  B-  C+   C   D   F Sum
  A    15   3   1   4   0   0   3   2   0  28
  A-    3   1   1   0   0   0   0   0   0   5
  B+    0   2   2   1   2   0   0   1   1   9
  B     0   1   1   4   3   1   3   0   2  15
  B-    0   1   0   2   0   0   1   0   0   4
  C+    1   1   0   0   0   0   1   0   0   3
  C     1   0   0   1   1   3   5   9   7  27
  D     0   0   0   1   0   0   4   3   1   9
  F     1   0   0   1   1   1   3   4  11  22
  Sum  21   9   5  14   7   5  20  19  22 122
Describing data
Two categorical variables – Example
The table shows that the current grade relates quite a bit to the previous grade.
Of those students whose previous grade was an A, fifteen got an A in the next class.
The marginal distributions of the two variables are similar.
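Marginal totals and row-wise comparisons of a two-way table can be obtained with `addmargins()` and `prop.table()`. The sketch below uses a small invented dataset rather than the UsingR `grades` data, so that it runs without extra packages; the vectors `prev` and `grade` here are hypothetical.

```r
# Hypothetical paired grades for eight students (not the UsingR data)
prev  <- c("A", "A", "B", "B", "B", "C", "C", "A")
grade <- c("A", "B", "B", "B", "C", "C", "B", "A")

tab <- table(prev, grade)     # two-way contingency table of counts

addmargins(tab)               # adds a "Sum" row and column (the marginal totals)
prop.table(tab, margin = 1)   # row proportions: distribution of grade within each prev level
```

Comparing the rows of the proportion table shows whether the distribution of the current grade changes with the previous grade.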
Describing data
Graphical summaries of two-way contingency table
Barplots can be used effectively to show the data in a two-way table.
[Figure: barplot of the two-way table; x-axis: grade levels A, A−, B+, B, B−, C+, C, D, F; y-axis: frequency (0 to 20)]
Describing data
Two numerical variables
Scatterplots, correlation and simple linear regression can be used to analyse two numerical variables.
Scatterplot: graphical representation
Correlation: a measure of the direction and strength of the relationship between two numerical variables.
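As an illustration, the built-in `faithful` data has two numerical variables. The sketch below draws the scatterplot, computes the Pearson correlation and fits a simple linear regression; the axis labels and the names `r` and `fit` are choices made here, not from the slides.

```r
# Scatterplot of waiting time against eruption duration
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption duration (min)", ylab = "Waiting time (min)")

r <- cor(faithful$eruptions, faithful$waiting)   # Pearson correlation coefficient
r                                                # sign gives direction, size gives strength

fit <- lm(waiting ~ eruptions, data = faithful)  # simple linear regression
coef(fit)                                        # intercept and slope
abline(fit, col = "red")                         # add the fitted line to the scatterplot
```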
Describing data
One categorical and one numerical variable
Often we want to compare groups in terms of a quantitative (numerical) variable.
Example – we want to compare the heights of males and females. In this case height is a quantitative variable while biological sex is a categorical variable.
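A comparison of this kind might be sketched in R as follows. The heights are invented for illustration and are not real measurements.

```r
# Hypothetical heights in cm for four males and four females
height <- c(178, 182, 171, 175, 165, 160, 168, 172)
sex    <- factor(c("M", "M", "M", "M", "F", "F", "F", "F"))

group_means <- tapply(height, sex, mean)   # mean height within each group
group_means

boxplot(height ~ sex, ylab = "Height (cm)")   # side-by-side boxplots, one per group
```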
Analyzing data
Comparing independent samples
In many situations we have two samples that may or may not come from the same population. When two samples are drawn from populations in such a manner that knowing the outcomes of one sample doesn't affect the knowledge of the distribution of the other sample, they are independent samples.
Analyzing data
Comparing independent samples
We want to compare the means of two populations: selecting a random sample from each of the two populations
Population 1 → Sample 1
Population 2 → Sample 2
Analyzing data
Matched pairs or dependent samples
Before (or Pre) → After (or Post): the same cases are measured twice, so the two samples are dependent
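For matched pairs, the analysis works on the within-subject differences. The sketch below uses invented before/after measurements on five subjects and shows that a paired t-test gives the same statistic as a one-sample t-test on the differences.

```r
# Hypothetical before/after measurements on the same five subjects
before <- c(120, 132, 118, 140, 125)
after  <- c(115, 130, 119, 133, 121)

d <- after - before                           # one difference per subject

res <- t.test(after, before, paired = TRUE)   # paired t-test
res$statistic
t.test(d)$statistic                           # identical: one-sample test on the differences
```

Treating these data as two independent samples would ignore the pairing and usually lose power.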
Analyzing data
Independent two sample t-test
If the populations are normally distributed or nearly so, and we want to compare the mean of one population with the mean of another population, then a t-test can be used (cf. nonparametric Wilcoxon test).
Analyzing data
Comparing independent samples – Example: Differing dosages of AZT
AZT was the first FDA-approved antiretroviral drug used in the care of HIV-infected individuals. Higher dosages cause more side effects, but are they more effective? (A study done in 1990 compared dosages of 300 mg and 600 mg (and 1500 mg) (source http://www.aids.org).)
Analyzing data
Comparing independent samples – Example: Differing dosages of AZT
As the p24 antigen can stimulate immune responses, the level of p24 was measured. The measurement of p24 levels for 300 mg and 600 mg is given below:
300 mg  284 279 289 292 287 295 285 279 306 298
600 mg  298 307 297 279 291 335 299 300 306 291
Comparison of the means of two populations
[Figure: density curves of the two samples, titled "density.default(x = x)"; x-axis: 270 to 320 (N = 10, Bandwidth = 4.238); y-axis: density (0.00 to 0.04)]
Analyzing data
> x = c(284, 279, 289, 292,287,295,285,279,306, 298)
> y = c(298,307,297,279,291,335,299,300, 306,291)
> plot(density(x))
> lines(density(y),lty=2,col="red")
> t.test(x,y)
Welch Two Sample t-test
data: x and y
t = -2.034, df = 14.509, p-value = 0.06065
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-22.3557409 0.5557409
sample estimates:
mean of x mean of y
289.4 300.3
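By default `t.test(x, y)` does not assume equal variances (Welch's test), which is why the degrees of freedom are not a whole number. The reported t, df and p-value can be reproduced by hand; the names `vx`, `vy`, `t_stat`, `df` and `p_value` below are illustrative.

```r
x <- c(284, 279, 289, 292, 287, 295, 285, 279, 306, 298)   # 300 mg
y <- c(298, 307, 297, 279, 291, 335, 299, 300, 306, 291)   # 600 mg

vx <- var(x) / length(x)   # squared standard error of mean(x)
vy <- var(y) / length(y)   # squared standard error of mean(y)

t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

c(t = t_stat, df = df, p = p_value)   # approx. -2.034, 14.509, 0.0607
```

Since the p-value is slightly above 0.05 and the 95% confidence interval contains 0, these data do not show a difference in mean p24 level at the 5% level.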
Hands-on practice