
COMP2420/6420
Data Management, Analysis and Security: Data Analysis I, II and Experimental Thinking – 2

Health and Biosecurity CSIRO Feb 2022


Bivariate data
Matched-pairs or dependent samples – Example: Pre- and post-tests
Suppose we would like to know whether the COMP2420 class improves students’ understanding of ”data management and security”.
One way to assess the effectiveness of the class is with a pre-test and a post-test.

Bivariate data
Matched-pairs or dependent samples – Example: Pre- and post-tests
Let’s assume that we have 10 students’ pairs (pre- and post-) of scores:

Pre:  77 56 64 60 57 53 72 62 65 66
Post: 88 74 83 68 58 50 67 64 74 60
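For matched pairs, the analysis works on the per-student differences rather than on the two columns separately. A minimal Python sketch of this idea using the scores above (the paired t statistic shown here is one standard way to assess the mean difference; it is not taken from the slides):

```python
from math import sqrt
from statistics import mean, stdev

# Pre- and post-test scores for the 10 students (from the table above)
pre  = [77, 56, 64, 60, 57, 53, 72, 62, 65, 66]
post = [88, 74, 83, 68, 58, 50, 67, 64, 74, 60]

# Matched pairs: analyse the per-student differences, not the raw columns
diffs = [b - a for a, b in zip(pre, post)]

d_bar = mean(diffs)                             # average improvement
t = d_bar / (stdev(diffs) / sqrt(len(diffs)))   # paired t statistic

print(f"mean difference = {d_bar:.1f}, t = {t:.2f}")
```

The average improvement here is 5.4 points; whether that is statistically significant is exactly the kind of question the pre/post design is meant to answer.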

Bivariate data
Matched-pairs or dependent samples
[Figure: boxplots of the pre and post scores, and of the post − pre differences]

Bivariate data
Relationship in numeric data
Consider bivariate data with a natural pairing, (x1, y1), . . . , (xn, yn). A correlation problem arises when we ask whether there is any significant relationship between the pair of variables.
For example, is there any significant relationship between stock price and the inflation rate, between temperature and electricity use?

Bivariate data
Relationship in numeric data
Correlation and regression analysis are used to determine both the strength and the nature of a relationship between two variables:
- Correlation: provides a measure of the relative strength of the relationship
- Regression: develops a mathematical equation that relates the variable of interest to the other variable(s)

Bivariate data
Relationship in numeric data
A scatterplot is a good place to start when investigating a relationship between two numeric variables.
A scatterplot plots the values of one data vector against another as points (x,y) in a Cartesian plane.

Bivariate data
Relationship in numeric data – Example: Kids’ weights
The data kid.weights contains height and weight for children ages 0 to 12 years.
> kid.weights[1:3,]
  age weight height gender
1  58     38     38      M
2 103     87     43      M
3  87     50     48      M
...
What is the relationship between height and weight?

Bivariate data
Relationship in numeric data – Example: Kids’ weights
[Figure: two scatterplots of kid.weights$weight against kid.weights$height]

Bivariate data
Correlation
The correlation between two variables numerically describes positive or negative relationships between variables.
[Figure: scatterplots illustrating positive and negative correlation]
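As a sketch, Pearson’s correlation coefficient can be computed directly from its definition, r = Sxy / √(Sxx · Syy). The code below is a minimal Python illustration that reuses the pre/post test scores from earlier as example data (the function name is ours, not from the slides):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Reusing the pre/post test scores from earlier as a worked example
pre  = [77, 56, 64, 60, 57, 53, 72, 62, 65, 66]
post = [88, 74, 83, 68, 58, 50, 67, 64, 74, 60]
print(round(pearson_r(pre, post), 3))   # a moderately strong positive correlation
```

r near +1 indicates a strong positive relationship, r near −1 a strong negative one, and r near 0 little linear relationship.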

Bivariate data
Correlation and regression
Both correlation and regression describe how two (or more) variables are interrelated. However, if there is a relationship between the pairs of data, we hope it can be used to assist in making predictions.
One of the primary uses of regression is to make predictions of the response value for new values of the predictor.
The correlation coefficient can’t do this, but regression can!

Bivariate data
Correlation and regression
As mentioned on previous slides, the regression model is based on a mathematical equation between the paired data, X and Y:
- Y = a + bX: just a straight line!
- Y = a + bX + error: so that this covers all Y
- Ŷ = a + bX: redefined! Therefore, error = Y − Ŷ
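The coefficients a and b in Ŷ = a + bX are usually chosen by least squares, i.e. to minimise Σ(Y − Ŷ)². A minimal Python sketch on made-up data (the x and y values are illustrative only):

```python
def least_squares(x, y):
    """Fit Y-hat = a + b*X by minimising the sum of squared errors."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return a, b

# Illustrative data: roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.0, 8.0]
a, b = least_squares(x, y)
errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # error = Y - Y-hat
print(a, b)
```

A useful property of the least-squares line is that the errors Y − Ŷ always sum to zero.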

Bivariate data
Correlation and regression – Example: Kids’ weights (linear)
> lm.weight = lm(height ~ weight, data = kid.weights)
> summary(lm.weight)

Call:
lm(formula = height ~ weight, data = kid.weights)

Residuals:
     Min       1Q   Median       3Q      Max
-21.8710  -3.4728  -0.1625   3.8447  15.5698

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.88031    0.70955   32.25   <2e-16 ***
weight       0.35545    0.01553   22.88   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.081 on 248 degrees of freedom
Multiple R-squared: 0.6786, Adjusted R-squared: 0.6773
F-statistic: 523.6 on 1 and 248 DF, p-value: < 2.2e-16

Bivariate data
Correlation and regression – Example: Kids’ weights (quadratic)

> kid.weights$heightsq = kid.weights$height^2
> lm.weight1 = lm(weight ~ heightsq, data = kid.weights)
> summary(lm.weight1)

Call:
lm(formula = weight ~ heightsq, data = kid.weights)

Residuals:
    Min      1Q  Median      3Q     Max
-30.053  -6.207  -2.084   2.963  63.405

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1089396  1.6503880   1.884   0.0608 .
heightsq    0.0243591  0.0009804  24.847   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.31 on 248 degrees of freedom
Multiple R-squared: 0.7134, Adjusted R-squared: 0.7123
F-statistic: 617.4 on 1 and 248 DF, p-value: < 2.2e-16

Bivariate data
Correlation and regression – Example: Kids’ weights

[Figure: scatterplots of kid.weights$weight against kid.weights$height (left) and against kid.weights$heightsq (right), with the fitted regression lines]

Bivariate data
Correlation and regression

R-squared, the coefficient of determination, measures how much of the variation in the data the regression model explains. In the kids’ weights example, the first model explains about 68% of the variation while the second model explains about 71%.

From the first model, we have the correlation coefficient r = 0.823:
r^2 = 0.823^2 = 0.678 = R^2

Bivariate data
Correlation and regression – Example

It is often better to do a regression analysis rather than just a correlation analysis:
- Correlation coefficient: only a single measure of the relative strength of the relationship, not much information about the paired data
- Regression analysis: much more information, and it can also be used for making predictions

Categorical data
Correlation and regression

Apple would like to know whether customer preference is the same for 4 different phones. A random sample of 113 customers was observed, with the following results:

Phone                                    iphone 6  iphone 6 plus  iphone 7 plus  iphone 8
Number of customers selecting the phone        30             36             25        22

Categorical data
Correlation and regression

Are the 4 phones equally preferable or not?
With categorical data, to test whether the population proportion for each category is as claimed, the Chi-square (χ2) test is used.
- One categorical variable: the χ2-test is often referred to as the ’goodness-of-fit’ test
- Two categorical variables: the χ2-test for independence

Categorical data
Correlation and regression – Example

iphone 6  iphone 6 plus  iphone 7 plus  iphone 8
      30             36             25        22

> phone = c(30, 36, 25, 22)
> p = rep(0.25, 4)
> p
[1] 0.25 0.25 0.25 0.25
> chisq.test(phone, p = p)
Chi-squared test for given probabilities
data: phone
X-squared = 3.9912, df = 3, p-value = 0.2624
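The X-squared value above can be reproduced by hand from the goodness-of-fit formula X² = Σ(O − E)²/E, with E = 113/4 = 28.25 under the hypothesis that the four phones are equally preferable. A small Python check:

```python
# Observed counts for the four phones and equal expected proportions
observed = [30, 36, 25, 22]
n = sum(observed)                      # 113 customers
expected = [n * 0.25] * 4              # 28.25 per phone under H0

# Goodness-of-fit statistic: X^2 = sum of (O - E)^2 / E
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 4))                    # matches R's X-squared = 3.9912
```

With 3 degrees of freedom this statistic is well below the usual critical value, consistent with the p-value of 0.2624 reported by R.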

Categorical data
Correlation and regression – Example
As far as phone preference is concerned, there is no indication that any one specific iphone is any better or any worse than any other.

Categorical data
Correlation and regression – Example: Seat-belt-usage
The table shows the seat-belt-usage data. In this data, does the fact that a parent has his/her seat belt buckled affect the chance that the child’s seat belt will be buckled?
                  Child buckled   Child unbuckled
Parent buckled               56                 8
Parent unbuckled              2                16

Categorical data
Correlation and regression – Example: Seat-belt-usage
> seatbelt = rbind(c(56,8), c(2,16))
> seatbelt
     [,1] [,2]
[1,]   56    8
[2,]    2   16
> chisq.test(seatbelt)
Pearson’s Chi-squared test with Yates’ continuity correction

data: seatbelt
X-squared = 35.995, df = 1, p-value = 1.978e-09
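R’s result uses Yates’ continuity correction, X² = Σ(|O − E| − 0.5)²/E, where the expected counts E are computed from the row and column totals under independence. A small Python check of the 35.995 figure:

```python
# Seat-belt table: rows = parent (buckled, unbuckled), cols = child
table = [[56, 8], [2, 16]]

row = [sum(r) for r in table]          # row totals: [64, 18]
col = [sum(c) for c in zip(*table)]    # column totals: [58, 24]
n = sum(row)                           # 82 parent/child pairs

# Expected counts under independence, then the Yates-corrected X^2
x2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n
        x2 += (abs(table[i][j] - e) - 0.5) ** 2 / e

print(round(x2, 3))                    # matches R's X-squared = 35.995
```

The very large statistic (tiny p-value) is what leads to the conclusion on the next slide that the two variables are dependent.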

Categorical data
Correlation and regression – Example: Seat-belt-usage
There is a relationship or association between parent-seat-belt buckled and child-seat-belt buckled.
The two variables are dependent !!

Experimental Thinking
On 60 Minutes, there was a discussion about ”food coloring and children’s behaviour”. They talked with experts and afflicted parents about how food coloring is bad, how it is being phased out in some places, why we are not doing the same, and so on.
This is reasonable and studies have shown that coloring can lead to hyperactivity in (some) children.

Experimental Thinking
They got some parents to lend their children (the kids all looked to be around 6-10) and put them in two groups. One group would have a healthy, color-free afternoon tea and the other group would have an afternoon tea full of colorings.
They tested the children by getting them to do a drawing and some writing, both before and after the food.

Experimental Thinking
The children with the color-free food showed very little change, while in the color group there was a marked decrease in competency. The kids in the color group were bouncing off the walls.
Sounds like a great demonstration that showed up exactly the concerns that exist about the colorings.

Experimental Thinking
There were some problems in this experiment. What were they?
- Control of the coloring / non-coloring foods
- Randomisation
- Any other things?

Experimental Thinking
Example 1 – Three different screens
Mobile devices are trending larger and larger these days. Early smartphone screens were generally less than 4 inches, whereas recent smartphones’ screens are much bigger than 4 inches.
Why, then, do users want a bigger screen than on the early mobile phones?

Experimental Thinking
Example 1 – Three different screens
In general, what do you do with your smart phone?
games, entertainment, web search, social networking and education.
”Investigate user web-search performance and behaviour on different size of screens.”

Experimental Thinking
Example 1 – Three different screens
Research question → what measurements? → how many?

Experimental Thinking
Example 1 – Three different screens
What sort of things do we need for our study:
- measures of search performance and behaviour
- task: how many, and what level of difficulty
- screen size: what size, and how many different screen sizes?
- cohort: target cohort, gender, number of participants, etc.

Experimental Thinking
Example 1 – Three different screens
Suppose we have a total of 21 participants, 9 tasks (difficulties of 9 tasks are the same) and 3 different screen sizes.
[Diagram: allocation of the 21 participants across the small, medium and large screens]

Experimental Thinking
Principles of experimental design
Three basic principles behind any experimental design (3R):
- Randomization: the random allocation of treatments to the experimental units (the objects on which the response and factors are observed)
- Replication: the repetition of a treatment
- Reduce noise (control): control any sources of variation
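The randomization (and balance) principles can be sketched in a few lines: shuffle the participants, then split them into equally sized treatment groups. The group names, participant IDs and seed below are illustrative only, not from the slides:

```python
import random

# Randomly allocate 21 participants to 3 screen-size groups of 7 each
participants = list(range(1, 22))
random.seed(42)                 # fixed seed so this sketch is reproducible
random.shuffle(participants)    # randomization: random allocation to treatments

groups = {
    "small":  participants[0:7],
    "medium": participants[7:14],
    "large":  participants[14:21],   # balance: equal group sizes
}
for screen, members in groups.items():
    print(screen, sorted(members))
```

Every participant appears in exactly one group, and all groups are the same size.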

Experimental Thinking
Principles of experimental design
Other important principles behind any experimental design (3B):
- Blocking: reduce bias from confounding factors and also directly control for the effects of a factor
- Balance: each treatment group has the same size
- Blind: subjects don’t know which treatment they receive

Experimental Thinking
Example 1 – Three different screens
Case 1. Allocate the participants randomly into three groups (7 per screen), and each participant performs the 9 tasks on that one screen.
What do you think about this? If you think this is not a good idea at all, then why? (based on principles of experimental design).
In other words, what are the good things and bad things?

Experimental Thinking
Example 1 – Three different screens
Case 2. Each participant performs a total of 9 search tasks (3 tasks on each screen). The orders of tasks and screens are randomly allocated for each participant.
What do you think about this? If you think this is not a good idea at all, then why? (based on principles of experimental design).
In other words, what are the good things and bad things?

Experimental Thinking
Design experiments and analysis
After collecting data based on the experimental design, we analyse the data using appropriate statistical methods: analysis of variance (ANOVA), generalized linear models (GLMs), linear mixed effects models (LMMs), etc.
The choice depends not only on the research question but also on the response type!

Experimental Thinking
Example 2 – Pagination vs. scrolling
Most smartphones use vertical scrolling for mobile web search, but somehow we are more familiar with horizontal pagination.
”Investigate the two control types, horizontal pagination and vertical scrolling, for mobile search.”

Experimental Thinking
Example 2 – Pagination vs. scrolling
Define the hypotheses prior to conducting experiment.
What could be the research questions?
H1. Users spend less search time with vertical scrolling than with horizontal pagination.
H2. Users show higher search accuracy with the horizontal control.
...

Analysis of variance (ANOVA)
Three assumptions
There are three assumptions for analysis of variance (ANOVA):
- Normality: the populations of responses are normally distributed
- Equal variance: equal variance for all treatment groups
- Independence: random and independent samples

Analysis of variance (ANOVA)
Suppose we wish to compare three population means. Assume that each of the 3 treatments consists of 5 experimental units:
Treatment 1: 15.1 15.0 14.9 14.8 15.2   x̄1 = 15.0, s1 = 0.16
Treatment 2: 20.1 20.0 20.0 19.9 20.0   x̄2 = 20.0, s2 = 0.07
Treatment 3: 24.0 24.2 24.1 23.9 23.8   x̄3 = 24.0, s3 = 0.16
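A one-way ANOVA F statistic for this example can be computed by hand: F = (SSB/(k−1)) / (SSW/(n−k)), where SSB is the between-group and SSW the within-group sum of squares. A Python sketch using the table above, which matches the R anova output shown later in the slides:

```python
from statistics import mean

# The three treatment groups from the table above
groups = [
    [15.1, 15.0, 14.9, 14.8, 15.2],
    [20.1, 20.0, 20.0, 19.9, 20.0],
    [24.0, 24.2, 24.1, 23.9, 23.8],
]

grand = mean(x for g in groups for x in g)   # grand mean over all observations

# Between-group and within-group sums of squares
ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

k = len(groups)                      # 3 treatments
n = sum(len(g) for g in groups)      # 15 observations
f = (ssb / (k - 1)) / (ssw / (n - k))
print(round(ssb, 2), round(ssw, 2), round(f, 1))  # F matches R's anova output
```

Because the within-sample variation (SSW = 0.22) is tiny compared with the between-sample variation (SSB = 203.33), F is enormous and the group means clearly differ.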

Analysis of variance (ANOVA)
Now suppose the three treatments have the same means but much larger within-sample variation:

Treatment 1: 10.2 34.3 9.8 20.5 0.2    x̄1 = 15.0, s1 = 12.96
Treatment 2: 28.1 2.4 12.6 39.2 17.7   x̄2 = 20.0, s2 = 14.18
Treatment 3: 24.0 9.2 40.1 43.9 2.8    x̄3 = 24.0, s3 = 18.19

Analysis of variance (ANOVA)
Example 1 vs. Example 2
The difference between example 1 and example 2 is the within-sample variation. That is, example 1’s within-sample variation is much smaller than example 2’s.
It is the within-sample variation that plays a major part in deciding whether or not population means differ significantly.

Analysis of variance (ANOVA)
Boxplots for example 1 and example 2
[Figure: side-by-side boxplots of the three treatments, for example 1 (left) and example 2 (right)]

Analysis of variance (ANOVA)
ANOVA Example 1 using R
Outputs from R:
> anova(fit.aov)
Analysis of Variance Table
Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
group      2 203.33 101.667  5545.5 < 2.2e-16 ***
Residuals 12   0.22   0.018
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance (ANOVA)
Multiple comparison

ANOVA shows that there were significant differences in means between the groups. Then which pairs are significantly different?

> TukeyHSD(fit.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = y ~ group, data = aov.data)
$group
                      diff      lwr      upr p adj
Treatment2-Treatment1    5 4.771538 5.228462     0
Treatment3-Treatment1    9 8.771538 9.228462     0
Treatment3-Treatment2    4 3.771538 4.228462     0

Analysis of variance (ANOVA)
ANOVA Example 2 using R
Outputs from R:
> anova(fit.aov1)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
group 2 203.33 101.67 0.4358 0.6566
Residuals 12 2799.62 233.30

Analysis of variance (ANOVA)
Multiple comparison
> TukeyHSD(fit.aov1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = y ~ group, data = aov.data1)
$group
                      diff       lwr      upr     p adj
Treatment2-Treatment1    5 -20.77226 30.77226 0.8643164
Treatment3-Treatment1    9 -16.77226 34.77226 0.6315697
Treatment3-Treatment2    4 -21.77226 29.77226 0.9105096

Analysis of variance (ANOVA)
What kind of ANOVA to use depends mainly not only on your research question(s) but also on the experimental design you have.
One-way ANOVA, two-way ANOVA, ANOVA with a block structure, ANCOVA, etc.

Hands-on practice
