INFT6201 – BIG DATA TUTORIAL PROJECT 2

SCHOOL OF DESIGN, COMMUNICATION AND IT

This tutorial project is based on a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is available from the UCI Machine Learning Repository (Lichman, 2013):

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

EXERCISE 1 (1 MARK) [R-CODE]

Use ggplot() to create a box plot that shows BMI on the y-axis, separately for women who have and have not been diagnosed with diabetes. Note: Only include observations with a BMI value greater than 0.
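As a rough orientation, the filtering and plotting steps could be sketched as follows. The data frame here is simulated (the values are made up) and only reuses the column names BMI and diabetes from the dataset description; "plotdata" is a name chosen for illustration.

```r
library(ggplot2)

# Simulated stand-in for the tutorial data (values are invented for illustration)
set.seed(1)
pimadata <- data.frame(
  BMI      = c(0, round(rnorm(49, mean = 32, sd = 6), 1)),
  diabetes = rep(c(0, 1), times = 25)
)

# Keep only observations with a BMI value greater than 0
plotdata <- subset(pimadata, BMI > 0)

# One box per diabetes group; factor() turns the 0/1 code into a discrete axis
p <- ggplot(plotdata, aes(x = factor(diabetes), y = BMI)) +
  geom_boxplot() +
  labs(x = "Diabetes (0 = negative, 1 = positive)", y = "BMI")
print(p)
```

Converting diabetes with factor() matters here: left as a number, ggplot() would treat it as a continuous x variable and draw a single box.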

EXERCISE 2 (2 MARKS) [R-CODE]

Use ggplot() to create a violin plot that shows the TSFT value on the y-axis, separately for women who have and have not been diagnosed with diabetes. Use the “Paired” colour palette from the RColorBrewer library to fill the violin plots. Add a box plot on top of the violin plot and add a point that indicates the mean value. Note: Only include observations with a TSFT value greater than 0.
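One possible way to layer these elements is sketched below on simulated data (invented values, same column names as the dataset description). The box width, point shape and axis labels are arbitrary choices, not requirements of the exercise.

```r
library(ggplot2)

# Simulated stand-in for the tutorial data (values are invented for illustration)
set.seed(2)
pimadata <- data.frame(
  TSFT     = c(0, 0, round(runif(58, min = 10, max = 50))),
  diabetes = rep(c(0, 1), times = 30)
)

# Keep only observations with a TSFT value greater than 0
plotdata <- subset(pimadata, TSFT > 0)

p <- ggplot(plotdata, aes(x = factor(diabetes), y = TSFT, fill = factor(diabetes))) +
  geom_violin() +                                  # violin filled by group
  geom_boxplot(width = 0.15, fill = "white") +     # narrow box plot drawn on top
  stat_summary(fun = mean, geom = "point",         # one point per group at the mean
               shape = 18, size = 3) +
  scale_fill_brewer(palette = "Paired") +          # RColorBrewer "Paired" palette
  labs(x = "Diabetes", y = "TSFT (mm)", fill = "Diabetes")
print(p)
```

The layers are drawn in the order they are added, so the violin has to come first for the box plot and the mean point to remain visible.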

EXERCISE 3 (1 MARK) [R-CODE]

Use the subset() command to create a subset of the dataframe that only includes observations with BMI > 0 and TSFT > 0. Name this dataframe “pimadatasub”. Then, using the newly created data frame “pimadatasub”, use the custom winsor() function discussed in the lecture slides in week 3 to create a new variable BMIwinsor based on the variable BMI. Use a multiplier of 1.5.

To make sure that the winsorising worked, compare the two variables by creating simplified box plots using the following commands.

with(pimadatasub, boxplot(BMI))
with(pimadatasub, boxplot(BMIwinsor))
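The winsor() function from the lecture slides is not reproduced in this document. The sketch below uses a hypothetical stand-in that caps values at the box plot fences (Q1 minus multiplier times IQR, Q3 plus multiplier times IQR); the version from the lecture may differ, and the small data frame contains invented values.

```r
# Hypothetical stand-in for the lecture's winsor() function (assumption):
# values beyond the fences are replaced by the fence values
winsor <- function(x, multiplier = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - multiplier * iqr
  upper <- q[2] + multiplier * iqr
  pmin(pmax(x, lower), upper)
}

# Invented example values; the first two rows contain a zero and are dropped
pimadata <- data.frame(
  BMI  = c(0, 22, 25, 27, 30, 31, 33, 35, 36, 67),
  TSFT = c(20, 0, 18, 25, 30, 28, 35, 40, 32, 45)
)

pimadatasub <- subset(pimadata, BMI > 0 & TSFT > 0)
pimadatasub$BMIwinsor <- winsor(pimadatasub$BMI, multiplier = 1.5)

# The extreme BMI of 67 is pulled back to the upper fence, the rest is unchanged
print(pimadatasub[, c("BMI", "BMIwinsor")])
```

If the winsorising worked, the second box plot should no longer show points beyond the whiskers.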

EXERCISE 4 (2 MARKS) [R-CODE]

Based on the dataset “pimadatasub”, create a new column “agecat” in the data frame that describes the age category of a person. Distinguish between the following categories: “21 to 30”, “31 to 45”, “46 to 60”, and “61 to 85”. Convert the column into a factor variable using the as.factor() command.

Use ggplot() to create a scatterplot for BMI over TSFT. Indicate the different age categories by colouring the points in the scatterplot with the “GrandBudapest” palette of the “wesanderson” library package.
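The binning step could be sketched with cut(), as below. This assumes non-overlapping bins of 21 to 30, 31 to 45, 46 to 60 and 61 to 85; the break points are an assumption, and the ages are invented so that each bin boundary is exercised.

```r
# Invented ages placed on the edges of each assumed bin
pimadatasub <- data.frame(age = c(21, 30, 31, 45, 46, 60, 61, 81))

# cut() uses right-closed intervals by default: (20,30], (30,45], (45,60], (60,85]
pimadatasub$agecat <- cut(
  pimadatasub$age,
  breaks = c(20, 30, 45, 60, 85),
  labels = c("21 to 30", "31 to 45", "46 to 60", "61 to 85")
)

# cut() already returns a factor, so as.factor() leaves it unchanged
pimadatasub$agecat <- as.factor(pimadatasub$agecat)
print(table(pimadatasub$agecat))
```

The resulting factor can then be mapped to the colour aesthetic of the scatterplot.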

EXERCISE 5 (1 MARK) [R-CODE]

Based on the dataset “pimadatasub”, use the ddply() function of the package “plyr” to create a data frame with the means and standard deviations of BMI and TSFT for the different age categories (variable: agecat, cf. Exercise 4) and for the two different results of the diabetes test (positive / negative). The output should look like this:
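A grouped summary of this kind could be sketched as follows. The data values and the output column names (meanBMI, sdBMI, and so on) are illustrative assumptions, since the expected output is not reproduced here.

```r
library(plyr)

# Invented example data with two age categories and both test results
pimadatasub <- data.frame(
  BMI      = c(24, 26, 30, 34, 28, 32, 36, 40),
  TSFT     = c(20, 22, 25, 31, 27, 29, 33, 39),
  agecat   = factor(rep(c("21 to 30", "31 to 45"), each = 4)),
  diabetes = rep(c(0, 0, 1, 1), times = 2)
)

# One output row per combination of agecat and diabetes
groupstats <- ddply(
  pimadatasub, .(agecat, diabetes), summarise,
  meanBMI  = mean(BMI),  sdBMI  = sd(BMI),
  meanTSFT = mean(TSFT), sdTSFT = sd(TSFT)
)
print(groupstats)
```

Further variables can be added by appending more name = function(variable) pairs to the summarise call.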

EXERCISE 6 (2 MARKS) [R-CODE]

Based on the dataset “pimadatasub”, use Bartlett’s test to test for variance homogeneity in the variable DBP across the different age categories (variable: agecat, cf. Exercise 4). Interpret the results of the test and decide whether we should assume that the variances are homogeneous.

Then, use a one-way Analysis of Variance (ANOVA) to test whether there is a difference in mean DBP across the different age categories and interpret the result. Conduct a post hoc analysis to determine which groups are significantly different from each other. How does the result of the test of variance homogeneity affect the post hoc analysis?
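The sequence of tests could look roughly like the sketch below, which uses base R only. The data are simulated with invented group means, so the printed results say nothing about the real dataset; the choice of Tukey's HSD and of Holm-adjusted pairwise t-tests is an assumption, and other post hoc procedures are possible.

```r
# Simulated DBP values for three illustrative age categories (invented numbers)
set.seed(42)
pimadatasub <- data.frame(
  DBP    = c(rnorm(30, 68, 10), rnorm(30, 74, 10), rnorm(30, 78, 10)),
  agecat = factor(rep(c("21 to 30", "31 to 45", "46 to 60"), each = 30))
)

# Bartlett's test: the null hypothesis is that all group variances are equal
bt <- bartlett.test(DBP ~ agecat, data = pimadatasub)
print(bt)

# One-way ANOVA on the group means
fit <- aov(DBP ~ agecat, data = pimadatasub)
print(summary(fit))

# Post hoc option 1: Tukey's HSD, which relies on equal variances
tukey <- TukeyHSD(fit)
print(tukey)

# Post hoc option 2: pairwise t-tests without a pooled standard deviation,
# which may be preferable if Bartlett's test suggests unequal variances
pw <- pairwise.t.test(pimadatasub$DBP, pimadatasub$agecat,
                      p.adjust.method = "holm", pool.sd = FALSE)
print(pw)
```

Setting pool.sd = FALSE makes each pairwise comparison a Welch-type test, which is how the outcome of the variance test feeds into the choice of post hoc procedure.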

EXERCISE 7 (1 MARK) [R-CODE]

Based on the dataset “pimadatasub”, compare the number of times a woman was pregnant across the different age groups (variable: agecat, cf. Exercise 4). Which test should we use to test whether there is a significant difference, and why? Conduct the test in R and interpret the result.
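One candidate that is often considered for a count variable compared across more than two groups is the Kruskal-Wallis rank sum test, since counts are frequently skewed and not normally distributed. Whether it is the appropriate choice here should be justified from the data. The sketch below only shows the call on simulated counts with invented group means.

```r
# Simulated pregnancy counts for three illustrative age categories (invented numbers)
set.seed(7)
pimadatasub <- data.frame(
  timesPregnant = c(rpois(30, 2), rpois(30, 4), rpois(30, 6)),
  agecat        = factor(rep(c("21 to 30", "31 to 45", "46 to 60"), each = 30))
)

# Quick look at the distribution per group before choosing a test
print(aggregate(timesPregnant ~ agecat, data = pimadatasub, FUN = median))

# Kruskal-Wallis test: the null hypothesis is that all groups share the same distribution
kw <- kruskal.test(timesPregnant ~ agecat, data = pimadatasub)
print(kw)
```

The degrees of freedom reported by the test equal the number of groups minus one.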

REFERENCES

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

DATASET

Pima Indians Diabetes Database

Description

A diabetes dataset. All patients here are females at least 21 years old of Pima Indian heritage. Note: Even though the dataset donors made no such statement, it seems very likely that zero values encode missing data for several variables.

Usage

Pimadata

Format

A data frame with 768 observations on the following 9 variables.
timesPregnant  Number of times pregnant
PCG            Plasma glucose concentration at 2 hours in an oral glucose tolerance test
DBP            Diastolic blood pressure (mm Hg)
TSFT           Triceps skin fold thickness (mm)
insulin        2-Hour serum insulin (mu U/ml)
BMI            Body mass index (weight in kg/(height in m)^2)
DPF            Diabetes pedigree function. It provides some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gives an idea of the hereditary risk one might have with the onset of diabetes mellitus.
age            Age (Years)
diabetes       1 = tested positive for diabetes, 0 = tested negative for diabetes

Source

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
