Section A
This section tests your data transformation, ggplot, and base function skills. Where necessary, handle overplotting and label the axes and legends properly.
Problem 1
We would like to create a simulated (fake) data frame containing employee height (cm) information from two companies: Alpha and Beta. The data frame should also include the companies’ area codes. (A company may have multiple subsidiaries in different areas.)
• Create a column Area Code with 2000 rows containing only the 26 upper-case letters of the alphabet. The letters should be assigned to the 2000 rows at random (sampling with replacement).
• Create a column Company with 2000 rows containing only two values, “Alpha” and “Beta”. For convenience, the first 1000 rows should be “Alpha” and the last 1000 rows should be “Beta”.
• Create a column Employee Height (cm) with 2000 rows. For convenience, the first 1000 rows should be randomly generated with mean = 160, sd = 5, and the last 1000 rows with mean = 170, sd = 5.
Then create a density plot of the height, mapping company as the fill.
Hint: the built-in “LETTERS” contains the 26 upper-case letters.
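The three bullets above can be sketched roughly as follows. This is only one possible approach, not a reference solution: the column names area_code, company, and height_cm, the seed, and the use of rnorm() (the problem gives a mean and sd but does not name a distribution) are assumptions made for illustration.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible

n <- 1000  # rows per company

# Assumed column names; the handout calls them Area Code, Company,
# and Employee Height (cm).
employees <- data.frame(
  area_code = sample(LETTERS, 2 * n, replace = TRUE),
  company   = rep(c("Alpha", "Beta"), each = n),
  height_cm = c(rnorm(n, mean = 160, sd = 5),   # first 1000 rows: Alpha
                rnorm(n, mean = 170, sd = 5))   # last 1000 rows: Beta
)

# A semi-transparent fill keeps both overlapping densities visible.
height_density <- ggplot(employees, aes(x = height_cm, fill = company)) +
  geom_density(alpha = 0.5) +
  labs(x = "Employee Height (cm)", y = "Density", fill = "Company")

print(height_density)
```

Setting alpha below 1 is one simple way to deal with the overlap between the two densities.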
Problem 2
Continue with the previous data frame. For each area, compute the average employee height for each company. Then plot a dodged bar chart of area code versus average height, mapping company as the fill.
Plot Example: dodged bar chart with Area Code (A to Z) on the x-axis, Average Employee Height(cm) on the y-axis (ticks from 150 to 175), and Company (Alpha, Beta) as the fill legend.
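A minimal sketch of the group-wise average and the dodged bar chart, under stated assumptions: the data frame is re-simulated here so the block runs on its own, the column names are illustrative, and the y-axis window of 150 to 175 is taken from the tick values in the plot example rather than from any stated requirement.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for reproducibility
n <- 1000

# Re-create the simulated data (assumed column names).
employees <- data.frame(
  area_code = sample(LETTERS, 2 * n, replace = TRUE),
  company   = rep(c("Alpha", "Beta"), each = n),
  height_cm = c(rnorm(n, mean = 160, sd = 5), rnorm(n, mean = 170, sd = 5))
)

# One row per (area code, company) pair, holding the mean height.
avg_height <- aggregate(height_cm ~ area_code + company,
                        data = employees, FUN = mean)

avg_height_plot <- ggplot(avg_height,
                          aes(x = area_code, y = height_cm, fill = company)) +
  geom_col(position = "dodge") +
  # coord_cartesian() zooms in without dropping the bars, which start at 0.
  coord_cartesian(ylim = c(150, 175)) +
  labs(x = "Area Code", y = "Average Employee Height(cm)", fill = "Company")

print(avg_height_plot)
```

Zooming with coord_cartesian() rather than setting scale limits is a deliberate choice here: scale limits would discard bars whose range extends outside the window.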
Problem 3
Insert THREE more columns into the previous data frame.
• First column, Employee Weight (kg): 2000 randomly generated values (mean = 65, sd = 10).
• Second column, “BMI”: follows the formula weight(kg)/[(height(cm)/100)^2]
• Third column, BMI Categories: contains 4 labels, “Underweight”, “Normal weight”, “Overweight”, and “Obesity”, assigned to each row according to the column “BMI”.
– When BMI <= 18.5, “Underweight”
– When 18.5 < BMI <= 25, “Normal weight”
– When 25 < BMI <= 30, “Overweight”
– When BMI > 30, “Obesity”
Then create a scatterplot visualizing Employee Height(cm) versus Employee Weight(kg), mapping BMI Categories as color, and facet this plot by Company.
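One way to sketch the three new columns and the faceted scatterplot is shown below. It is illustrative only: the column names, the seed, and the use of rnorm() for the weights are assumptions, and the data frame is re-simulated so the block is self-contained. cut() with right = TRUE produces intervals that are closed on the right, which matches the "<=" boundaries in the category rules.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for reproducibility
n <- 1000

employees <- data.frame(
  company   = rep(c("Alpha", "Beta"), each = n),
  height_cm = c(rnorm(n, mean = 160, sd = 5), rnorm(n, mean = 170, sd = 5))
)

# Weight: assumed to be normally distributed (the handout gives only mean and sd).
employees$weight_kg <- rnorm(nrow(employees), mean = 65, sd = 10)

# BMI = weight (kg) / (height (m))^2
employees$bmi <- employees$weight_kg / (employees$height_cm / 100)^2

# Right-closed intervals: (-Inf, 18.5], (18.5, 25], (25, 30], (30, Inf)
bmi_breaks <- c(-Inf, 18.5, 25, 30, Inf)
bmi_labels <- c("Underweight", "Normal weight", "Overweight", "Obesity")
employees$bmi_category <- cut(employees$bmi, breaks = bmi_breaks,
                              labels = bmi_labels, right = TRUE)

bmi_scatter <- ggplot(employees,
                      aes(x = height_cm, y = weight_kg, color = bmi_category)) +
  geom_point(alpha = 0.4) +  # transparency reduces overplotting
  facet_wrap(~ company) +
  labs(x = "Employee Height (cm)", y = "Employee Weight (kg)",
       color = "BMI Categories")

print(bmi_scatter)
```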
Section B
Section B uses the National Health and Nutrition Examination Survey (NHANES) 2015-2016 Demographics Data from the Centers for Disease Control and Prevention.
Download the NHANES 2015-2016 Demographics data (XPT file) from: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Demographics&CycleBeginYear=2015
To read the data manual, see: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm
For details and an introduction to NHANES, see: https://youtu.be/GmnN2r5J0YA
Load the package “haven” (one of the packages from “tidyverse”) and use read_xpt() to import the dataset into R.
Problem 1
Create a new data frame with the following columns:
• Race information, with only the categories Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, and Other Race
• The ratio of family income to the poverty line
• The above ratio with its decimals removed (e.g. 2.61 -> 2), stored as categorical data (“Annual family income value”): 0, 1, 2, 3, 4, and 5
• The proportion of families of each race/ethnicity among all families
• The proportion of families of each race/ethnicity among all families at each annual family income value: 0, 1, 2, 3, 4, and 5
Then create a bar chart of the annual family income value (x-axis) versus the proportion of Black families among all families at each annual family income value (y-axis). Include a horizontal reference line whose y value equals the proportion of Black families among all families.
Are Black families over- or under-represented in poverty? What else do you notice about the chart?
Hint: an annual family income value of 0 (an income-to-poverty ratio below 1) means the family is in poverty.
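The transformation and chart can be sketched on a small toy table. Everything here is an assumption to check against the data manual: the variable names RIDRETH1 (race/Hispanic origin, coded 1 to 5) and INDFMPIR (ratio of family income to poverty) reflect one reading of the codebook, rows with a missing ratio are simply dropped, and the toy values are made up for illustration and are not real NHANES records. The sketch computes a summary table for one group rather than the per-row proportion columns.

```r
library(ggplot2)

# Toy stand-in for the imported table: made-up values, NOT real NHANES data.
demo <- data.frame(
  RIDRETH1 = c(1, 2, 3, 3, 4, 4, 5, 3, 4, 1, 3, 4),
  INDFMPIR = c(0.4, 1.2, 2.61, 5.0, 0.9, 0.3, 3.3, 4.7, 1.8, NA, 0.7, 2.2)
)

race_labels <- c("Mexican American", "Other Hispanic", "Non-Hispanic White",
                 "Non-Hispanic Black", "Other Race")

families <- data.frame(
  race         = factor(demo$RIDRETH1, levels = 1:5, labels = race_labels),
  income_ratio = demo$INDFMPIR
)
families <- families[!is.na(families$income_ratio), ]  # assumption: drop missing
# floor() removes the decimals (2.61 -> 2); factor() makes the result categorical.
families$income_value <- factor(floor(families$income_ratio), levels = 0:5)

target    <- "Non-Hispanic Black"
is_target <- families$race == target

overall_share   <- mean(is_target)                                # among all families
share_by_income <- tapply(is_target, families$income_value, mean) # per income value

plot_data <- data.frame(
  income_value = factor(names(share_by_income), levels = 0:5),
  share        = as.numeric(share_by_income)
)

share_plot <- ggplot(plot_data, aes(x = income_value, y = share)) +
  geom_col() +
  # Horizontal reference line at the overall share of the target group.
  geom_hline(yintercept = overall_share, linetype = "dashed") +
  labs(x = "Annual family income value",
       y = paste("Proportion of", target, "families"))

print(share_plot)
```

Bars above the dashed line mark income values where the group is over-represented relative to its overall share. Changing target to another label reuses the same steps for a different group.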
Problem 2
Continue with the above data frame.
Create a bar chart of the annual family income value (x-axis) versus the proportion of Mexican American families among all families at each annual family income value (y-axis). Include a horizontal reference line whose y value equals the proportion of Mexican American families among all families.
Are Mexican American families over- or under-represented in poverty? What else do you notice about the chart?
Problem 3
Continue with the above data frame. Select Other Hispanic families for observation.
Then create a bar chart of the annual family income value (x-axis) versus the proportion of Other Hispanic families among all families at each annual family income value (y-axis). Include a horizontal reference line whose y value equals the proportion of Other Hispanic families among all families.
Are Other Hispanic families over- or under-represented in poverty? What else do you notice about the chart?