Homework #1: Computational Portion
All computations should be done in this notebook using the R kernel. This is your first opportunity to get familiar with R outside of class, so please take your time on the problems that require it. Working in small groups is allowed, but it is important that you make an effort to master the material and hand in your own work.
You will be required to submit this notebook (.ipynb) file, fully compiled with your solutions, and an HTML file to Canvas by 11:59pm on Wednesday, January 29.
Read and sign the Honor Code Pledge below:
Honor Code Pledge: On my honor, as a University of Colorado Boulder student, I have neither given nor received unauthorized assistance on this work.
TYPE YOUR NAME BELOW:
TYPE YOUR NAME HERE
Problem 1
(a) Load the adm.txt file into R (this file can be found in Canvas under the “Data” Module).
The dataset contains information from six departments at a university. The variables are:
1. dpt = department name (A through F)
2. app = the number of applications received by a department’s graduate program
3. adm = the number of applications accepted by a department’s graduate program
4. sex = sex of the applicants (m = male; f = female)
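As a rough illustration of how a table with these four columns can be subset and summarized, here is a minimal sketch. The data frame `toy` and every count in it are invented for illustration; they are not the values in adm.txt, and the real file's layout may differ.

```r
# Toy data with the same four columns as adm.txt. The counts are
# hypothetical and are NOT the values in the real file.
toy <- data.frame(
  dpt = c("A", "A", "B", "B"),
  sex = c("m", "f", "m", "f"),
  app = c(100, 20, 40, 80),
  adm = c(60, 15, 10, 24)
)

# Rows for a single department
toy[toy$dpt == "B", ]

# Row-wise admission rate, then the rows below a threshold
toy$rate <- toy$adm / toy$app
toy[toy$rate < 0.40, ]

# Pooled admission rate by sex: total admitted / total applications
totals <- aggregate(cbind(app, adm) ~ sex, data = toy, FUN = sum)
totals$rate <- totals$adm / totals$app
totals
```

Note that a pooled rate divides summed admissions by summed applications; averaging the per-department rates would weight small and large departments equally and generally gives a different number.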
In [ ]:
(b) Write code that extracts and prints the data from department B only.
In [ ]:
(c) Write code that prints departments with an admission rate lower than 40%.
In [ ]:
(d) What is the admission rate for all males at this university (across all departments)?
In [ ]:
(d) What is the admission rate for all females at this university (across all departments)?
In [ ]:
(e) Is there a large discrepancy? Could this be evidence of discrimination against female applicants?
(f) What is the admission rate for males and, separately, for females at this university, conditioned on department? Create a data frame with labeled columns in the following order: Department, Female, Male.
In [ ]:
(g) What do you notice about these results? Are they in tension with the result from (d)?
Problem 2
Verify the results of theoretical question 3, parts (c) and (d), by simulating 50 data points $x_1, \ldots, x_{50}$ with a mean of 5 and a standard deviation of 1 using ${\tt rnorm()}$ and then performing the relevant computations. Include #comments in your code to explain what you are doing.
In [ ]:
In part (c) above we show that taking the difference between a sample and that sample’s mean yields a new sample of the same size with the same standard deviation and variance as the original sample.
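One way to see why centering leaves the spread unchanged, assuming the usual sample variance with an $n-1$ denominator (the one ${\tt var()}$ computes): let $y_i = x_i - \bar{x}$. Then $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, so

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = s_x^2,$$

and taking square roots gives $s_y = s_x$.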
In [ ]:
In part (d) we show that dividing the difference of the sample and its mean by the sample’s standard deviation gives us a new sample of the same size with standard deviation and variance equal to 1. Note also that this new sample has a mean of approximately 0 (it may not be exactly 0 because of floating-point error).
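A minimal sketch of such a numerical check follows. The seed and the object names (`x`, `centered`, `standardized`) are assumptions chosen for illustration, not part of the assignment.

```r
# Illustrative check; the seed and object names are arbitrary choices.
set.seed(1)
x <- rnorm(50, mean = 5, sd = 1)    # 50 draws from a normal with mean 5, sd 1

centered <- x - mean(x)             # subtract the sample mean
standardized <- centered / sd(x)    # then divide by the sample sd

# Centering moves the mean to 0 but leaves the spread unchanged.
c(sd_original = sd(x), sd_centered = sd(centered))
c(var_original = var(x), var_centered = var(centered))

# Standardizing rescales the spread to 1.
c(mean = mean(standardized), sd = sd(standardized), var = var(standardized))

# Compare with a tolerance rather than ==, since floating-point
# arithmetic can leave tiny differences.
all.equal(sd(centered), sd(x))
all.equal(sd(standardized), 1)
```

Using ${\tt all.equal()}$ instead of `==` is a deliberate choice here: the quantities agree mathematically but may differ in the last few bits numerically.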
Problem 3
(a) The data frame VIT2005 in the PASWR2 package contains descriptive information and the appraised total price (in euros) for apartments in Vitoria, Spain. Load the data vit2005.txt into R (this file can be found in Canvas under the “Data” Module).
Here are descriptions of the variables:
1. totalprice (the market total price (in Euros) of the apartment including garage(s) and storage room(s))
2. area (the total living area of the apartment in square meters)
3. zone (a factor indicating the neighborhood where the apartment is located with levels Z11, Z21, Z31, Z32, Z34, Z35, Z36, Z37, Z38, Z41, Z42, Z43, Z44, Z45, Z46, Z47, Z48, Z49, Z52, Z53, Z56, Z61, and Z62)
4. category (a factor indicating the condition of the apartment with levels 2A, 2B, 3A, 3B, 4A, 4B, and 5A ordered so that 2A is the best and 5A is the worst)
5. age (age of the apartment in years)
6. floor (floor on which the apartment is located)
7. rooms (total number of rooms including bedrooms, dining room, and kitchen)
8. out (a factor indicating the percent of the apartment exposed to the elements: The levels E100, E75, E50, and E25, correspond to complete exposure, 75% exposure, 50% exposure, and 25% exposure, respectively.)
9. conservation (an ordered factor indicating the state of conservation of the apartment. The levels 1A, 2A, 2B, and 3A are ordered from best to worst conservation.)
10. toilets (the number of bathrooms)
11. garage (the number of garages)
12. elevator (indicates the absence (0) or presence (1) of elevators.)
13. streetcategory (an ordered factor from best to worst indicating the category of the street with levels S2, S3, S4, and S5)
14. heating (a factor indicating the type of heating with levels 1A, 3A, 3B, and 4A which correspond to: no heating, low-standard private heating, high-standard private heating, and central heating, respectively.)
15. storage (the number of storage rooms outside of the apartment)
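To make the variable types above concrete, here is a small sketch of how numeric columns and a factor behave under base R's summary and plotting functions. The data frame `toy_vit` and all of its values are invented for illustration; only the column names and factor levels come from the descriptions above.

```r
# Toy stand-in for the apartment data; all values are hypothetical.
toy_vit <- data.frame(
  totalprice = c(228000, 409000, 200000, 315000, 274000, 350000),
  area       = c(75, 100, 88, 62, 90, 110),
  toilets    = c(1, 2, 1, 1, 2, 2),
  out        = factor(c("E100", "E50", "E50", "E100", "E50", "E25"),
                      levels = c("E100", "E75", "E50", "E25"))
)

summary(toy_vit)                    # numerical summary of every column

counts <- table(toy_vit$out)        # frequency table, one entry per level
counts
barplot(counts, main = "Apartments by Exposure",
        xlab = "out", ylab = "Count")

# Formula interface: response ~ grouping variable
boxplot(totalprice ~ toilets, data = toy_vit,
        main = "Total Price by Number of Bathrooms",
        xlab = "Bathrooms", ylab = "Total price (euros)")
```

Declaring the factor levels explicitly keeps unused levels (here E75) in the frequency table with a count of zero, which makes gaps in the data visible.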
In [ ]:
(b) Explore the data by providing a numerical summary of all variables.
In [ ]:
(c) Create a frequency table, a pie chart, and a barplot showing the numbers of apartments grouped by the variable ${\tt out}$. In your view, which method conveys the information best? (See ${\tt table()}$, ${\tt pie()}$ and ${\tt barplot()}$.)
In [ ]:
(d) Characterize the distribution of the variable ${\tt totalprice}$ using ${\tt hist()}$ and ${\tt boxplot()}$. Is the distribution (i) symmetric/skewed, or (ii) unimodal/multimodal? (iii) Does it have any outliers?
In [ ]:
(e) Characterize the relationship between ${\tt totalprice}$ and ${\tt area}$ using ${\tt plot()}$. Don’t forget to title your plot!
In [ ]:
(f) Create a boxplot of ${\tt totalprice}$ conditioned on ${\tt toilets}$. Are there any outliers? Does there appear to be a large difference in total price between apartments with one bathroom and apartments with two bathrooms? Title the boxplot “Boxplot of Total Price By Number of Bathrooms”.
In [ ]:
(g) Explore the relationships in at least three variable pairs involving ${\tt totalprice}$ (not explored above) using the appropriate graphical summaries. Provide the plots and written interpretations! Don’t forget to title and label all your plots.
In [ ]: