---
title: "Intermediate R Exercises"
author: ""
date: ""
output: html_document
---
```{r setup, include=FALSE}
# Chunk 1. Generated knit setting
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
*Reminder: This is an Individual Self-Assessment.*
Goals:
1. Assess your knowledge of DataCamp’s Intermediate R
2. Prepare for MQM Coding Pre-Requisites Final Exam
3. Prepare R markdown documents
4. Read files
5. Learn good programming techniques (step-wise approach, checking your calculations, not using breaks in loops, using < and ==, etc.)
6. Handle data larger than you can see and remember
7. Find help to learn new things
8. Remind you of cheat sheets.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
*2. Read nycflights into a dataframe flights*
*Note: Numbering corresponds to chunk numbers. Chunk #1 specified knitting*
*Hint: read.csv or read_csv*
*Tip: Normally, you should check the file before you read it in. I skipped that because we are focusing on a narrow subset of the nycflights. However, please make sure that you save nycflights.csv in the same folder/directory as this file.*
```{r}
```
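A minimal sketch of the `read.csv` hint above, not the exercise answer. It writes a tiny made-up CSV to a temporary file and reads it back; the file contents and the name `toy` are hypothetical stand-ins. For the exercise you would pass `"nycflights.csv"` instead.

```r
# Hypothetical toy data; the real exercise reads nycflights.csv instead.
tmp <- tempfile(fileext = ".csv")
writeLines(c("month,dep_delay", "1,5", "3,NA", "12,-2"), tmp)

toy <- read.csv(tmp)  # base R; readr::read_csv is the tidyverse alternative
toy                   # the text NA is read as a missing value by default
unlink(tmp)           # clean up the temporary file
```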
*3. Check the type and structure of flights before proceeding.*
*Tip: Always get familiar with the data before you start working on it. The first steps are usually to check the type and structure.*
```{r}
```
*4. Print the first 7 rows of flights.*
*Tip: str() is great but it is not so good at visualizing data*
*Hint: head() doesn’t require you to specify the number of rows, but you can specify them using the optional parameter "n = ". Google head in R if you need help.*
```{r}
```
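As a rough illustration of the inspection steps in the last two exercises, the sketch below runs `class()`, `str()`, and `head()` on a small made-up data frame. The name `toy` and its values are assumptions used only so the snippet runs on its own.

```r
# Hypothetical toy data frame standing in for flights.
toy <- data.frame(month = c(1, 3, 12, 7), dep_delay = c(5, NA, -2, 30))

class(toy)        # the type of the object
str(toy)          # compact structure: column names, types, first values
head(toy, n = 2)  # first 2 rows; n defaults to 6 when omitted
```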
*5. Now, print the (statistical) summary of flights*
```{r}
```
*6. Now, let’s analyze dep_delay column of flights by plotting it.*
*Tip: I always recommend graphically visualizing data before proceeding. ggplot2 is widely used in the data science world and is also my favorite visualization package. How can you not love something called ggplot! We won’t explicitly test you on ggplot, but learning the syntax/functionality of new packages is a key component of coding.*
*Hint: Load ggplot2 and make sure that it is available using the search() function.*
```{r}
```
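A small sketch of the `search()` check described in the hint above. It attaches the built-in `stats` package purely as a stand-in (an assumption, so the snippet runs without installing anything). For the exercise you would attach `ggplot2` and look for `"package:ggplot2"` instead.

```r
# 'stats' ships with R and is used here only as a stand-in for ggplot2.
library(stats)

search()                       # lists attached packages and environments
"package:stats" %in% search()  # TRUE once the package is attached
```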
*7. Now, use qplot to plot distance (x-axis) and dep_delay (y-axis).*
*Tip: If you were solving a real problem, I would recommend plotting several visualizations to guide you (such as origin, destination, etc.). However, we will focus on just a few columns in this exercise.*
```{r}
```
*8. Print the number of missing values in dep_delay and distance to locate the missing data.*
*Tip: The warning “Removed 8255 rows…” indicates missing data. This warning lets us know that we should be careful with missing values. Practically, this means that we will check with is.na(), which will tell us about the missing data. Remember this when you work on dep_delay (after confirming that that’s where the missing values are located).*
```{r}
```
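One way to count missing values, sketched on a made-up vector (the name `delays` and its values are hypothetical). The same pattern should apply to a data frame column.

```r
# Hypothetical toy vector; NA marks a missing value.
delays <- c(5, NA, -2, NA, 30)

is.na(delays)       # logical vector: TRUE where the value is missing
sum(is.na(delays))  # TRUE counts as 1, so this is the number of NAs
```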
*9. Compute the average dep_delay.*
*Tip: You know that dep_delay has several NA values.*
*Hint: See https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean to handle NA values.*
```{r}
```
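A brief sketch of how `mean()` treats missing values, using a made-up vector (hypothetical data, not the flights column).

```r
# Hypothetical toy vector with two missing values.
delays <- c(5, NA, -2, NA, 30)

mean(delays)                # NA, because missing values propagate
mean(delays, na.rm = TRUE)  # drops the NAs first: (5 - 2 + 30) / 3 = 11
```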
*10. Recompute average dep_delay to get the correct answer.*
*Tip: You likely overlooked that the calculations are inaccurate due to the impact of the flights departing early.*
*Hint: Find the indices for flightsDelayed, flightsOnTime, and flightsEarly. The average of flights[flightsDelayed,"dep_delay"] should be positive. The average of flights[flightsOnTime,"dep_delay"] should be 0. The average of flights[flightsEarly,"dep_delay"] should be negative.*
```{r}
```
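The index idea in the hint above can be sketched with `which()` on a made-up vector. The names `delayedIdx`, `onTimeIdx`, and `earlyIdx` are hypothetical, and this is only one possible way to build the indices.

```r
# Hypothetical toy vector: positive = late, 0 = on time, negative = early.
delays <- c(5, NA, -2, 0, 30, -7)

# which() returns the positions where the condition is TRUE and skips NA.
delayedIdx <- which(delays > 0)
onTimeIdx  <- which(delays == 0)
earlyIdx   <- which(delays < 0)

mean(delays[delayedIdx])  # positive
mean(delays[onTimeIdx])   # 0
mean(delays[earlyIdx])    # negative
```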
*11. Now, check out the number of flights scheduled during certain time periods.*
*Compute a logical vector flightsMarchIndices that is TRUE for flights in March, and FALSE otherwise. Print the structure of flightsMarchIndices. (If you print the entire vector, it will print about 100 rows).*
*Tip: Check your answer by counting the number of flights in March (should be about 29000).*
*Hint: TRUE equates to 1 and FALSE equates to 0.*
```{r}
```
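To illustrate the hint that TRUE counts as 1, here is a sketch on a made-up character vector. The name `origins` and its values are assumptions and are deliberately unrelated to the month column.

```r
# Hypothetical toy data.
origins <- c("JFK", "LGA", "JFK", "EWR")

isJFK <- origins == "JFK"  # logical vector, one element per entry
str(isJFK)                 # structure only, instead of printing every element
sum(isJFK)                 # TRUE counts as 1, so this counts the matches
```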
*12. Compute a logical vector flightsQ1Indices that contains TRUE for all flights in Q1 (January, February, March) and FALSE otherwise. Print the structure of flightsQ1Indices (because otherwise you will print about 100 rows). Try to do that using just one relational operator.*
*Check your answer by confirming the fraction of flights in Q1 (should be about 0.24). Recall that TRUE equates to 1 and FALSE equates to 0 (so you can compute the average of this vector).*
```{r}
```
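Because TRUE is 1 and FALSE is 0, the mean of a logical vector is the fraction of TRUE values. A sketch on made-up data (the name `temps` and its values are hypothetical):

```r
# Hypothetical toy data, unrelated to flights.
temps <- c(-3, 4, 10, -1, 25, 0, 7, -8)

belowZero <- temps < 0
mean(belowZero)  # fraction of TRUE values: 3 of 8 = 0.375
```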
*13. Compute a logical vector flightsStartEndIndices that contains TRUE for the first 7 days and the last 7 days of the year (and FALSE otherwise).*
*Print the structure of flightsStartEndIndices and the fraction of flights with indices in flightsStartEndIndices.*
```{r}
```
*14. Assuming the same number of flights per day throughout the year (i.e. a uniform distribution), what fraction of flights do you expect in flightsStartEndIndices in 2013?*
*Tip: These are 14 days out of a year with 365 days.*
```{r}
```
*15. How do you explain the difference between the fraction of flights expected based on uniform distribution and the data?*
*Hint: No R Code required for this answer. (Use # to indicate that so R can ignore your text.)*
```{r}
```
*16. Using conditional flow and loops, write a chunk that computes and prints superlate, late, and notlate.*
*The flights are superlate if they are more than 8 hours late, late if they are up to 8 hours late, and notlate if they are early or not late.*
*Tip: I did this in several steps:*
*1. I counted the rows in flights using counter*

    # counter = 0
    # for (index in 1:nrow(flights)) {
    #   counter = counter + 1
    # }
    # counter

*2. I counted the number of NA (not available) values (because I recalled that dep_delay had missing values).*
*3. I counted the superlate.*
*4. I counted the late and notlate.*
*5. I added all values to make sure that I counted all flights.*
*6. I would comment out the extra code after my project is final (but not now).*
*Tip: I used (nalate + superlate + late + notlate == counter) instead of (counter == nalate + superlate + late + notlate). Why? If I make a mistake using = in place of == then R will tell me about the error instead of letting my error go undetected.*
```{r}
```
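The step-wise pattern described above (count everything, handle NA first, classify with if/else, then check that the counts add up) can be sketched on made-up data. The vector `temps`, the cutoffs, and the category names are hypothetical and deliberately unrelated to the flight delays.

```r
# Hypothetical toy data, unrelated to flights.
temps <- c(31, NA, 12, -4, 25, NA, 18)

counter <- 0
missing <- 0
hot <- 0
mild <- 0
cold <- 0

for (index in seq_along(temps)) {
  counter <- counter + 1
  value <- temps[index]
  if (is.na(value)) {        # test for NA first: comparisons with NA give NA
    missing <- missing + 1
  } else if (value > 24) {
    hot <- hot + 1
  } else if (value > 0) {
    mild <- mild + 1
  } else {
    cold <- cold + 1
  }
}

c(hot = hot, mild = mild, cold = cold)
missing + hot + mild + cold == counter  # sanity check: every element counted once
```

Testing for NA before any comparison matters here, because `if (NA > 24)` stops with an error rather than quietly skipping the value.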
*17. Now, write a function lateCounts based on your code above to classify late flights into superlate, late, and notlate.*
*Your function should not have any parameters. It should return a vector of three numbers in the following order: (superlate, late, notlate).*
*Your function should return (-1, -1, -1) if you detect any error(s).*
*Tip: Don’t forget to call lateCounts() to check if your answer remains the same as the previous chunk.*
```{r}
# Do not need to edit line below here until the next comment
lateCounts <- function() {
# Do not need to edit line above here until the previous comment
# Your code goes here
# Do not need to edit line below here until the next comment
if (nalate + superlate + late + notlate == counter) {
return (c(superlate, late, notlate))
} else {
return (c(-1,-1,-1))
}
}
# Do not need to edit line above here until the previous comment
# Do not edit the line below to test lateCounts
lateCounts()
```
*18. Define a function lateCountsPlus() based on lateCounts() that accepts a required parameter (delays, a numeric vector) and two optional parameters (cutoffs called superlateLimit and lateLimit).*
*The default values of superlateLimit and lateLimit should be set to 480 and 0, respectively.*
*Your function should return (-1, -1, -1) if you detect any error(s).*
*Tip: I will help you break down this complicated task into smaller steps. This is known as stepwise refinement in computing.*
*Hints (based on how I solved this):*
*Step 1: Add a parameter delays (that corresponds to the dep_delay column in flights).*
*Step 2: Add optional parameters superlateLimit and lateLimit, defaulted to 480 and 0, respectively.*
*Step 3: Change any code to adapt to the parameters.*
*Step 4: Start testing!*
*Step 4.1: Call the function setting only delays (to the flights dataframe's dep_delay column). Do not set superlateLimit and lateLimit (so they are defaulted to 480 and 0, respectively). Reconcile with your result above.*
*Step 4.2: Call the function setting all three parameters. (Choose any reasonable values for superlateLimit and lateLimit.)*
*Step 4.3: Call the function setting delays and superlateLimit (Tip: use the same value of superlateLimit as Step 4.2 so you can double check your code. This is known as cross-validation and is a similar concept to cross-examination of witnesses in courts).*
*Step 4.4: Call the function setting delays and lateLimit (Tip: use the same value of lateLimit as Step 4.2 to cross-validate).*
*Step 5. Repeat Step 4 for a different set of values. (In real life, you'll do much more extensive testing before finalizing your work.)*
```{r}
# Do not need to edit line below here until the next comment
lateCountsPlus <- function(delays, superlateLimit = 480, lateLimit = 0) {
# Do not need to edit line above here until the previous comment
# Your code goes here
# Do not need to edit line below here until the next comment
if (nalate + superlate + late + notlate == counter) {
return (c(superlate, late, notlate))
} else {
return (c(-1,-1,-1))
}
}
# Do not need to edit line above here until the previous comment
# Do not edit the line below to test lateCountsPlus
lateCountsPlus(flights[,"dep_delay"])
#Add 4 more sample calls. 2 with both default values and 2 with one of each default parameters.
```
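A sketch of how optional parameters with defaults behave when some, all, or none of them are set by name. The function `countBetween` and its parameters are hypothetical and are not the exercise solution.

```r
# Hypothetical helper: counts values in the interval (lower, upper], ignoring NA.
countBetween <- function(values, upper = 10, lower = 0) {
  sum(values > lower & values <= upper, na.rm = TRUE)
}

x <- c(-3, 0, 4, 10, 25, NA)

countBetween(x)                          # both defaults used
countBetween(x, upper = 5)               # only upper overridden
countBetween(x, lower = -5)              # only lower overridden
countBetween(x, upper = 30, lower = 5)   # both overridden
```

Naming the argument in the call (`lower = -5`) is what lets you skip an earlier optional parameter and still leave it at its default.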
*19. Now, write a function lateCount1 (the name used in the exercises below) that takes a numeric variable delay and classifies it based on optional parameters superlateLimit (default 480) and lateLimit (default 0), using the same logic as above for the cutoffs. This time your output should be "N/A", "superlate", "late", or "notlate", respectively.*
```{r}
# Your function goes here
# After completing this function, test it on dep_delay for rows 1, 4, 152, and 839 to confirm that your function works correctly.
# Your answers should be "N/A", "superlate", "late", "notlate", respectively.
```
*20. Now, compute lateFlightsS by using sapply to apply lateCount1 on the dep_delay column (of flights), without setting any optional parameters.*
*Print your result as a vector containing the number of nalate, superlate, late, and notlate, respectively.*
```{r}
```
*21. Now, compute lateFlightsL by using lapply to apply lateCount1 on the dep_delay column (of flights), without setting any optional parameters. Print your result as a vector containing the number of nalate, superlate, late, and notlate, respectively.*
```{r}
```
*22. Print the types of lateFlightsS and lateFlightsL.*
```{r}
```
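For comparison, here is a sketch of `sapply` and `lapply` applied to the same made-up input. The classifier `signLabel` and the vector `values` are hypothetical stand-ins.

```r
# Hypothetical classifier used only for illustration.
signLabel <- function(x) {
  if (is.na(x)) "N/A" else if (x > 0) "pos" else "nonpos"
}

values <- c(2, NA, -1, 5)

asVector <- sapply(values, signLabel)
asList   <- lapply(values, signLabel)

class(asVector)  # "character"
class(asList)    # "list"
table(asVector)  # counts per label
```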
*23. Explain your result above.*
*Hint: No R Code required for this answer. (Use # to indicate that so R can ignore your text.)*
```{r}
```
*24. Now, compute lateFlightsV by using vapply to apply lateCount1 on the dep_delay column (of flights), without setting any optional parameters.*
*Print your result as a vector containing the number of nalate, superlate, late, and notlate, respectively.*
*Print the type of lateFlightsV.*
*Hint: Choose an appropriate type for vapply.*
```{r}
```
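A sketch of how `vapply` differs: you declare the expected type and length of each result through `FUN.VALUE`, and R stops with an error if a result does not match. The function `signLabel` and the vector `values` are hypothetical.

```r
# Hypothetical classifier used only for illustration.
signLabel <- function(x) {
  if (is.na(x)) "N/A" else if (x > 0) "pos" else "nonpos"
}

values <- c(2, NA, -1)

# Each call must return exactly one character string.
labels <- vapply(values, signLabel, FUN.VALUE = character(1))
labels
class(labels)  # "character"
```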
*25. Now, recompute lateFlightsS, lateFlightsL, and lateFlightsV by using sapply, lapply, and vapply, respectively, to apply lateCount1 on the dep_delay column (of flights).*
*Set superlateLimit = 120 and lateLimit = 60.*
*Print your result as a vector containing the number of nalate, superlate, late, and notlate, respectively.*
*Hint: Choose an appropriate type for vapply.*
```{r}
```
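Extra named arguments given to `sapply`, `lapply`, or `vapply` are passed on to the function being applied. A sketch with a hypothetical function `aboveCutoff` on made-up data:

```r
# Hypothetical function with one optional parameter.
aboveCutoff <- function(x, cutoff = 0) x > cutoff

values <- c(2, 7, -1)

sapply(values, aboveCutoff)                  # uses the default cutoff of 0
viaS <- sapply(values, aboveCutoff, cutoff = 5)
viaL <- lapply(values, aboveCutoff, cutoff = 5)
viaV <- vapply(values, aboveCutoff, FUN.VALUE = logical(1), cutoff = 5)

viaS  # FALSE TRUE FALSE
```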
*26. Run the following code chunk*
```{r}
#Do not edit this code chunk.
paste("Departure Reported =", flights[5,"dep_time"])
paste("Departure Scheduled =", flights[5,"sched_dep_time"])
paste("Delay Reported =", flights[5,"dep_delay"])
paste("Delay Calculated =", flights[5,"dep_time"] - flights[5,"sched_dep_time"])
```
*27. Why is there a difference between the delay values in "Delay Reported" and "Delay Calculated"?*
*Hint: No R Code required for this answer. (Use # to indicate that so R can ignore your text.)*
```{r}
```
*28. Add a new column called Date to flights dataframe to store the date based on the year, month, and day columns.*
*Print the first ten unique values of the Date column (of flights).*
*Print the class of the Date column (of flights).*
*Tip: paste(yyyy, mm, dd, sep = "-") will give you a string in the yyyy-mm-dd format.*
*Tip: as.Date("2013-01-01") will store "2013-01-01" as a Date data type.*
*Hint: See http://www.datasciencemadesimple.com/unique-function-in-r/ for help on unique. (Reminder: I will test for your memorization in the final exam.)*
```{r}
```
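The two tips above can be combined as in the sketch below, which uses made-up year, month, and day vectors (hypothetical values, not the flights columns).

```r
# Hypothetical toy data: three rows, two of them on the same day.
years  <- c(2013, 2013, 2013)
months <- c(1, 1, 2)
days   <- c(5, 5, 14)

dateText <- paste(years, months, days, sep = "-")  # "2013-1-5" "2013-1-5" "2013-2-14"
dates <- as.Date(dateText)                         # parsed as year-month-day

unique(dates)  # duplicates removed
class(dates)   # "Date"
```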
*29. Check your answer against time_hour field using the following steps:*
*Step 1. difftime(datetime1,datetime2, units = "hours") will give you difference between datetime1 and datetime2 in hours.*
*Step 2. Check the max, mean, median, and min of the result of Step 1 above to confirm that the two variables, Date and time_hour, are within about a day of each other.*
```{r}
```
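A sketch of the `difftime` check on made-up timestamps. The names `start`, `end`, and `gapHours` are hypothetical, and the time zone is fixed to UTC here as an assumption so the result does not depend on the local clock.

```r
# Hypothetical timestamps, fixed to UTC for reproducibility.
start <- as.POSIXct("2013-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct(c("2013-01-01 05:30:00", "2013-01-01 23:00:00"), tz = "UTC")

# Differences in hours, converted to plain numbers for easy summarizing.
gapHours <- as.numeric(difftime(end, start, units = "hours"))

c(min = min(gapHours), median = median(gapHours),
  mean = mean(gapHours), max = max(gapHours))
```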
*30. Run the following lines of code and see if you understand the pattern.*
```{r}
#Do not edit this code chunk.
LETTERS[1]
LETTERS[2]
LETTERS[3]
LETTERS[4]
```
*31. Generate the sequence of numbers from 1 to 26.*
```{r}
```
*32. Generate the sequence of characters from "A" to "Z".*
```{r}
```
*33. Write one line of code that would count the number of occurrences of letter "A" in flights$dest.*
```{r}
```
*34. Write a for loop to go through all letters and print out the number of occurrences of each letter in dest column in flights.*
*Hint: print(paste("X", N)) prints "X N" in one line (where X is a character and N can be converted into a character).*
```{r}
```
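One possible reading of "occurrences of a letter" is to count individual characters across all strings. The sketch below takes that reading on a made-up character vector; `dest` and `chars` are hypothetical, and the loop covers only the first three letters to keep the output short.

```r
# Hypothetical toy data: three airport codes stored as character strings.
dest <- c("ATL", "LAX", "BOS")

chars <- unlist(strsplit(dest, ""))  # one element per character
sum(chars == "A")                    # occurrences of "A" across all codes

for (letter in LETTERS[1:3]) {
  print(paste(letter, sum(chars == letter)))
}
```

If the column were stored as a factor instead of character, it would need `as.character()` before `strsplit()`.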
*35. Knit to html after eliminating all the errors. Do not worry about minor formatting issues.* *Tip: This will take some time as you are processing medium size data sets.*