STA 2300: Introduction to Data Science Lecture 13: Basic Programming
Today’s Topics
The which() Command
Finder Commands
The for() Loop
The which() Command
So far, we have filtered datasets in multiple ways. We have used logical TRUE/FALSE vectors with either the subset() command or square brackets. To review, here are some examples:

#Reading in housing sale dataset
sales2019 <- read.csv(file="boulder-2019-residential_sales.csv", header = T,
                      stringsAsFactors = FALSE)

#Getting rid of the commas and dollar signs
sales2019$SALE_PRICE <- gsub(",", "", sales2019$SALE_PRICE)
sales2019$SALE_PRICE <- as.numeric(gsub("\\$", "", sales2019$SALE_PRICE))

#(1) Using the `subset` command to select the data associated with specific cities
city1 <- subset(sales2019, sales2019$CITY == "BOULDER" | sales2019$CITY == "LONGMONT" |
                  sales2019$CITY == "LAFAYETTE",
                select = c(SALE_PRICE, CITY))

#(2) Using a logical T/F sequence to select rows
top_cities <- (sales2019$CITY == "BOULDER") | (sales2019$CITY == "LONGMONT") |
  (sales2019$CITY == "LAFAYETTE")
city2 <- sales2019[top_cities, ]

Another approach that has been touched upon is to use the which() command to identify the row numbers of interest. The main difference between using which() and just selecting the rows that satisfy a certain criterion is that which() returns the row numbers instead of a TRUE/FALSE sequence. These row numbers are often referred to as the index of the rows. The benefit here is that it makes it a bit easier to perform the coding in two steps, and it makes it easier to verify that the filtering performed as you intended.

#(3) Using which() to identify row numbers
top_index <- which((sales2019$CITY == "BOULDER") | (sales2019$CITY == "LONGMONT") |
                     (sales2019$CITY == "LAFAYETTE"))
head(sales2019$CITY, 10)
# [1] "LAFAYETTE"    "BRECKENRIDGE" "BOULDER"      "BOULDER"      "LONGMONT"
# [6] "LONGMONT"     "LONGMONT"     "LONGMONT"     "ERIE"         "LAFAYETTE"
city3 <- sales2019[top_index, ]
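To see the difference on something small, here is a toy sketch (made-up prices, not the housing data) contrasting the logical T/F approach with which():

```r
# Toy vector of five made-up sale prices
prices <- c(250, 510, 480, 95, 720)

high <- prices > 400          # logical TRUE/FALSE sequence
idx  <- which(prices > 400)   # which() returns the indices (positions) instead
idx
# [1] 2 3 5

prices[high]                  # both forms select the same elements
# [1] 510 480 720
prices[idx]
# [1] 510 480 720
```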
Finder Commands
Related to the which() command are several commands that make finding and identifying unusual or special values very easy. Let’s see how each of these works.
• identify()
• locator()
• which.max() and which.min()
# identify() is interactive. You click your mouse on one or more points,
# and when done, click escape. Then, the row numbers/index of those points are shown.
plot(sales2019$ABOVE_GROUND_SQFT, sales2019$SALE_PRICE)
#special <- identify(sales2019$ABOVE_GROUND_SQFT, sales2019$SALE_PRICE)
#sales2019[special,]
# locator() is interactive, but must be applied to an existing plot.
# You click your mouse on one or more points, and when done, click escape.
# Then, the x/y coordinates of those points are shown.
plot(sales2019$ABOVE_GROUND_SQFT, sales2019$SALE_PRICE, pch = 19, col = rgb(0, 0, 0, 0.1))
#(xy_special <- locator())
# which.max() and which.min() return the index of the observation with either
# the largest or smallest value in the set.
which.max(sales2019$SALE_PRICE)
# [1] 1134
which.max(sales2019) #Thus, can only be applied to one numeric variable at a time
# Error in which.max(sales2019): 'list' object cannot be coerced to type 'double'
which.min(sales2019$SALE_PRICE)
# [1] 785
# sales2019[785,]

low_price <- which(sales2019$SALE_PRICE <= 5)
# sales2019[low_price, ]
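A small toy example (made-up values) shows how which.max() and which.min() behave, including how ties are broken in favor of the first occurrence:

```r
x <- c(4, 9, 2, 9)
which.max(x)     # index of the largest value; ties go to the first occurrence
# [1] 2
which.min(x)
# [1] 3
x[which.max(x)]  # recover the maximum value itself
# [1] 9
```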
The for() Loop
for() loops can be very helpful when you need to perform a task repeatedly. They exist in most programming languages, but in R, the general thinking is to try to code without a loop, if possible, because loops can be slower than other options. However, our computations will not be time-consuming, and learning to program loops is a useful skill.
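As a quick illustration of that point, many loop computations in R can be replaced by a single vectorized expression; a toy sketch:

```r
# Loop version: square the numbers 1 through 10
squares <- numeric(10)
for(i in 1:10){
  squares[i] <- i^2
}

# Vectorized version: no loop needed, and typically faster
squares_vec <- (1:10)^2

identical(squares, squares_vec)
# [1] TRUE
```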
Here is how we may visualize a loop conceptually.
Figure 1: Visual representation of a for() loop.
Here are some simple for() loops to help see the syntax in R. Important components are:
• The condition is inside the parentheses after the for.
• The indexer or counter, oftentimes an i, j, or k is used.
• The curly braces begin and end the loop.
• The code in the body of the loop is executed on each trip.
• The indexer/counter automatically advances to the next value in the sequence after each trip (for 1:10, that means increasing by one).
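One nuance worth noting: in R, the counter steps through the values of whatever vector follows in, so it does not even have to be numeric. A quick toy sketch:

```r
# The counter takes each value of the supplied vector in turn
for(city in c("BOULDER", "LONGMONT", "LAFAYETTE")){
  print(city)
}
# [1] "BOULDER"
# [1] "LONGMONT"
# [1] "LAFAYETTE"
```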
for(i in 1:10){
  print(i)
}
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# [1] 5
# [1] 6
# [1] 7
# [1] 8
# [1] 9
# [1] 10
messages <- c("Happy Birthday", "Happy New Year", "Congratulations", "Super Job!",
              "Get Well Soon")
for(k in 1:length(messages)){
print(messages[k])
}
# [1] "Happy Birthday"
# [1] "Happy New Year"
# [1] "Congratulations"
# [1] "Super Job!"
# [1] "Get Well Soon"
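A side note on the 1:length(messages) pattern above: if the vector were ever empty, 1:length(x) would run backwards over c(1, 0), so seq_along() is the safer idiom:

```r
empty <- character(0)
1:length(empty)      # counts down: 1 0 -- a loop over this would run twice!
# [1] 1 0
seq_along(empty)     # zero-length index: the loop body never runs
# integer(0)

for(k in seq_along(empty)){
  print(empty[k])    # never executes
}
```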
In the next examples, you will see for() loops used in plotting, data wrangling, and in uncertainty analysis.
Example, Plotting Data: Recall that in a prior exercise, you plotted the daily percent full of reservoirs for the years 2016 to 2020. You may have done it like this:
www <- "https://www.waterdatafortexas.org/reservoirs/statewide.csv"
water <- read.csv(file=www, header=T, skip=29)
suppressMessages(library(lubridate))
plot(1:365, seq(50, 100, len=365), type = "n", xlab = "", ylab = "")
water_year <- year(water$date)
lines(1:366, water$percent_full[water_year == 2016], col = 1)
lines(1:365, water$percent_full[water_year == 2017], col = 2)
lines(1:365, water$percent_full[water_year == 2018], col = 3)
lines(1:365, water$percent_full[water_year == 2019], col = 4)
lines(1:366, water$percent_full[water_year == 2020], col = 5)
However, those lines of code are very repetitive, and very little changes from one line to the next. Furthermore, if you wanted to plot the last 50 years, then the above approach would not be very efficient. This is a perfect situation in which to use a loop.
plot(1:365, seq(50, 100, len=365), type = "n", xlab = "", ylab = "")
water_year <- year(water$date)
for(i in 2016:2020){
one_year <- water[water_year == i, ]
num_days <- nrow(one_year)
lines(1:num_days, one_year$percent_full, col = 2021-i)
}
Now, we may want to extend and plot the last 50 years, and it would be nice if the colors would change from one year to the next in a sensible way.
suppressMessages(library(fields))
suppressMessages(library(viridis))

color_by_year <- color.scale(1971:2020, col = viridis(50))
plot(1:365, seq(50, 100, len=365), type = "n", xlab = "", ylab = "")
for(i in 1971:2020){
one_year <- water[water_year == i, ]
num_days <- nrow(one_year)
#lines(1:num_days, one_year$percent_full, col = i)
  #lines(1:num_days, one_year$percent_full, col = color_by_year[i]) #No lines plot.
  lines(1:num_days, one_year$percent_full, col = color_by_year[2021 - i])
}
Example, Replacing Missing Values: In this example, we’ll use a for() loop to look for missing values in the raw Eagle Mountain Lake temperatures and then replace them with the average of the prior and subsequent values at the same depth. The first step in constructing a loop is to test out the code for a single iteration. Then, adapt it for stepping through a loop.
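The core idea can first be sketched on a toy vector (made-up temperatures, not the lake data): each NA is replaced by the mean of its two neighbors. This simple version assumes the neighbors themselves are not NA; runs of consecutive missing values are why the real code below uses na.rm = TRUE.

```r
x <- c(27.0, NA, 28.0, NA, 29.0)
miss <- which(is.na(x))
x[miss] <- (x[miss - 1] + x[miss + 1]) / 2   # average of prior and subsequent values
x
# [1] 27.0 27.5 28.0 28.5 29.0
```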
temp <- read.csv(file = "temp_through_09_12_2019.csv", header=T)
dim(temp)
# [1] 1692 23
colnames(temp)
#  [1] "Observation" "DateTime"    "X0"          "X0.5"        "X1"
#  [6] "X1.5"        "X2"          "X2.5"        "X3"          "X3.5"
# [11] "X4"          "X4.5"        "X5"          "X5.5"        "X6"
# [16] "X6.5"        "X7"          "X7.5"        "X8"          "X8.5"
# [21] "X9"          "X9.5"        "X10"
variable.names <- c("Obs", "DateTime", "D0.0", "D0.5", "D1.0", "D1.5", "D2.0", "D2.5",
                    "D3.0", "D3.5", "D4.0", "D4.5", "D5.0", "D5.5", "D6.0", "D6.5",
                    "D7.0", "D7.5", "D8.0", "D8.5", "D9.0", "D9.5", "D10.0")
colnames(temp) <- variable.names
#This will hold the existing data and the imputed values.
#Important for not overwriting the raw data.
new_temp <- temp
# Step (1)
# Initially looking at just the first column of temperatures
# summary(temp)
which(is.na(temp$D0.0))
#  [1]  747  748  909  910 1182 1183 1202 1211 1212 1216 1217 1218 1221 1222 1266
# [16] 1268 1269 1346 1348
missing <- which(is.na(temp$D0.0))
missing
#  [1]  747  748  909  910 1182 1183 1202 1211 1212 1216 1217 1218 1221 1222 1266
# [16] 1268 1269 1346 1348
temp_before <- temp[missing - 1, 3]
temp_after <- temp[missing + 1, 3]
cbind(temp_before, temp_after)
#       temp_before temp_after
#  [1,]      27.000        NaN
#  [2,]         NaN     26.625
#  [3,]      29.572        NaN
#  [4,]         NaN     30.134
#  [5,]      28.625        NaN
#  [6,]         NaN     28.375
#  [7,]      29.500     29.000
#  [8,]      28.625        NaN
#  [9,]         NaN     28.875
# [10,]      28.500        NaN
# [11,]         NaN        NaN
# [12,]         NaN     28.500
# [13,]      28.500        NaN
# [14,]         NaN     29.875
# [15,]      29.250     29.000
# [16,]      29.000        NaN
# [17,]         NaN     29.564
# [18,]      31.170     30.621
# [19,]      30.621     30.187
imputed_values <- apply(cbind(temp_before, temp_after), 1, mean, na.rm = TRUE)
new_temp[missing, 3] <- imputed_values
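As an aside, apply() with mean over the rows is equivalent to rowMeans(), which is usually faster; a toy check (made-up values):

```r
m <- cbind(temp_before = c(27.000, NA), temp_after = c(NA, 26.625))
apply(m, 1, mean, na.rm = TRUE)
# [1] 27.000 26.625
rowMeans(m, na.rm = TRUE)
# [1] 27.000 26.625
```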
# Step (2)
# Now, repeating this process for each column
for(i in 3:23){
  missing <- which(is.na(temp[, i]))
  # print(missing)
  temp_before <- temp[missing - 1, i]
  temp_after <- temp[missing + 1, i]
  imputed_values <- apply(cbind(temp_before, temp_after), 1, mean, na.rm = TRUE)
  # print(imputed_values)
  new_temp[missing, i] <- imputed_values
}
# summary(new_temp)
Example, Monte Carlo Analysis: A Monte Carlo analysis is one in which you sample repeatedly from a particular distribution and then perform some analysis of those samples. It gets its name from the principality of Monte Carlo, known in part for its casinos. In a survey distributed to freshman students at Mines enrolled in their first mathematics class, they were asked to “Mark one random spot inside of the square.”1 These values were collected and can be plotted with the code below.
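Before the spatial version, here is a minimal warm-up Monte Carlo sketch (a toy example, unrelated to the dot experiment): estimating the probability that two dice sum to 7 by repeated sampling.

```r
set.seed(1)   # for reproducibility
B <- 10000
# Roll two dice B times and record the sum of each roll
sums <- replicate(B, sum(sample(1:6, 2, replace = TRUE)))
mean(sums == 7)   # estimate; the true probability is 6/36 = 1/6
```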
suppressMessages(library(spatstat))
data <- read.csv(file="dot_experiment_data.csv", header=T)
open <- which(data$type=="empty")
#Rescaling data to be on the unit square
x1 <- data$x[open]/15
y1 <- data$y[open]/15
plot(x1, y1, col = rgb(0, 0, 0, 0.35), pch=19, xlab="", ylab="", bty="n",
     xaxt="n", yaxt="n")
lines(c(0,0),c(0,1))
lines(c(0,1),c(1,1))
lines(c(1,1),c(1,0))
lines(c(1,0),c(0,0))
1Hering, A. S., Durell, L., Morgan, G. (2021) “Illustrating randomness in statistics courses with spatial experiments,” The American Statistician, Now Online.
win <- owin(c(0, 1), c(0, 1))
data_open <- ppp(x1, y1, window = win)
# Warning: data contain duplicated points
#This computes the distance from each point to its nearest neighbor.
#Then, the median of these distances is computed.
open_nndist <- median(nndist(data_open))
#This is how you simulate 314 points distributed randomly inside of the unit square.
sim.dat <- runifpoint(314, nsim = 1)
plot(sim.dat, main="", pch = 19, col = rgb(0, 0, 0, 0.1))
median_nndist <- NULL
B <- 1000
for(i in 1:B){
sim.dat <- runifpoint(314, nsim = 1)
  median_nndist[i] <- median(nndist(sim.dat))
}
hist(median_nndist, xlim=c(0.020, 0.032),
xlab = "Median Nearest Neighbor Distance", main = "Monte Carlo Simulation", freq = F)
abline(v = open_nndist, col = "red", lwd = 2)
text(0.0223, 200, "Observed ", col = "red")
text(0.0225, 170, "Median NND", col = "red")
Figure: Histogram of the simulated median nearest neighbor distances (“Monte Carlo Simulation”), with the observed median NND marked by a red vertical line.
Thus, we can conclude that the observed median nearest neighbor distance for the points chosen by the students is very different from what would be expected if the 314 points were randomly distributed throughout the square.
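A final coding note: median_nndist above was initialized as NULL and grown inside the loop, which works but forces R to re-allocate storage on every trip. Pre-allocating the results vector is the more idiomatic pattern; a sketch using plain runif() values rather than spatstat points:

```r
B <- 1000
results <- numeric(B)   # pre-allocate the full length up front
for(i in 1:B){
  results[i] <- median(runif(10))
}
length(results)
# [1] 1000
```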