7CCSMBDT – Big Data Technologies Practical
Data preparation tasks using R
As mentioned in Lecture 1, the analytics flow for Big Data starts with data collection and then proceeds into data preparation. In this practical, we will use already collected data that have been stored as .CSV files. The objective is to perform a range of data preparation tasks using R.
Try as many examples as you can, looking at the references in Section 1 below. If you have time, try also the 2 optional questions.
1. Overview of commands in R
a) Main commands with minimal examples can be found at
https://www.personality-project.org/r/r.commands.html
b) Type the following to learn more about a command:
help(command)
c) Detailed reference for commands:
http://stat.ethz.ch/R-manual/R-patched/library/base/html/00Index.html
2. Getting started
(a) Download all datasets (Employee, Shopping, Vitals) from KEATS ( or from https://www.dropbox.com/sh/sqerdjl6c8nspsx/AAAObWqOMnO4QfeUJ_7PCioia?dl=0 )
(b) Store the datasets in a subdirectory of your home directory (e.g., BDT). For creating a subdirectory you will need the mkdir command.
(c) Open Rstudio or R.
(d) See your working directory with getwd(), and change it to the subdirectory you created
with setwd(“/home/X”), where “X” is the subdirectory you created.
(e) Create a dataframe Employees from the dataset Employees.csv using read.csv(). Do the
same for the Shopping and Vitals datasets.
Sol: Employee<-read.csv(“Employee.csv”), Vitals<-read.csv(“Vitals.csv”), Shopping<- read.csv(“Shopping.csv”)
7CCSMBDT – Big Data Technologies Practical
(f) Invoke the editor on the dataset using fix(). Click on the cell in the 1st row and 2nd column with the mouse and then change its value to NA. Then click Quit (top right corner of the editor window). Last, use fix() again to see the changes you have made.
Sol: fix(Employee)
(g) Print the dataframe for Vitals without displaying any decimal digits for Result (e.g., 32 instead of 32.2)
Sol: print(Vitals,digits=0) 3. Missing values
In R, missing values are represented with NA. You have added a missing value in Employee before. For the questions below, read the reference for:
is.na(), rowSums(), colSums(), complete.cases(), na.omit(), mean(), sd(), max(), min() (a) Identify the row and column of the missing value in Employee, using rowSums() and
colSums().
Sol: rowSums(is.na(Employee))
colSums(is.na(Employee))
(b) Display the records with missing values in Employee, using complete.cases()
Sol: Employee[!complete.cases(Employee),]
(c) Create a new data frame, Employee2, after omitting missing values from Employee.
Sol: Employee2<-na.omit(Employee)
(d) Compute the minimum, maximum, arithmetic mean, and standard deviation of the Employee’s attribute Age. Do the same on Employee2’s attribute Age. What do you observe?
Sol: min(Employee$Age), max(...), mean(...), sd(...) yield NA. The same functions on Employee2$Age yield numerical results.
(e) Compute the minimum, maximum, arithmetic mean, and standard deviation of the attribute Age in Employee, after excluding missing values from analysis.
Sol: We need to use na.rm=TRUE as second argument to the functions. E.g., min(Employee$Age, na.rm=TRUE)
7CCSMBDT – Big Data Technologies Practical
4. Creating New Variables.
For the question below, recall that a$b is used to reference the attribute b of the dataframe a. Also, see the reference manual for the command abs().
(a) Each employee in Employee got a salary increase of 7%. Add two columns in the dataframe Employee, one called Increase, which contains the value of the increase, and another called NewSalary, which contains the salary of each employee after the increase.
Sol: Employee$Increase<-Employee$Salary*0.07 and Employee$NewSalary<- Employee$Salary+Employee$Increase
(b) Each employee in dataframe Employee2 that you created in Question 3 got a salary increase of 3%. Do the same as 4(a) for Employee 2.
Sol: Same as 4(a) but for Employee2
(c) Calculate the absolute value of the difference between the mean of the NewSalary attribute in Employee and the mean of the NewSalary in Employee2.
Sol: abs(mean(Employee$NewSalary)-mean(Employee2$NewSalary))
5. Sorting data
For this question, see the reference for the command order().
Create a new dataframe Employee3, which is the same as the Employee dataframe but is sorted with respect to Age in decreasing order, breaking ties with Salary in increasing order, and contains the missing values first.
Sol: Employee3<-Employee[order(-Employee$Age,Employee$Salary,na.last=FALSE),]
6. Identifying and removing duplicate data
For this question, see the reference for duplicated() and unique()
(a) Print the position of all duplicate elements in the attribute Age of Employee
Sol: duplicated(Employee$Age)
(b) Create a dataframe Uniq_Age containing only records with unique values in the attribute Age of Employee.
Sol: Uniq_Age<-Employee[!duplicated(Employee$Age),]
7CCSMBDT – Big Data Technologies Practical
(c) Create a vector Uniq_Age_vec containing only the unique values of the attribute Age of Employee
Sol: Uniq_Age_vec<-unique(Employee$Age)
7. Data filtering
Read the reference for which() and subset()
(a) Create a new dataframe Mul_con1 containing only the records of Employee with Age
lower than 40 and Gender equal to Male, using which().
Sol: Mul_con1<--Employee[which(Employee$Age<40 & Employee$Gender==”Male”),]
(b) Create a new dataframe containing the Emp_Id and Age values of each record in Employee with Age>=40 or Age<20.
Sol: Test<-subset(Employee,Age>=40 | Age<20, select=c(Emp_Id,Age))
8. Conditional Processing.
Read the reference for ifelse(). This serves as a shorthand for the if(){...}else{...} statement. Create an attribute Salary_group with value High if the Salary attribute of Employee is at
least 50000, Medium if it is over 20000 and less than 50000, and Low if it is at most 20000. Sol: Employee$Salary_group<-ifelse(Employee$Salary>20000 &
Employee$Salary<50000,”Medium”,ifelse(Employee$Salary>=50000,”High”,”Low”))
9. Modifying variables
Read how to subset variables from http://www.statmethods.net/management/subset.html a) Execute: Shopping<-read.csv(“Shopping.csv”)
b) Create a new dataframe Sub_shop with the following attributes of Shopping:
“Brand”, “Safety”, “Look.”
Sol: myvars <- c(“Brand”, “Safety”, “Look”) Sub_shop <- Shop[myvars]
c) Create a new dataframe Sub_shop2 with all attributes of Shopping except the attribute with ids (indexes) 4 and 6.
Sol: Sub_shop2<-Shop[c(-4,-6)]
7CCSMBDT – Big Data Technologies Practical
d) Execute fix(Sub_shop). Click on an attribute name and Select Change Name. Then type a new name. Click quit and then execute fix(Sub_shop) again. What do you observe?
Sol: The name of the attribute changed.
-------------------------- Optional questions (try them if you have time) -----------------------------------
10. Functions for removing whitespace and string substitution.
Read the reference for trimws() and substr().
a) Execute: phrase<-“ Some text to test functions in R “
b) Remove the whitespace characters from phrase.
Sol: trimws(phrase)
c) Change phrase to “Something to test functions in R” with a single command. Sol: substr(phrase,5,9)<-”thing”
11. Combining Datasets.
Read the reference for rbind(), cbind(), dim(), typeof(), as.character(),merge().
a) Create a new dataframe emp1 containing the first and last record of Employees. Assume that you do not know the index of the last record.
Sol: emp1<-rbind(Employee[1,],Employee[dim(Employee)[1],])
b) Execute typeof(Employee$Travel). What do you observe?
Sol: The type of Employee$Travel is integer.
c) Create a new dataframe emp2 containing the first and last attribute of Employees. Assume that you do not know the index of the last attribute.
Sol: emp2<-cbind(Employee[,1],as.character(Employee[,dim(Employee)[2]]))
The as.character() is used because of the type of Employee$Travel. Withouti it, we will get a column with integers corresponding to the values of Employee$Travel.
7CCSMBDT – Big Data Technologies Practical
d) Create a new dataframe emp3 with two rows: [Emp_Id=10212, New=1] and [Emp_Id=10188,New=1]. Merge emp1 and emp3.
Sol: emp3<-data.frame(Emp_Id=c(10212,10188),New=c(1,1)) and merge(emp,emp3,by=c(“Emp_Id”))