DSCC 201/401
Tools and Infrastructure for Data Science
March 8, 2021
R
• Brief history and overview
• R interfaces
• Language syntax and examples
• Useful libraries
What is R?
• Statistical programming language based on the S programming language
• Created in the 1990s as a free, open-source implementation of S (much as Linux relates to Unix)
• Interpreted language: code is run by the R interpreter rather than compiled ahead of time
• Dynamic typing: the type of a variable is determined at run time
• Excellent statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques
• Highly extensible (user-defined functions, packages from CRAN, the Comprehensive R Archive Network, etc.)
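To make the dynamic typing point concrete, here is a minimal sketch. The variable names are illustrative only; the idea is that one name can be rebound to values of different types, and `class()` reports the type at that moment.

```r
# The same name can be rebound to values of different types at run time.
x <- 42                 # numeric scalar
class_num <- class(x)

x <- "forty-two"        # now a character string
class_chr <- class(x)

x <- c(TRUE, FALSE)     # now a logical vector
class_lgl <- class(x)

print(c(class_num, class_chr, class_lgl))
```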
The R Environment: More Than Just Statistics
• An effective data handling and storage facility
• A suite of operators for calculations on arrays, in particular matrices
• A large, coherent, integrated collection of intermediate tools for data analysis
• Graphical facilities for data analysis and display either on-screen or on paper
• A well-developed, simple, and effective programming language, which includes:
  • Conditionals
  • Loops
  • User-defined functions
  • Input and output facilities
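The four language features listed above can be sketched in a few lines. This is only an illustration: the function name `sign_label`, the sample values, and the temporary CSV file are assumptions made for the example, not part of the course material.

```r
# User-defined function with a conditional: classify a number by its sign.
sign_label <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

# Loop: apply the function to each element and collect the results.
values <- c(-2, 0, 3)
labels <- character(length(values))
for (i in seq_along(values)) {
  labels[i] <- sign_label(values[i])
}

# Output and input: write a small table to a temporary file, then read it back.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(value = values, label = labels), tmp, row.names = FALSE)
round_trip <- read.csv(tmp, stringsAsFactors = FALSE)
print(round_trip)
```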
R is Portable
• Runs on desktop systems (Mac OS X, Windows, Linux, BSD, etc.)
• Apps for iOS and Android
• GUI not required
• Runs on Linux clusters and supercomputers
Two Ways to Run R
• Interactive Mode
• Batch Execution Mode
Hello, World!
cat("Hello, world!\n")
Running R in Interactive Mode
• No Graphical User Interface (No GUI)
• Graphical User Interface (GUI)
Running R in Interactive Mode with No GUI
• R Command Line Interface via SSH
  • SSH to bluehive.circ.rochester.edu
  • Load the R module and start R
• R Command Line Interface on BlueHive Desktop
  • Go to https://bluehive.circ.rochester.edu
  • Request resources
  • Launch terminal
  • Load the R module and start R
R in Interactive Mode
Running R in Interactive Mode with GUI
• RStudio
  • Go to https://bluehive.circ.rochester.edu
  • Request resources
  • Launch RStudio from the menu (Applications -> Data Analysis -> rstudio -> 1.0.143)
RStudio
Running R in Batch Execution Mode
• Do not need a GUI
• Typically used when running a script that is part of a workflow or data analysis processing pipeline
Running R in Batch Execution Mode (Example)
• Create a file: hello.r
• module load R
• R CMD BATCH hello.r hello.out
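The slides do not show what hello.r contains. A minimal sketch, assuming it simply reuses the earlier Hello, World! line:

```r
# hello.r: assumed contents (not shown in the slides), reusing the earlier example
cat("Hello, world!\n")
```

When this is run with R CMD BATCH, hello.out should contain the echoed command followed by its output.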
Running R in Batch Execution Mode (Example)
• We can run the batch processing on the BlueHive desktop (assuming appropriate resources)
• We can also create a Slurm script to submit an R batch processing job to the BlueHive cluster
Running R in Batch Execution Mode (Example)
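#!/bin/bash
# Slurm directives: partition, CPUs per task, wall-time limit, and memory
#SBATCH -p debug
#SBATCH -c 1
#SBATCH -t 1:00:00
#SBATCH --mem=2GB
# Files for standard output and standard error, and the job name
#SBATCH -o test.out
#SBATCH -e test.err
#SBATCH -J r_test
# Load the R module, then run the script in batch mode
module load R/3.5.1/b1
R CMD BATCH hello.r hello.out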
Objects in R
• Scalars and Characters: 1
• Vectors: c(1,2,3)
• Matrices: matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
• Lists (i.e., “heterogeneous vectors”): list(1,2,3,"hello",sqrt)
• Data Frames (“table” or “heterogeneous matrix”)
• Factors (“categorical data”)
• Functions (operations on objects)
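The object types listed on this slide can be created as follows. This is a sketch for orientation: the variable names, the factor levels, and the data frame columns (`id`, `level`) are made up for the example.

```r
s  <- 1                                   # numeric scalar (a length-1 vector)
ch <- "hello"                             # character string
v  <- c(1, 2, 3)                          # vector
m  <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)  # filled column by column
l  <- list(1, 2, 3, "hello", sqrt)        # list mixing numbers, text, and a function
f  <- factor(c("low", "high", "low"))     # factor (categorical data)
df <- data.frame(id = 1:3, level = f)     # data frame: columns of differing types
sq <- function(x) x^2                     # user-defined function

str(df)                  # compact summary of the data frame's structure
root <- l[[5]](16)       # call the function stored as the fifth list element
print(root)
```

Note that `matrix()` fills by column unless `byrow = TRUE` is given, so `m[1, 2]` here is 3, not 2.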
Scalars and Vectors
• R is a calculator (operations on scalars)
• Assignment of scalars and vectors
• Simple functions
• Sequence generators
• Operations on vectors
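Each topic on this slide can be tried at the R prompt. The following is a sketch with made-up values, using only base R functions:

```r
# R as a calculator (operations on scalars)
calc <- 2 + 3 * 4        # usual operator precedence: 14
quot <- 7 %/% 2          # integer division: 3
rem  <- 7 %% 2           # remainder: 1

# Assignment of scalars and vectors
a <- 5
v <- c(2, 4, 6, 8)

# Simple functions
total <- sum(v)
avg   <- mean(v)
n     <- length(v)

# Sequence generators
s1 <- 1:5
s2 <- seq(0, 1, by = 0.25)
s3 <- rep(c(1, 2), times = 3)

# Operations on vectors are element-wise
scaled  <- v * a          # multiply every element by a scalar
shifted <- v + s1[1:4]    # add two vectors element by element
big     <- v[v > 4]       # keep only elements satisfying a condition

print(list(scaled = scaled, shifted = shifted, big = big))
```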
Useful Commands
• Ctrl-C: Cancel the current input
• Ctrl-L: Clear the screen
• Up-Arrow: Step back through previous commands
• Down-Arrow: Step forward through the history toward the current command
• ?command: Show the help page for a function (e.g., ?mean)
Workspace Image and Objects
history()
history(max.show=Inf)
q()
save.image("project1.RData")
load("project1.RData")
ls()
rm()
str()
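A short sketch of how the workspace commands fit together, assuming the code is run at the top level of an R session or script. The objects `x` and `msg` are made up, and a temporary path stands in for project1.RData in the working directory. `history()` and `q()` are left out because they act on the session itself (command history, quitting).

```r
# Create a few objects in the workspace.
x <- 1:10
msg <- "hello"

print(ls())   # list the names of objects in the workspace
str(x)        # show the structure of an object

# Save the whole workspace to a file (temporary path used for illustration).
image_file <- file.path(tempdir(), "project1.RData")
save.image(image_file)

# Remove one object and confirm it is gone.
rm(x)
x_removed <- !exists("x", inherits = FALSE)

# Restore the saved objects from the workspace image.
load(image_file)
x_restored <- exists("x", inherits = FALSE) && identical(x, 1:10)
```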
Matrices
• Matrix creation and representation
• Matrix operations
• Solving systems of linear equations
• Eigenvalues and eigenvectors
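The matrix topics above map onto a handful of base R functions. The sketch below uses a small symmetric matrix `A` and right-hand side `b` chosen only for illustration.

```r
# Matrix creation and representation (filled column by column by default)
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)
b <- c(3, 5)

# Matrix operations
At   <- t(A)        # transpose
AA   <- A %*% A     # matrix product
Aelt <- A * A       # element-wise product
Ainv <- solve(A)    # inverse

# Solving the linear system A x = b
x <- solve(A, b)

# Eigenvalues and eigenvectors
e <- eigen(A)

print(x)
print(e$values)
```

Note the difference between `%*%` (matrix product) and `*` (element-wise product); mixing them up is a common source of errors.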