DSC201/401 Lecture Spring 2021
Statistical Analysis System (SAS)
Instructor: Will DiGrazio, MS, MBA
What is SAS
• SAS is a collection of modules that are used to process and analyze data
• SAS began in the late 60’s and early 70’s as a statistical package (the name SAS originally stood for Statistical Analysis System)
• However, unlike many competing statistical packages, SAS is also an extremely powerful, general-purpose programming language
• SAS is a predominant software choice in the Pharmaceutical industry and most Fortune 500 companies
Learning SAS by Example: A Programmer’s Guide, Ron Cody – SAS Publishing, 2007
Capabilities of SAS
SAS has been enhanced to provide state-of-the-art data mining tools and programs for Web development and analysis
Examples of SAS Components
Base SAS – Basic procedures and data management
SAS/AF – Application frames
SAS/STAT – Statistical analysis
SAS/QC – Quality control
SAS/GRAPH – Graphics and presentation
SAS/Life Sciences Analytics Software – Clinical Research
SAS/OR – Operations research
SAS Model Manager – Model optimization
SAS/ETS – Econometrics and Time Series Analysis
SAS Visual Analytics – Data visualization
SAS/IML – Interactive matrix language
Business Intelligence & Analytics – Enterprise Data Discovery
Pause
Why Use SAS
Reasons:
• Prevalent in industry: SAS has the largest market share of any commercial statistical software package based on the metric of average market capitalization
• Locally, Xerox and Excellus use SAS, as do the majority of banks, energy companies, and pharmaceutical companies
• Past UofR Data Science graduates have found jobs that require knowledge and use of SAS, at airlines and credit card companies, to name a few
• Businesses get Indemnification (security against legal liability for one’s actions)
• All updates and functionality have been validated for accuracy
• There are some funding agencies that require the research analysis be performed by SAS in order to remove the likelihood of erroneous results from unverified packages
• The support from the company is advanced and considered excellent
• Data handling processes are more efficient in handling Big Data
• The programming language is very simple to learn
• If you know SQL, you can use much of the same code inside a SAS PROC SQL step
• R code can be processed in certain SAS modules
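As a minimal sketch of the PROC SQL point above, the step below runs an ordinary SQL query inside SAS. It assumes the sample table sashelp.class that ships with SAS; that table is not part of the course files and is used only for illustration.

```sas
/* Sketch: an ordinary SQL query run inside PROC SQL.                 */
/* sashelp.class is a sample data set shipped with SAS (assumption:   */
/* it is available in your installation).                             */
proc sql;
   select Sex,
          count(*)    as N,
          avg(Height) as MeanHeight
      from sashelp.class
      group by Sex;
quit;
```

Note that a PROC SQL step ends with QUIT rather than RUN.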
Pause
Let’s Get Started
Use FastX To Connect To BlueHive • Walk through the instructions at https://info.circ.rochester.edu/BlueHive/FastX.html
Use FastX To Connect To BlueHive
• After logging in with your credentials and accepting the DUO request, select the launch session button
• Select the Default icon and then click the blue Launch button
• Click the blue Close button
• Click the blue font mate-session link in order to open a BlueHive compute session – NOTICE this particular compute session is on the BHC0087 compute node
Work With Course Related Files • Create a folder on your desktop by right clicking and selecting Create Folder
• Name the folder MySAS
Work With Course Related Files • Open up a terminal session so that we can copy over our course related files
Work With Course Related Files
• Type cp -R /public/wdigrazi/Data /home/yourNetID/Desktop/MySAS at the $ command prompt
To copy a directory, including all its files and subdirectories, to another directory, use the -R option after the cp command
Work With Course Related Files
• Type cd ~/Desktop/MySAS at the $ command prompt to change the directory location to the MySAS folder on your desktop
Work With Course Related Files
• Type ls at the $ command prompt in order to list the contents of the ~/Desktop/MySAS folder
• You should see the blue folder called Data listed as your output
• Exit the terminal window by typing exit at the $ command prompt
Work With Course Related Files
• Alternatively, you could just left double click on the MySAS folder
• If you see many items in the window, then you were successful in copying over the folder and its files
Open SAS On BlueHive (Option 1) • Open up a terminal session so that we can invoke SAS on a compute node
Open SAS On BlueHive (Option 1) • At the command prompt, type: module load sas
Open SAS On BlueHive (Option 1) • At the command prompt, type: sas
Initial SAS Software • Close the Change Notice window by clicking the close button
Open SAS On BlueHive (Option 2)
• Now that you have a BlueHive session open, click the Applications menu item and then click the Data Analysis item and then the SAS item and finally the 9.4b item
Open SAS On BlueHive (Option 2)
• For the pop up window that appears, just click the close button. Or if you don’t want this to appear anymore, check the Don’t show this dialog again and then click Close
Tip For Identifying The Head Node
netid@bluehive = Head Node. This is not an issue if you always use FastX.
Tip For Identifying The Compute Node
Anything other than netid@bluehive = Compute Node. You are always assigned a compute node automatically when using FastX.
Navigating The Display
• SAS: Explorer is used for accessing SAS files
Navigating The Display • SAS: Results is used to navigate through the many results that are created
Navigating The Display • SAS: Log is used to show the log output from a program that is run
Navigating The Display • SAS: Program Editor is used to write and run your SAS program code
Navigating The Display • SAS: Output is used to visually display the output details from your program
Set Up Your Windows • Take a moment to arrange your SAS windows the way you want them
Setting Up Your Environment
Step 1) Go to Tools➔Options➔Preferences➔DMS tab
Step 2) Go to Tools➔Options➔Preferences➔Editing tab
Setting Up Your Output
Step 3) Go to Tools➔Options➔Preferences➔Results tab
Step 4) Click the “View results as they are generated” option so that it looks raised
Step 5) Ensure the Create Listing option is depressed, as this sends output to the built-in SAS Output window
Step 6) Click the Select button and navigate to the directory /gpfs/fs1/home/yourNetID/Desktop/MySAS/Data/
Step 7) Click OK
Navigating The Display
• There are more windows that open in your SAS session as you invoke them, such as the Graph window and SAS Session Management window
Pause
Run First Sample Program
• This will provide a good overview of the structure of SAS programs
• We will use the veggies.txt file, which contains the following pieces of information that are separated by spaces
• Vegetable Name
• Product Code
• Days to Germination
• Number of Seeds
• Price
• In SAS terminology, each piece of information is called a variable
• Let’s start with a simple SAS program
• Reads data from a text file
• Produces some basic reports
Run First Sample Program • In SAS terminology, each line of data produces what is called an observation
veggies.txt
Cucumber 50104-A 55 30 195
Cucumber 51789-A 56 30 225
Carrot 50179-A 68 1500 395
Carrot 50872-A 65 1500 225
Corn 57224-A 75 200 295
Corn 62471-A 80 200 395
Corn 57828-A 66 200 295
Eggplant 52233-A 70 30 225
• How many observations are there and how many variables are there?
Run First Sample Program • In the Program Editor, type the following lines of SAS code
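The program text from this slide is not reproduced here, so the following is only a sketch of what such a first program might look like. It is consistent with the later slides (Print, Freq, and Means results, a data set named veg, and the NOCENTER option), but the variable names, the titles, and the location of veggies.txt inside MySAS/Data are assumptions.

```sas
/* Sketch of a first program: read veggies.txt and produce three reports. */
/* Assumption: veggies.txt was copied into the MySAS/Data folder.          */

options nocenter;   /* left-justify output instead of centering it */

data veg;
   infile '/home/yourNetID/Desktop/MySAS/Data/veggies.txt';
   /* Variable names are illustrative; $ marks character variables */
   input Name $ Code $ Days Number Price;
run;

title "List of the Raw Data";
proc print data=veg;
run;

title "Frequency Distribution of Vegetable Names";
proc freq data=veg;
   tables Name;
run;

title "Average Price and Days to Germination";
proc means data=veg;
   var Price Days;
run;
```

Replace yourNetID with your own NetID before submitting. Each line of veggies.txt becomes one observation, and each name on the INPUT statement becomes one variable.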
Run First Sample Program • Run your program by clicking Run in the menu and Submit
Run First Sample Program • View the differences after the run is complete
• SAS: Explorer window stayed the same
Run First Sample Program
Run First Sample Program • SAS: Results window has three new items (Print, Freq, and Means)
Run First Sample Program • SAS: Log window has output that explains what happened during the run
Run First Sample Program • SAS: Program Editor is empty – Why?
• SAS: Output window has resulting output
Run First Sample Program
• Go back to the SAS windows
Run First Sample Program
Run First Sample Program
• Recall your program in order to save it to your MySAS/Data/ directory
• Click on Run➔Recall Last Submit in the SAS: Program Editor window
Run First Sample Program • Save your program by selecting File and Save As in the Program Editor menu
Run First Sample Program
• Change the directory path to /home/yourNetID/Desktop/MySAS/Data and name the SAS program as veggiefirst.sas and click OK
Exit Out Of SAS • Exit SAS by selecting File and Exit in the Program Editor menu
Exit out of SAS ▪ Open SAS again
Invoke SAS On A Compute Node (Option 1) • At the command prompt, type: sas
• Close the Change Notice window by clicking the close button
Open SAS
Run First Sample Program • Go to Applications➔System Tools➔File Browser
• View the HTML output that was created
Run First Sample Program • Expand the directories until you get to the MySAS directory
Run First Sample Program • Scroll down until you see the sashtml.htm file
Run First Sample Program
• Left double click the sashtml.htm file to open the output in the Firefox browser
Run First Sample Program
• To run Firefox on the head node: open a terminal window, type module load firefox, then type firefox, and in the URL text box type:
• file:///home/yourNetID/Desktop/MySAS/Data/sashtml.htm
Controlling HTML Output File
• ods html body='your_file_name.htm' style=HTMLBlue; starts the output
• ods html close; ends the output
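A minimal sketch of how the two ODS statements bracket a procedure is shown below. The file name class_report.htm is hypothetical, and sashelp.class is a sample data set shipped with SAS, used here only so that the example does not depend on the course files.

```sas
/* Sketch: send procedure output to a named HTML file.              */
/* 'class_report.htm' is a hypothetical file name.                  */
ods html body='class_report.htm' style=HTMLBlue;

title "Listing Sent to an HTML File";
proc print data=sashelp.class;
run;

/* Close the HTML destination so the file is completely written */
ods html close;
```

Everything produced between the two ODS statements goes into the named file. Supplying a full path in body= is one way to control where the file is written.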
Controlling HTML Output File
• Example of the ttest2.htm file being created when the SAS program is run
Run First Sample Program
• Confirm the veggiefirst.sas file was saved to /home/yourNetID/Desktop/MySAS/Data by using the File Browser
Open First Sample Program
• Open your veggiefirst.sas file through the SAS interface of the Program Editor Window by selecting File and Open
Open First Sample Program • Select the veggiefirst.sas file and click OK
Run First Sample Program
• In the Program Editor, the following lines of SAS code will appear
Run First Sample Program • Run your program by clicking Run in the menu and Submit
Run First Sample Program
• Recall your program in order to save it to your MySAS/Data/ directory
• Click on Run➔Recall Last Submit in the SAS: Program Editor window
Introduction Is Complete
Pause
Foundations of the SAS Program
NEXT TOPICS
• SAS: Program Editor Line Commands
• SAS Names
• SAS Data Sets and SAS Data Types
SAS: Program Editor Line Commands
• Copy lines or blocks of lines
➢ c[N] copy 1 or N lines
➢ cc copy block
SAS: Program Editor Line Commands
• Determine the location of lines to be pasted
➢ a paste after
➢ b paste before
➢ o paste over existing blanks
SAS: Program Editor Line Commands
• Move lines or blocks of lines
➢ m[N] move
➢ mm move block
SAS: Program Editor Line Commands
• Delete lines or blocks of lines
➢ d[N] delete
➢ dd delete block
SAS: Program Editor Line Commands
• Insert blank lines or blocks of blank lines
➢ ia[N] insert after
➢ i[N] insert after
➢ ib[N] insert before
SAS Names • SAS names follow a simple naming rule:
• All SAS variable names and data set names can be no longer than 32 characters and must begin with a letter or the underscore ( _ ) character.
• The remaining characters in the name may be letters, digits, or the underscore character.
• Characters such as dashes and spaces are not allowed.
Valid SAS Names
Invalid SAS Names
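As a hedged illustration of the naming rule, the DATA step below asks SAS itself whether a few candidate names are acceptable. The candidate names are made up for this sketch: the first two follow the rule, and the last three break it (leading digit, dash, embedded space). NVALID is a SAS function that returns 1 for a valid name and 0 otherwise; the 'v7' argument selects the traditional naming rules described above.

```sas
/* Sketch: check candidate variable names against the SAS naming rule. */
/* The candidate names are hypothetical examples.                      */
data _null_;
   length name $ 40;
   do name = 'Total_Sales', '_temp1', '2ndQuarter', 'Unit-Price', 'First Name';
      valid = nvalid(name, 'v7');   /* 1 = valid name, 0 = invalid name */
      put name= valid=;             /* write the result to the SAS log  */
   end;
run;
```

The results appear in the SAS: Log window, one line per candidate name.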
SAS Data Sets and SAS Data Types
• When SAS reads data from anywhere (e.g. raw data, external files, spreadsheets, etc.) SAS will store the data in its own special form called a SAS Data Set.
• The file extension of a SAS Data Set in SAS 9.4 is .sas7bdat
• SAS Data Sets use a format specific to SAS, so they are meant to be read and written by SAS rather than by general-purpose programs
• If you opened a SAS Data Set with a different program (Microsoft Word, for example), the output would not be understandable as there would be funny looking graphics characters interspersed throughout the output making the output look like nonsense.
SAS Data Sets and SAS Data Types
• The good news is you don’t need to worry about how SAS stores its data or the structure of a SAS Data Set.
• However, it is important to understand that SAS Data Sets contain two parts:
➢ A descriptor portion
➢ A data portion
• Not only does SAS store the actual data values for you, it stores information about these values, such as storage lengths, labels, and formats.
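One way to look at the two parts is sketched below, using the sample data set sashelp.class that ships with SAS (an assumption made for illustration; it is not one of the course files). PROC CONTENTS reports the descriptor portion, and PROC PRINT lists values from the data portion.

```sas
/* Sketch: view the descriptor portion (variable names, types, lengths, */
/* labels, formats) of a data set.                                      */
title "Descriptor Portion";
proc contents data=sashelp.class;
run;

/* Sketch: view the first five observations of the data portion */
title "Data Portion (First Five Observations)";
proc print data=sashelp.class (obs=5);
run;
```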
SAS Data Sets and SAS Data Types
• SAS has only two types of variables:
➢ Numeric
➢ Character
• This makes SAS easier to comprehend than programming languages that have a multitude of data types (such as short, integer, long, float, double, and logical)
SAS Data Sets and SAS Data Types
• SAS determines a fixed storage length for every variable
• Most SAS users never need to think about storage lengths for numerical values, as they are stored in 8 bytes (approximately 14 or 15 significant digits depending on your operating system)
• Each character value (data stored as letters, special characters, and numerals) is assigned a fixed storage length explicitly by your program statements or by various rules that SAS has about the length of character values
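The sketch below shows both ways a character length can be set. The data set name, variable names, and data lines are hypothetical. Name gets an explicit length from the LENGTH statement, Code gets the default of 8 bytes that list input assigns to character variables, and Score is numeric and therefore stored in 8 bytes.

```sas
/* Sketch: explicit versus default storage lengths (hypothetical data). */
data lengths_demo;
   length Name $ 20;            /* explicit: 20 bytes for Name          */
   input Name $ Code $ Score;   /* Code defaults to 8 bytes; Score is   */
                                /* numeric and stored in 8 bytes        */
   datalines;
Alexandria A1 90
Bo B22 85
;
run;

/* Confirm the type and length of each variable */
proc contents data=lengths_demo;
run;
```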
Q&A
Questions:
1. Identify which of the following variable names are valid SAS names:
▪ Height
▪ HeightInCentimeters
▪ Height_in_centimeters
▪ Wt-Kg
▪ x123y456
▪ 76Trombones
▪ MiXeDCasE
2. You have a data set consisting of Student ID, English, History, Math, and Science test scores on 10 students.
▪ a. The number of variables is __________
▪ b. The number of observations is __________
3. What is the default storage length for SAS numeric variables (in bytes)?
Q&A
Answers:
1. Identify which of the following variable names are valid SAS names:
▪ Height = valid
▪ HeightInCentimeters = valid
▪ Height_in_centimeters = valid
▪ Wt-Kg = invalid (contains a dash)
▪ x123y456 = valid
▪ 76Trombones = invalid (starts with a number)
▪ MiXeDCasE = valid
2. You have a data set consisting of Student ID, English, History, Math, and Science test scores on 10 students.
▪ a. The number of variables is 5
▪ b. The number of observations is 10
3. What is the default storage length for SAS numeric variables (in bytes)? 8
Pause
SAS Programming
NEXT TOPICS
• Program to read raw data and generate a report
• Enhancements to the program
Program to read raw data and generate a report
The task:
• You have data values in a text file. These values represent Gender (M or F), Age, Height, and Weight. Each data value is separated from the next by one or more blanks.
• You want to produce two reports:
➢ one showing the frequencies for Gender (how many Ms and Fs)
➢ the other showing the average age, height, and weight for all the subjects
Program to read raw data and generate a report
• Here is a listing of the raw data file that you want to analyze in the file called mydata.txt:
Program to read raw data and generate a report • Type out this SAS Program in the SAS: Program Editor
Program to read raw data and generate a report
data demographic;
   infile '/home/yourNetID/Desktop/MySAS/Data/mydata.txt';
   input Gender $ Age Height Weight;
run;
• This program consists of one DATA step followed by two PROC steps (FREQ and MEANS).
• The DATA step begins with the word DATA and in this program, the name of the SAS Data Set being created is Demographic.
• The next line (the INFILE statement) tells SAS where the data values are coming from.
• In this example, the text file mydata.txt is in the folder /home/yourNetID/Desktop/MySAS/Data/ on the BlueHive cluster system.
• The INPUT statement shown here is one of four different methods that SAS has for reading raw data.
• Column
• Formatted
• List
• Named
• This program uses the list input method, appropriate for data values separated by delimiters.
• The default data delimiter for SAS is the blank. SAS can also read data separated by any other delimiter (for example, commas, tabs) with a minor change to the INFILE statement.
• When you use the list input method for reading data, you only need to list the variable names you want to give each data value.
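The "minor change to the INFILE statement" mentioned above can be sketched as follows. Both file names are hypothetical variants of mydata.txt and are not part of the course files. The DSD option makes the comma the default delimiter, and DLM='09'x names the tab character by its hexadecimal code.

```sas
/* Sketch: the same list input, reading comma-separated values.   */
/* mydata.csv is a hypothetical comma-delimited copy of the data. */
data demographic_csv;
   infile '/home/yourNetID/Desktop/MySAS/Data/mydata.csv' dsd;
   input Gender $ Age Height Weight;
run;

/* Sketch: the same list input, reading tab-separated values.         */
/* mydata_tab.txt is a hypothetical tab-delimited copy of the data.   */
data demographic_tab;
   infile '/home/yourNetID/Desktop/MySAS/Data/mydata_tab.txt' dlm='09'x;
   input Gender $ Age Height Weight;
run;
```

In both cases the INPUT statement is unchanged; only the INFILE options differ.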
• Notice the dollar sign ($) following the variable name Gender. A dollar sign following a variable name tells SAS that the values of that variable (here, Gender) are character values.
• Without a dollar sign, SAS assumes values are numbers and should be stored as SAS numeric values.
• Finally, the DATA step ends with a RUN statement.
• Depending on the platform on which you run your SAS program, RUN statements are not always necessary, but it is good practice to use them.
• In this program we placed a blank line between each step to make the program easier to read. Feel free to include blank lines whenever you wish to make the program more readable.
Program to read raw data and generate a report
title "Gender Frequencies";
proc freq data=demographic;
   tables Gender;
run;
• There are several TITLE statements in this program.
• The text following the keyword TITLE (placed in single or double quotes) is printed at the top of each page of SAS output.
• Statements such as the TITLE statement are called global statements. The term global refers to the fact that the operations these statements perform are not tied to one single DATA or PROC step. They affect the entire SAS environment.
• In addition, the operations performed by these global statements remain in effect until they are changed.
• For example, if you have a single TITLE statement in the beginning of your program, that title will head every page of output from that point on until you write a new TITLE statement.
• It is a good practice to place a TITLE statement before every procedure that produces output to make it easy for someone to read and understand the information on the page. If you exit your SAS session, your titles are all reset and you need to submit new TITLE statements if you want them to appear.
• The FREQ procedure (also called PROC FREQ) is one of the many built-in SAS procedures. As the name implies, this procedure counts frequencies of data values.
• To tell this procedure which variables to count frequencies on, you add an additional statement—the TABLES (or TABLE) statement.
• Following the word TABLES, you list those variables for which you want frequency counts. You could actually omit this statement but, if you did, PROC FREQ would compute frequencies for every variable in your data set.
Program to read raw data and generate a report
title "Summary Statistics";
proc means data=demographic;
   var Age Height Weight;
run;
• PROC MEANS is another built-in SAS procedure that computes means (averages) as well as some other statistics such as the minimum and maximum value of each variable.
• A VAR (short for variables) statement supplies PROC MEANS with a list of analysis variables (which must be numeric) for which you want to compute these statistics.
• Without a VAR statement, PROC MEANS computes statistics on every numeric variable in your data set.
Note: The title is centered because by default SAS centers all output. If you wanted to have the output be left-justified, then you would use the system option called NOCENTER, as was done in the initial data=veg SAS program.
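A minimal sketch of the NOCENTER option follows. It uses the sample data set sashelp.class shipped with SAS (an assumption for illustration, chosen because it also has Age, Height, and Weight variables) and requests a few specific statistics on the PROC MEANS statement.

```sas
/* Sketch: left-justify titles and output with the NOCENTER option. */
options nocenter;

title "Summary Statistics";
proc means data=sashelp.class n mean min max;
   var Age Height Weight;
run;

/* Restore the default centering for later output */
options center;
```

Like TITLE, OPTIONS is a global statement, so NOCENTER stays in effect until it is changed.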
Program to read raw data and generate a report SAS: Log
First, you see that the data came from the mydata.txt file located in the /home/yourNetID/Desktop/MySAS/Data folder.
Program to read raw data and generate a report SAS: Log
Next, you see a note showing that five records (lines) of data were read and that the shortest line was 11 characters long and the longest was 13.
Program to read raw data and generate a report SAS: Log
• The next note indicates that SAS created a data set called Work.Demographic.
• The Demographic part makes sense because that is the name you used on the DATA statement.
• The Work part is the way SAS tells you that this is a temporary data set—when you end the SAS session, this data set will no longer exist.
• We will see later how to make SAS data sets permanent.
• Also, as part of this note, you see that the Work.Demographic data set has five observations and four variables.
Program to read raw data and generate a report SAS: Log
The remaining notes show the real and CPU time used by SAS to process each procedure.
Enhancements to the program
Let’s enhance the program by adding a comment statement and computing a new variable (BMI) based on the height and weight data
Enhancements to the program
• The statement beginning with an asterisk (*) is called a comment statement.
• It enables you to include comments for yourself or others reading your program later.
Enhancements to the program
• The /* combination can also be used to comment any section of text until it reaches the end with a */ combination
• This could have been /*Program name: bla bla bla October 2019*/
Enhancements to the program
• The statement that starts with BMI= is called an assignment statement. It is an instruction to perform the computation on the right-hand side of the equal sign and assign the resulting value to the variable named on the left.
• In this example, you are creating a new variable named BMI that is defined as a person's weight (in kilograms) divided by the square of the person's height (in meters).
• BMI (body mass index) is a useful index of obesity.
• Medical researchers often use BMI when computing the health risks of various diseases (such as heart attacks).
Enhancements to the program
The BMI assignment statement uses three of the basic arithmetic operators used by SAS:
• the forward slash (/) for division
• the asterisk (*) for multiplication
• and the double asterisk (**) for exponentiation
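The enhanced program itself is not reproduced in the text, so the following is only a sketch of what it might look like. It assumes the raw data record Height in inches and Weight in pounds, which is why the sketch converts with the approximate factors 2.2 pounds per kilogram and 0.0254 meters per inch; if your data are already metric, drop the conversions. The program name in the comment is hypothetical.

```sas
/* Program name: bmi_sketch.sas (hypothetical)
   Purpose: read mydata.txt and compute a body mass index (BMI)
            for each subject.
   Assumption: Height is in inches and Weight is in pounds. */

data demographic;
   infile '/home/yourNetID/Desktop/MySAS/Data/mydata.txt';
   input Gender $ Age Height Weight;
   *Compute a body mass index (BMI);
   BMI = (Weight / 2.2) / (Height * 0.0254)**2;
run;

title "Listing with Computed BMI";
proc print data=demographic;
run;
```

The sketch shows both comment styles (the /* ... */ block and the statement that begins with an asterisk) and uses division, multiplication, and exponentiation in a single assignment statement.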
Here is the full set of arithmetic operators (the same order-of-operations rules we learned in school algebra also apply to SAS arithmetic operators):
• ** exponentiation
• * multiplication
• / division
• + addition
• - subtraction
Q&A
Questions:
1. You have a text file called stocks.txt containing a stock symbol, a price, and the number of shares. Here are some sample lines of data:
File stocks.txt
AMGN 67.66 100
DELL 24.60 200
GE 34.50 100
HPQ 32.32 120
IBM 82.25 50
MOT 30.24 100
a) Using this raw data file, create a temporary SAS data set (Portfolio). Choose your own variable names for the stock symbol, price, and number of shares. In addition, create a new variable (call it Value) equal to the stock price times the number of shares. Include a comment in your program describing the purpose of the program, your name, and the date the program was written.
b) Write the appropriate statements to compute the average price and the average number of shares of your stocks.
Stay Safe, Healthy & Motivated