/* REVIEW OF THE BASICS OF GETTING DATA INTO SAS
MULTIPLE METHODS FOR GETTING DATA INTO SAS
1. ENTER DATA DIRECTLY INTO THE PROGRAM
2. CREATE FROM RAW FILES LOCATED SEPARATELY FROM THE PROGRAM
3. CONVERT FROM OTHER SOFTWARE
4. READ OTHER SOFTWARE’S DATA DIRECTLY
– TWO TYPES OF DATA
1. NUMERIC
2. CHARACTER
*/
*TOPICS: ENTERING DATA DIRECTLY, CASE SENSITIVITY IN SAS, TITLES;
*What is SAS case sensitive about?;
data gpa;
input F_Name $ ID Gender $ GPA Height Weight;
datalines;
Hector 123 M 3.5 59 155
Nancy 328 f 3.7 52 99
Edward 747 M 2.4 62 205
Michelle 778 F 3.0 54 115
Sampson 289 m 3.5 60 180
;
run;
*What do you expect the distribution of gender to be in the above gpa1 set?;
/* SHOULD USE TITLES TO INDICATE WHAT IS BEING OUTPUTTED –
EXPECTED TO RECEIVE FULL CREDIT FOR CORRECT OUTPUT */
*Get the frequency distribution of gender;
title ‘Gender Frequencies for GPA Data’;
proc freq data=gpa;
table gender;
run;
*What will the title be on the next set of output?;
*Where does the data set gpa live?;
/* TOPIC: TWO TYPES OF DATA SETS: PERMANENT AND TEMPORARY
– MUST CREATE A SAS LIBRARY EACH SESSION, TO INDICATE WHERE THE DATA ARE BEING STORED,
BUT DO NOT HAVE TO CREATE DATA SETS EACH SESSION IF THEY ARE PERMANENT
– CAN CHANGE THE NAME OF THE LIBRARY EACH SESSION, AS LONG AS THE PATH STAYS THE SAME.
*/
/* As mentioned in prior lecture, we will learn more about macro variables later in the course, but need
to set the pathname to connect to where the data live on your computer. i.e. for EC it is*/
/*SUBSTITUTING TEXT WITH %LET
%LET is a macro statement that assigns a value to a macro variable. Despite its simplicity, by
learning this one statement you will greatly increase the flexibility of your programs.
Suppose you have a program that you run once a month. Every time you have to edit the program so it will select data for the
correct month and print the correct title. This is time-consuming and prone to errors. You can use %LET to create a macro
variable. Then you can change the value of the macro variable in the %LET statement, and SAS will repeat the new value
throughout your program.
The general form a %LET statement is
%LET macro-variable-name = value;
where macro-variable-name is a name you make up following the standard rules for SAS names (32 characters or fewer in
length; starting with a letter or underscore; and containing only letters, numerals, or underscores). Value is the text to be
substituted for the macro variable name, and can be up to 64,000 characters long. The following statements each create a
macro variable.
%LET iterations = 5;
%LET winner = ;*/
%let pathname=C:\BIS679A\BIS679A_9_13_2021\;
*Example of creating permanent data set;
*(Taken from SAS little book: “SAS data sets have a two-level name such as WORK.BANANA.
The first level is called its libref (short for SAS data library reference),
and the second level is the member name that uniquely identifies the data set within the library.
A SAS data library is a collection of SAS data sets residing in the same location.
So a libref is like a nickname for that location.
Sometimes a libref refers to a physical location, such as a hard drive or flash drive.
Other times it refers to a logical location such as a directory or folder”;
*Define a library;
libname one “&pathname”;
*To call a macro variable, use the ampersand and
put in double quotation marks (if need quotation marks). Here the libname “one” is
located on my C drive;
*Creating a permanent data set;
*Use the set statement to include in the new data set all that is contained in the old data set;
data one.gpa; set gpa;
run;
*Same path, but different library name;
libname two “&pathname”;
/* TOPIC: FUNCTIONS
FUNCTIONS HAVE ARGUMENTS SURROUNDED BY PARENTHESES
SAS HELP MENU: FUNCTIONS (EX: DATE/TIME; MATHEMATICAL)
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245860.htm
*/
*Using the upcase function to make all of the values of gender equivalent;
data gpa1; set gpa;
gender=upcase(gender); *reassign to the same variable name. Is this a good idea?;
run;
*Is it a good idea to replace the data set? How would you do that?;
*What is the distribution of gender now?;
proc freq data=gpa1; tables gender; run;
/* TOPIC: GETTING A SUMMARY OF THE DATA SET – PROC CONTENTS */
* Use proc contents to get information about the data set, use varnum option if want variables printed
as they appear in the data set;
title ‘Contents of GPA data set’;
proc contents data=gpa1 /*order=varnum*/;
run;
/*TOPIC: NAMING OF DATA SETS */
* What happens if you do not name the data set?;
data;
input F_Name $ ID Gender $ GPA Height Weight;
datalines;
Hector 123 M 3.5 59 155
Nancy 328 f 3.7 52 99
Edward 747 M 2.4 62 205
Michelle 778 F 3.0 54 115
Sampson 289 m 3.5 60 180
;
run;
/* TOPIC: FREE FORMATTED INPUT – FROM A DATA FILE*/
*(1) In the free format, there is no skipping over values and all missing values
must be represented by a period;
data fuel;
infile “&pathname.fuel_consumption1.txt”;
*Need to put a period after the macro variable name so that SAS knows where the name ends;
input State $ pop tax nlic inc road fuelc dlic fuel;
run;
title ‘Fuel Data – Consumption 1’;
proc print data=fuel;
run;
*(2) Reading in data using the column designation;
data fuel1;
infile “&pathname.fuel_consumption2.txt”;
input State $ 1-2 pop 4-7 tax 9-12 nlic 14-17 inc 19-23 road 25-28 fuelc 30-33 dlic 35-38 fuel 40-42;
run;
*What do you notice about the data and significant digits?;
*What do you think the problem is? How would you possibly fix it?;
title ‘Fuel Data’;
title2 ‘Consumption 2’;
proc print data=fuel1;
run;
* (3) Reading a subset of the variables – for example, creating an analysis data set;
*What is an analysis data set?;
data fuel3;
infile “&pathname.fuel_consumption2.txt”;
input State $ 1-2 pop 4-7 road 25-29 fuel 41-43;
run;
title “Subset of the Variables in the Fuel Data Set”;
proc print data=fuel3;
run;
*What is another way that you could create a subset of a data set?;
/* TOPIC: INFORMATS
INFORMATS CAN BE USED IN A LOT OF DIFFERENT SETTINGS, BUT THEY ARE MOST IMPORTANT
WHEN THE DATA IS NOT STRAIGHTFORWARD
EXAMPLES: DATES, COMMA SEPARATED NUMBERS, DOLLAR VALUES (WITH DOLLAR SIGN) */
*What are the potential issues with the GRADES.txt data set?
*If we use no informats to read in the data;
data grade;
infile “&pathname.GRADES.txt” Firstobs=2;
input id 1-4 sex $ 9-14 @17 testdate score1 33-34 score2 41-42 score3 49-50 @57 salary ;
run;
*Look at the contents;
title “Contents of GRADES data set”;
proc contents data=grade;
run;
*Sometimes just looking at the contents does not give you the full picture – should print the data set (if small) or a
few observations of the data set (if large)
WHY?;
title “GRADES data sets”;
proc print data=grade;
run;
/* http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001239776.htm */
*Reading in the data using the proper informats;
data grades;
infile “&pathname.GRADES.txt” Firstobs=2;
input id 1-4 sex $ 9-14 @17 testdate MMDDYY10. score1 33-34 score2 41-42 score3 49-50 @57 salary COMMA6. ;
run;
title ‘Grades Data – Using Informats’;
proc print data=grades;
run;
*What is the value of testdate?
*How do we get the data to appear the way we want it to?
/*TOPIC: FORMATS */
/* THERE IS A DIFFERENCE BETWEEN INFORMATS AND FORMATS */
*Including formats in the data set;
data grades1; set grades; *Create a new data set with the formats;
format testdate mmddyy10. salary comma6.;
run;
title ‘Grades Data – Using Informats and Formats’;
proc print data=grades1;
run;
/*TOPIC: MULTIPLE LINES OF DATA PER OBSERVATION */
data multi;
input patient age / rep1 / rep2 / rep3 / rep4;
datalines;
1 35
130
140
150
160
2 45
190
185
175
160
3 19
110
110
110
110
;
run;
title “Multiple Lines of Data for a Single Observation”;
proc print data=multi;
run;
/* TOPIC: LABELS
– IF YOU USE LABELS IN THE DATA SET, THE LABEL IS PERMANENT.
IF YOU USE A LABEL IN THE PROCEDURE STEP, ONLY FOR THAT PROCEDURE */
*Why would we want to use labels?;
title ‘GPA data with gender label’;
proc freq data=gpa1;
table gender;
label gender “Gender: M=Male, F=Female”;
run;
*Why didn’t the above statement run? How do we fix it?;
*What happended to the label?;
title “Contents of the GPA data”;
proc contents data=gpa1;
run;
*Creating a permanent label in the data step;
data gpa2; set gpa1;
label gender=”Gender: M=Male, F=Female”;
run;
title “Data with permanent label”;
proc freq data=gpa2;
table gender;
run;
title “Contents of gpa2 data”;
proc contents data=gpa2;
run;
/* TOPIC: READING DELIMITED FILES
SAS CAN RECOGNIZE ANY DELIMITER – IT JUST NEEDS TO KNOW WHAT IT IS
IF HAVE A TAB DELIMITED FILE: USE DLM=’09’X
*/
/* OPTION 1: DLM OR DELIMITER OPTION */
*Comma Delimited Data;
data salary;
infile “&pathname.salary.csv” DLM=’,’ firstobs=2;
input obs degree $ rank $ sex $ year salary;
run;
title “Salary Data”;
proc print data=salary;
run;
/*SAS WILL INTERPRET 2 OR MORE DELIMITERS IN A ROW AS A SINGLE DELIMITER (WILL NOT TREAT AS MISSING DATA)*/
data test;
infile “&pathname.TEST.txt” DLM=’,’ firstobs=2;
input Name $ Test1-Test4;
run;
title “Test data”;
proc print data=test;
run;
*Why does the data only contain 3 rows?;
/*OPTION 2: DSD (DELIMITED SEPARATED DATA) OPTIONS
DSD:
1. IGNORES DELIMITERS IN QUOTES
2. DOES NOT READ QUOTES AS PART OF THE DATA
3. TREATS TWO DELIMITERS IN A ROW AS A MISSING VALUE
4. DEFAULT IS A COMMA SEPARATED FILE
*/
data test1;
infile “&pathname.TEST.txt” DSD firstobs=2;
input Name $ Test1-Test4;
run;
title “Test Data using the DSD option”;
proc print data=test1;
run;
/* IF THE DELIMTER IS NOT A COMMA, HAVE TO USE THE DLM OPTION WITH DSD */
data cancer;
infile “&pathname.cancer.txt” DSD DLM=”#” firstobs=2; *What does the firstobs option do?;
input Name $ Cancer $ Treatment $;
run;
*What is wrong with the output?;
title “Cancer Data Set with # delimiter”;
proc print data=cancer;
run;
*How do we read the data in correctly?;
/* TOPIC: IMPORTING DATA INTO SAS USING PROC IMPORT
– DEFAULT IS TO GET THE VARIABLE NAMES FROM THE FIRST LINE
– READS LENGTH AS IT SHOULD
– RECOGNIZES TYPES OF VARIABLES – CHARACTER VERSUS NUMERIC
– CAN RECONIZE SOME FORMATS. EXAMPLE: DATES
DATAFILE: WHAT FILE TO READ IN
DBMS: TYPE OF FILE (CSV: .CSV (COMMA SEPARATED VALUE); TAB: .TXT (TAB DELIMITED); DLM: ALL OTHERS INCLUDING SPACE)
OUT: NAME OF THE SAS DATA SET TO BE CREATED
*/
*This is an example where it is SUPER important to check the log;
proc import datafile=”&pathname.TEST.txt” DBMS=CSV out=test1;
run;
*How do we resolve the issue?;
*Another example where there are no variable names and a different delimiter;
proc import datafile=”&pathname.cancer1.txt” replace DBMS=DLM out=cancer2;
delimiter=’#’;
getnames=no;
run;
title “Cancer data with no names”;
proc print data=cancer2;
run;
/*TRY TO ENTER USING IMPORT WIZARD (CALL DATA SET CANCER 3)
NOTE: MUST ENTER THE DELIMITER OPTIONS – CAN ONLY HAVE ONE CHARACTER IN THE DELIMITER –
OTHERWISE MUST USE THE IMPORT STATEMENT
*/
title “Cancer data using Import Wizard”;
proc print data=cancer3;
run;
/*CAN CONVERT OTHER SOFTWARE’S DATA FILE INTO SAS DATA FILE
CAN USE IMPORT STATEMENT OR THE IMPORT WIZARD*/
proc import datafile=”&pathname.Penicillin.xlsx” out=penicillin replace;
sheet=”Penicillin$”;
label sample=”Batch of the sample”;
run;
title “Penicillin Data”;
proc print data=penicillin;
run;
title “Contents of Penicillin Data”;
proc contents data=penicillin;
run;
/*TOPIC: CAN EXPORT DATA FROM SAS – USING EXPORT STATEMENT OR EXPORT WIZARD*/
*Creating a comma delimited file;
proc export data=penicillin outfile=”&pathname.Penicillin.csv”;
run;
/*TOPIC: READING PART OF THE DATA SET IN (ONLY THAT WHICH YOU ARE CONCERNED WITH)
USING THE IF STATEMENT – A SPECIAL CASE OF THE COLUMN POINTER (TELLS SAS TO HOLD WHILE CHECKING
THE OBSERVATION. THE “THEN” IS IMPLIED. IF THE STATEMENT IS TRUE, THEN THE OBSERVATION IS OUTPUT INTO
THE DATA SET
*/
data qol; *Make sure to look at the log;
infile “&pathname.TwoQOL.csv” DSD firstobs=2;
input patno ef indef @;
if indef = 1;
input treat perf server;
run;
*What happens if we forget the @ sign?;
data qol1; *Make sure to look at the log;
infile “&pathname.TwoQOL.csv” DSD firstobs=2;
input patno ef indef;
if indef = 1;
input treat perf server;
run;
/*TOPIC: USING THE DELETE OPTION */
data qol2; *Make sure to look at the log;
infile “&pathname.TwoQOL.csv” DSD firstobs=2;
input patno ef indef @;
if indef = 0 then delete;
input treat perf server;
run;
title ‘QOL data set’;
proc print data=qol2 (obs=20); *Only print the first 20 observations instead of the entire data set;
run;
/*TOPIC: Using a permanent data set in SAS */
* There is a birthday data (birthday.sas7bdat) set located on the course website. This is a SAS permanent data set.
Use the birthday data set to calculate the age of the individuals in the
data set (using today’s date) and calculate the mean, median, standard deviation of the
birthday of individuals in the data set.;
libname one “&pathname”;
data birth; set one.birthday; *Note that years in SAS are 365.25 days and months are 30.5 days!;
age=(today()-birthday)/365.25; *Not reproducible on another day;
age1=(mdy(9,10,2019)-birthday)/365.25;
run;
title “Mean Age”;
proc univariate data=birth;
var age1;
run;