MET MA 603: SAS Programming and Applications
Summarizing Datasets
When working with large datasets, it is not always practical or possible to understand the information in the dataset at a glance. Calculating summary statistics can serve the following purposes:
Discover any data entry errors or anomalies contained within the dataset.
Understand the distribution of the variables in the dataset.
Perform a preliminary exploration of the dataset before conducting a more detailed analysis.
The Frequency Procedure
The Frequency Procedure produces statistics about the distributions of variables in a dataset.
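The optional Tables statement tells SAS which variables to include in the distribution analysis. Single variables and/or combinations of variables (joined with an asterisk, as in dogbreed * dogs) may be listed. If the Tables statement is omitted, SAS produces a one-way frequency table for every variable in the dataset.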
proc freq data = data1.occupancy ;
tables smokers residents dogbreed * dogs ;
run;
The Frequency Procedure (cont.)
Additional options in the Tables statement must follow a forward slash (/) symbol.
The LIST option prints two-way distributions in list form, with one row per combination of values, rather than as a two-way table.
NOPERCENT suppresses printing of percentages.
MISSING includes missing values in the distribution tables.
OUT=name creates a SAS dataset called name containing the distribution data, which can then be used in Data steps.
proc freq data = data1.occupancy ;
tables dogbreed * dogs / list nopercent missing out=occupancy_freq;
run;
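The OUT= option can be illustrated with a minimal sketch. The dataset work.pets and its values are invented for illustration and are not part of the course data. COUNT is the variable in which Proc Freq stores each frequency in the output dataset, and NOPRINT suppresses the printed table.

```sas
/* Hypothetical sample data, invented for illustration */
data work.pets;
    input dogbreed $ dogs;
    datalines;
Beagle 1
Beagle 1
Poodle 2
Poodle 1
Beagle 2
;
run;

/* Save the two-way counts to a dataset instead of printing them */
proc freq data = work.pets noprint;
    tables dogbreed * dogs / out = work.pets_freq;
run;

/* Use the counts in a Data step: keep combinations seen more than once */
data work.pets_common;
    set work.pets_freq;
    if count > 1;
run;

proc print data = work.pets_common;
run;
```

With this sample data, only the Beagle / 1 combination occurs more than once, so it is the only row kept by the Data step.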
Practice
Using the Golf.sas7bdat dataset, use Proc Freq to create an output that matches what is on the right.
The Means Procedure
The Means Procedure produces summary statistics about the variables in a dataset.
The default statistics included are the number of observations (n), Mean, Standard Deviation (stddev), and Min and Max values.
If specific statistics are listed in the Proc Means statement, only those statistics will be included in the analysis.
Optional statistics include the Median, Mode, Range (Max – Min), and Sum.
Maxdec=d can be included in the Proc Means statement to specify the number of decimal places to print.
proc means data = data1.occupancy mean median mode range ;
run;
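As a minimal sketch of listing specific statistics together with Maxdec=, the following uses a small made-up dataset (work.homes is a hypothetical name, and the values are invented). Because no other statements are given, every numeric variable in the dataset is analyzed.

```sas
/* Hypothetical sample data, invented for illustration */
data work.homes;
    input residents dogs;
    datalines;
2 1
4 0
3 2
2 1
5 3
;
run;

/* Only the listed statistics are printed, to one decimal place */
proc means data = work.homes maxdec=1 n mean median range sum;
run;
```

Since N, Mean, Median, Range, and Sum are listed explicitly, the default Standard Deviation, Min, and Max columns do not appear in the output.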
The Means Procedure (cont.)
The BY statement performs the analysis separately for each value of the listed variables and prints a separate table for each group. The dataset must be sorted by the BY variables.
The VAR statement restricts the analysis to the listed variables.
The CLASS statement performs the analysis separately for each value of the listed variables and prints the results in a single table. The data does not need to be sorted.
The BY and CLASS statements should not both be used in the same Means Procedure.
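The optional statement Output out=name creates a SAS dataset called name. When no statistics are requested in the Output statement, the dataset contains the default Means statistics (n, Min, Max, Mean, and Standard Deviation).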
proc means data = data1.occupancy maxdec=2 n min max sum;
output out=occupancy_stats;
By dogbreed ;
Var residents dogs ;
run;
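The difference between BY and CLASS can be sketched as follows. The dataset work.kennel and its values are hypothetical and invented for illustration. Proc Sort is used here to satisfy the sorting requirement of the BY statement.

```sas
/* Hypothetical sample data, invented for illustration */
data work.kennel;
    input dogbreed $ residents dogs;
    datalines;
Poodle 2 1
Beagle 4 2
Poodle 3 1
Beagle 2 1
;
run;

/* BY requires sorted data, so sort by the BY variable first */
proc sort data = work.kennel;
    by dogbreed;
run;

/* One table per breed */
proc means data = work.kennel maxdec=2 n min max sum;
    by dogbreed;
    var residents dogs;
run;

/* Same groups in a single table; CLASS would also work on unsorted data */
proc means data = work.kennel maxdec=2 n min max sum;
    class dogbreed;
    var residents dogs;
run;
```

Both steps compute the same statistics for each breed. The choice mainly affects how the results are laid out and whether a sort is needed beforehand.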
Practice
Using the Golf.sas7bdat dataset, use Proc Means to create an output that matches what is below.
Practice
Using the Scores1.sas7bdat dataset, use Proc Means to calculate the following statistics for the score of each school: average, standard deviation, minimum, and maximum.
Do not display any decimal places in the result.
Readings
Textbook sections 4.10, 4.12