MET MA 603: SAS Programming and Applications
MET MA 603:
SAS Programming and Applications
The Sort Procedure
1
1
Sorting Datasets
When a SAS dataset is created, the order of the observations is determined by when they are written to the output dataset (or if coming from an external file, the order follows that of the external file).
Sorting, or collating, refers to changing the order of the observations in a dataset.
Sorting is based on one or more variables in the dataset.
2
2
The Sort Procedure
The Sort Procedure orders a dataset according to one or more variables. The example below uses the input dataset “info”. The out= option is used to create an output dataset containing the sorted data (if out= is left out, the sorted dataset will overwrite the input dataset).
The BY statement is required. It indicates which variables to sort by. The default sort order is ascending (low to high). Use the key word descending to sort from high to low.
Basic Examples of Proc Sort:
Proc Sort data=info out=info_sorted ;
by num ;
run ;
Proc Sort data=info out=info_sorted ;
by descending num ;
run ;
3
3
Practice
The Accounts.sas7bdat dataset has bank account information. Sort the dataset by account number and LastName. Do not overwrite the original dataset!
Do the results look as you would expect?
4
4
Rules for Character and Numeric Data
The sort order for numeric data is intuitive, since numbers already have an established order on the real number line. Since Dates are represented by numbers, they follow the rules of numeric sorting.
Character data is more complicated, since it has to account for letters, digits, and symbols.
Digits come before letters, and capital letters come before lower-case letters.
By convention, missing data always comes first in an ascending sort (and last in a descending sort).
5
5
Sort Sequence
Sort Sequence is the set of rules used for sorting character data is. Different languages and computer systems follow different rules.
When SAS is run on Windows, the default sorting sequence for character data is based on the ASCII code.
6
6
Sort Sequence Options
The option SORTSEQ=LINGUISTIC changes the collation rules to be based on the language that was selected during SAS installation, rather than on ASCII code.
With SORTSEQ=LINGUISTIC, SAS will not differentiate between capital and lowercase letters when sorting.
Sometimes, we store digits as character data, but we want to sort them in numeric order (so that 11 comes before 100, for example). This is done using the additional modification numeric_collation=on.
Proc Sort data=txt_info out=txt_info_sorted
SORTSEQ=LINGUISTIC (numeric_collation=on);
by txt ;
run ;
7
7
Eliminating Duplicate Observations
Sometimes observations have the same values for one or more variables. The sort option NODUPKEY eliminates observations that have the same values for all of the variables listed in the BY statement.
Use the option DUPOUT=dataset to create a dataset which contains all observations that were eliminated due to being duplicates.
8
8
Practice
The Accounts.sas7bdat dataset has bank account information. Sort the dataset by account number, with the earliest account (the one with the smallest account number) coming first, and the newest account (the one with the biggest account number) coming last. Some accounts have more than one name on the account – eliminate one of the duplicate records and create a file that contains the duplicate records.
Next, sort the duplicate record dataset by LastName in alphabetical order (that is, instruct SAS to ignore case).
9
9
Readings
Textbook section 4.3, 4.4
10
10
1c1c2a1c
2c1d3a1d
2a2c1c2a
3a2a2c2c
3d3a3d3a
1d3d1d3d
Sorted by Number
then LetterUnsortedSorted by NumberSorted by Letter
/docProps/thumbnail.jpeg