CS计算机代考程序代写 data science finance ER Excel Chapter 01

Chapter 01

Advanced Data analysis and
Reporting

1Page© Department of Statistics STATS 301

Page© Department of Statistics STATS 301

Learning outcomes

2

• Creating reports:

• Using PROC MEANS / PROC SUMMARY

• Using PROC FREQ

• Using PROC TABULATE

• More about FORMAT and INFORMAT.

• Revisiting some concepts

• More on Array, i.e. working with multiple variables in a systemic way

• ARRAY

• Temporary ARRAY

• Hash tables

Page© Department of Statistics STATS 301

Summarising data

3

• DATA STEPS, PROC PRINT, PROC MEANS, PROC SUMMARY, PROC FREQ, and
PROC TABULATE can be used to summarise data and they are essential to create
reports.

• As a statistician/data scientist you will been frequently using these tools for
different purposes.

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

4

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

5

• “NZ_Smoking.xlsx” file contains the statistics for the smoking behaviour of NZ
people.

• The data are based on 2006 and 2013 Censuses.

• The data are across age group, area, ethnic and gender.

• Based on the data, we can answer many questions by summarising data.

• What is the distribution of smokers across different age groups?

• Is there any difference between males and females?

• Is there any difference between different areas?

• Is there any difference between different areas, and across different sexes?

• …

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

6

• First we read data into SAS.

• We use PROC IMPORT

• Use the SHEET statement to import a specific sheet.

• Since data are saved in an XLSX file we can use LIBNAME to have a better look

at the excel file.

/* importing the smoking behaviour data */
filename nz_smk ‘~/STATS301/Chapter01/NZ_Smoking.xlsx’;
Proc import datafile=nz_smk out=chap01.nz_smoking dbms=xlsx
replace;

getnames=yes;
run;

/* we can import specific sheet */
proc import datafile=nz_smk out=age_group dbms=xlsx replace;

getnames=yes;
sheet=’age_group’;

run;

/* For XLSX file we have a better way! */
/* using libname to see other sheets */
libname myxl xlsx ‘~/STATS301/Chapter01/NZ_Smoking.xlsx’;

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

7

• You can use PROC MEANS to create some tabular output.

proc means data=chap01.nz_smoking sum;
where year=2006;
class smoking_behaviour;
var value;

run;

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

8

/* across years */
proc means data=chap01.nz_smoking sum;

class smoking_behaviour year;
var value;

run;

Page© Department of Statistics STATS 301

Exercise-01

9

/* across years */
proc means data=chap01.nz_smoking sum;

class smoking_behaviour year;
var value;

run;

• What we should change if we are interested in using PROC SUMMARY?

Page© Department of Statistics STATS 301

PROC FREQ

10

• Recall:

• The FREQ procedure produces one-way to n-way frequency and contingency

(crosstabulation) tables.

• As you may know PROC FREQ can be also used to do more complicated
statistical analysis.

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

11

/* since value contains the frequency of each group, we must use
WEIGHT statement */

proc freq data=chap01.nz_smoking;
where year=2006;
table smoking_behaviour;
weight value;

run;

proc freq data=chap01.nz_smoking;

table smoking_behaviour*year;
weight value;

run;

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

12

Cells Guideline

Why the var names
are like this?!

Page© Department of Statistics STATS 301

Renaming and Labelling variables

13

• Recall:

• The RENAME statement, as you may recall, can be used to rename variables.

• The LABEL statement can be used to specify a descriptive label for variable

names.

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

14

data chap01.nz_smoking_labeled;
set chap01.nz_smoking;
label age_group=’Age Group’

area_code=’Area Code’
ethnic_group=’Ethnic Group’
smoking_behaviour=’Cigarette Smoking Behaviour’;

rename age_group=agegrp
area_code=area
ethnic_group=ethnic
smoking_behaviour=smoke
value=freqency;

run;

• Note that we first label the variable name based on old name and then rename the
variable names. The new labels transfer to the renamed variables.

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

15

• Now we use the new data set with new variable names.

• We can drop the calculation of total percentage.

proc freq data=chap01.nz_smoking_labeled;
table smoke*year;
weight frequency;

run;

/* total percentage may not be useful, thus
we remove it from being displayed */
proc freq data=chap01.nz_smoking_labeled;

table smoke*year/nopercent;
weight frequency;

run;

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

16

• Now we like to create more customised output.

• And create print-ready report.

• PROC TABULATE can be used to generate such a customised report.

• If you are interested in even more customisation, have a look at PROC REPORT.

Page© Department of Statistics STATS 301

PROC TABULATE

17

• The TABULATE procedure displays descriptive statistics in tabular format, using
some or all of the variables in a data set. You can create a variety of tables ranging
from simple to highly customised.

proc tabulate data=chap01.nz_smoking_labeled;
class year smoke;
table smoke,year;
freq frequency;

run;

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

18

proc tabulate data=chap01.nz_smoking_labeled;
class year smoke agegrp;
table smoke,agegrp*year;
freq frequency;

run;

proc tabulate data=chap01.nz_smoking_labeled;

class year smoke agegrp;
table smoke*agegrp,year;
freq frequency;

run;

proc tabulate data=chap01.nz_smoking_labeled;

class year smoke agegrp gender;
table smoke,year*(agegrp gender);
freq frequency;

run;

• Generating more complicated reports.

Page© Department of Statistics STATS 301

Exercise-02

19

proc tabulate data=chap01.nz_smoking_labeled;
class year smoke agegrp gender;
table smoke,(agegrp gender)*year;
freq frequency;

run;

• How the output would change if we change the order of “year” and “(agegrp
gender)” in the TABLE statement?

New order

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

20

• Still there are few things we like to change:

• The numbers inside the tabular report are too large, we would like to format

them.

• We can define how to format values in cells. Use the asterisk (*) operator to
associate a format modifier with the element (an analysis variable or a
statistic) that produces the cells that you want to format. Format modifiers
have the form F=

• We need to record the values of the variables. For example 0=“Female” and
1=“Male”, etc.

• We can use PROC FORMAT.

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

21

/* changing the format of numbers inside the tabluar report */
proc tabulate data=chap01.nz_smoking_labeled;

class year smoke agegrp gender;
table smoke,year*f=comma11.*(agegrp gender);
freq frequency;

run;

proc format ;

value gender 0=’Female’ 1=’Male’;
run;
proc tabulate data=chap01.nz_smoking_labeled;

format gender gender.;
class year smoke gender;
table smoke,year*f=comma11.*gender;
freq frequency;

run;

Page© Department of Statistics STATS 301

PROC FORMAT

22

• We have the format information of all variables in different sheets in the excel file.

• We should use this to create all formats.

• The PROC FORMAT’s CNTLIN option let user to use a SAS data set to create
formats.

• CNTLIN: specifies a SAS data set from which PROC FORMAT builds
informats or formats.

• Data set that is used to create formats and informats must have the
following variable names:

• FMTNAME, START, LABEL.

• TYPE variable can be used to specify the type of data

• C : specifies a character format.

• I : specifies a numeric informat.

• J : specifies a character informat.

• N : specifies a numeric format.

Page© Department of Statistics STATS 301

PROC FORMAT

23

• What is the difference between FORMAT and INFORMAT?

Page© Department of Statistics STATS 301

PROC FORMAT

24

data formatdata;
input FMTNAME $ START LABEL $;
cards;

gender 0 Female
gender 1 Male
;
run;
proc format cntlin=formatdata;
run;
data example;

input sex age;
format sex gender.;
cards;

0 12
1 14
0 15
1 15
0 13
1 13
;
run;
proc print data=example;
run;

Page© Department of Statistics STATS 301

Example-NZ Smoking Behaviour

25

data smoke_format;
length name $17. craw $51.;
set myxl.age_group myxl.area_code myxl.ethnic_group

myxl.gender myxl.smoking_behaviour;
fmtname=name ;
start=code;
label=craw;

run;
proc format cntlin=smoke_format;
run;
proc tabulate data=chap01.nz_smoking_labeled;

format gender gender. smoke smoking_behaviour.;
class year smoke gender;
table smoke,year*f=comma11.*gender;
freq frequency;

run;

• Note we use the LENGTH statement during smoke_format data step creation to be
sure that the long names are not truncated.

• Also note that we need to specify the LIBNAME for data sets (myxl.).

Page© Department of Statistics STATS 301

Example-01

26

• “death_reg.sas7bdat” data set contains the date of death for the people who died in
2018 in city A.

• At the beginning of 2018 the population of city A was 47,100 ( Female: 24,021, Male:
23,079). And at the beginning of 2019 the population of city A was 47943 (Female:
24,521, Male:23,422).

• In this example we want to calculate the monthly death ratio of the city.

Page© Department of Statistics STATS 301

Example-01

27

/* example 01 */
proc summary data=chap01.death_reg;

format d_death month.;
class d_death sex;
types d_death*sex;
output out=monthly_death;

run;

• We need to find the monthly population.

• We assume that the population increases linearly.

• We need to calculate the population for each month.

• Female population in 2018 is 24,021 and in 2019 is 24,521. Thus, 500
increments in population size is divide equally among different months.

• We need to repeat a task for multiple times.

Page© Department of Statistics STATS 301

Do Loops

28

• The DO loop executes statements between DO and END repetitively based on the
value of an index variable.

• The syntax is:

• DO index-variable= start TO end ;

• …more SAS statements…

• END;

Page© Department of Statistics STATS 301

Do Loops

29

• Who knows how we can break a LOOP in SAS?

Page© Department of Statistics STATS 301

Do Loops

30

• A simple example:

• Sally ran 30 miles per week. She plans to increase her mileage by 4% each week for

six weeks.

data training;
/* why retain? is it needed? */
retain miles 30;
do week=1 to 6;

miles=miles+miles*0.04;
end;

run;

Page© Department of Statistics STATS 301

Exercise-03

31

• Look at the data which we created in the previous slide.

• Why has the data set only one observation?

• How can we save all mileage for all weeks?

Page© Department of Statistics STATS 301

Example-01

32

data monthly_pop;

start_male=23079;
end_male=23422;
start_female=24021;
end_female=24521;

diff_male=end_male-start_male;
diff_female=end_female-start_female;

do month=1 to 12 by 1;

sex=’Female’;
pop_size=start_female+(month-1)*diff_female/12;
output;
sex=’Male’;
pop_size=start_male+(month-1)*diff_male/12;
output;

end;
keep sex month pop_size;

run;

Page© Department of Statistics STATS 301

Example-01

33

/* why in the following code we don’t need BY statement? */
data combined;

merge monthly_pop monthly_death;
ratio=_freq_/pop_size;

run;

• An issue with our approach. The “month” variable in the monthly_pop data set is a
number rather than date.

• In the next slide we use the INTNX( ) function in SAS.

• INTNX( ): Increments a date, time, or datetime value by a given time interval,

and returns a date, time, or datetime value.

Page© Department of Statistics STATS 301

Example-01

34

data monthly_pop;
start_male=23079;
end_male=23422;
start_female=24021;
end_female=24521;

diff_male=end_male-start_male;
diff_female=end_female-start_female;
/* to write a date we should put a d at the end */
month=’01Jan2018’d;
do i=1 to 12 by 1;

sex=’Female’;
pop_size=start_female+(i-1)*diff_female/12;
output;
sex=’Male’;
pop_size=start_male+(i-1)*diff_male/12;
output;
month=intnx(‘month’,month,1);

end;
keep sex month pop_size;
format month date11.;

run;

Page© Department of Statistics STATS 301

Example-01

35

data combined;
merge monthly_pop monthly_death;
ratio=_freq_/pop_size;

run;

/* Using BY variable*/

data combined;

merge monthly_pop monthly_death(rename=(d_death=month));
format month month.;
by month sex groupformat;
ratio=_freq_/pop_size;

run;

• GROUPFORMAT uses the formatted values, instead of the internal values, of the BY
variables to determine where BY groups begin and end. Although the
GROUPFORMAT option can appear anywhere in the BY statement, the option
applies to all variables in the BY statement.

Page© Department of Statistics STATS 301

Managing multiple variables – ARRAY

36

• Recall:

• An ARRAY is a grouping of SAS variables that are arranged in a particular order

and identified by an array-name. The array exists only for the duration of the
current DATA step. The array-name distinguishes it from any other arrays in the
same DATA step; it is not a variable.

• ARRAY statement syntax in DATA STEP is:

• ARRAY array-name {number-of-elements} <$> <(initial-value-list)>;

Page© Department of Statistics STATS 301

One dimensional array

37

Hotel

Phone

Pers. Auto

Rental Car

Airfare

Dues

Registration
Fees
Other

Tips (non-meal)

Meals

1

hotel1

phone1

peraut1

carrnt1

airlin1

dues1

regfee1

other1

tips1

meals1

2

hotel2

phone2

peraut2

carrnt2

airlin2

dues2

regfee2

other2

tips2

meals2

3

hotel3

phone3

peraut3

carrnt3

airlin3

dues3

regfee3

other3

tips3

meals3

4

hotel4

phone4

peraut4

carrnt4

airlin4

dues4

regfee4

other4

tips4

meals4

5

hotel5

phone5

peraut5

carrnt5

airlin5

dues5

regfee5

other5

tips5

meals5

6

hotel6

phone6

peraut6

carrnt6

airlin6

dues6

regfee6

other6

tips6

meals6

7

hotel7

phone7

peraut7

carrnt7

airlin7

dues7

regfee7

other7

tips7

meals7

8

hotel8

phone8

peraut8

carrnt8

airlin8

dues8

regfee8

other8

tips8

meals8

1

2

3

4

5

6

7

8

9

10

First
Dimension

Expense
Categories

Second
Dimension

Days of the Week Total

Variables

1

misc1

Arrays

MISC

2

misc2

3

misc3

4

misc4

5

misc5

6

misc6

7

misc7

8

misc8

1

mday1MDAY

2

mday2

3

mday3

4

mday4

5

mday5

6

mday6

7

mday7

Page© Department of Statistics STATS 301

Two dimensional array

38

Hotel

Phone

Pers. Auto

Rental Car

Airfare

Dues

Registration
Fees
Other

Tips (non-meal)

Meals

1

hotel1

phone1

peraut1

carrnt1

airlin1

dues1

regfee1

other1

tips1

meals1

2

hotel2

phone2

peraut2

carrnt2

airlin2

dues2

regfee2

other2

tips2

meals2

3

hotel3

phone3

peraut3

carrnt3

airlin3

dues3

regfee3

other3

tips3

meals3

4

hotel4

phone4

peraut4

carrnt4

airlin4

dues4

regfee4

other4

tips4

meals4

5

hotel5

phone5

peraut5

carrnt5

airlin5

dues5

regfee5

other5

tips5

meals5

6

hotel6

phone6

peraut6

carrnt6

airlin6

dues6

regfee6

other6

tips6

meals6

7

hotel7

phone7

peraut7

carrnt7

airlin7

dues7

regfee7

other7

tips7

meals7

8

hotel8

phone8

peraut8

carrnt8

airlin8

dues8

regfee8

other8

tips8

meals8

1

2

3

4

5

6

7

8

9

10

First
Dimension

Expense
Categories

Second
Dimension

Days of the Week Total

Variables

1

misc1

Arrays

MISC

2

misc2

3

misc3

4

misc4

5

misc5

6

misc6

7

misc7

8

misc8

1

mday1MDAY

2

mday2

3

mday3

4

mday4

5

mday5

6

mday6

7

mday7

Page© Department of Statistics STATS 301

Recall an Example: Example-02

39

• Suppose we have the following data set, and we like to calculate the average score
of each person.

• Note that there are missing data.

• In our example any missing value should be replaced by 0.

data simple;
input name $ score1 score2 score3 score4 score5 score6;

cards;
name1 50 65 98 55 59 70
name2 89 95 99 99 100 100
name3 88 85 90 85 85 99
name4 44 54 60 60 . 55
;
run;

Page© Department of Statistics STATS 301

Recall an Example: Example-02

40

data simple;
set simple;
totalscore=0;
if score1=. then score1=0;
totalscore=totalscore+score1;
if score2=. then score2=0;
totalscore=totalscore+score2;
if score3=. then score3=0;
totalscore=totalscore+score3;
if score4=. then score4=0;
totalscore=totalscore+score4;
if score5=. then score5=0;
totalscore=totalscore+score5;
if score6=. then score6=0;
totalscore=totalscore+score6;
totalscore=totalscore/6;

run;

Page© Department of Statistics STATS 301

Recall an Example: Example-02

41

data simple;
set simple;

/* definition of an array */
array scores{6} score1-score6;
/* for each observation total is set to 0 and

then each score is added to total. When all
scores are added to total for a given observation
then the total is computed by total/6 , i.e. average

*/
total=0;
do i=1 to 6;

if scores{i}=. then scores{i}=0;
total=total+scores{i};

end;
total=total/6;

run;

Page© Department of Statistics STATS 301

NZ Income Example

42

Page© Department of Statistics STATS 301

Example-NZ Income

43

• nz_income.xlsx contains the income information for different gender in New Zealand
across different areas. In this example we want to read this data and find the
average income for each area in NZ and report the results.

• We will use PROC IMPORT to import this data set into SAS.

• In the following code DBMS indicates the type of the file that is going to be imported

into SAS:

• DBMS=XLSX is for Excel files (.xlsx extension)

• DBMS=XLS is for Excel files (.xls extension)

• DBMS=SAV is for SPSS files.

• DBMS=CSV is for csv files…

/* Importing an excel file */
filename nz_inc ‘~/STATS301/Chapter01/nz_income.xlsx’;

proc import datafile=nz_inc out=chap01.nz_income dbms=xlsx
replace;

getnames=yes; /* GETNAMES=yes means using the first row of
data for variable names*/
run;

Page© Department of Statistics STATS 301

Example-NZ Income

44

• By looking at the imported file, we observe the following issues:

• “area” variable has empty (missing) values.

• There is a variable called “C” which is empty for all observations.

• “income” is a character variable and is not suitable for arithmetic calculation.

Page© Department of Statistics STATS 301

Example-NZ Income

45

• We start with the simplest issue, deleting the empty column “C”

• We need to modify a SAS data set, since we already created it.

• To modify a data set, one way is to create a new data set based on the
current data set and during the creation of the new data set we apply the
changes. (since our data set is small that doesn’t hurt the performance)

• To do this we can use the SET statement inside DATA STEP.

• SET tells SAS to set a data set, and SAS will open the data set and read

all observations from that data set one by one (we can imagine that SET
implicitly creates a loop to go through all observations in a specific data
set).

• We need to tell SAS to “drop” a specific variable, i.e. “C”

Page© Department of Statistics STATS 301

Example-NZ Income

46

data chap01.nz_income;
drop c;
set chap01.nz_income;

run;

1. Recall:

1. First we tell SAS we want to create a new data set, however, we use the same

name as our current data set, thus, it looks we are modifying the data set!

2. DROP tells SAS to drop a specific variable (in this case “C”)

3. SET sets the data set we plan to modify (recall that, SAS implicitly create a loop

which goes through all observations in the data set and DATA STEP reads
those observations one by one and creates a new data set).

Page© Department of Statistics STATS 301

Example-NZ Income

47

• Now we fill the empty values of “area”

• To do so, we need a new variable, let’s call it area_fixed, which for the first

observation has the value of “area” and it keeps its value, unless the value of
“area” changes.

• IF statement can be used to check a condition (e.g. in this example the
condition is “the area is not blank”, when the condition satisfied area_fixed
update its value).

• We need to introduce a variable which “retains” its value during the implicit
loop that SET creates.

• We then find the middle of range of income as a numeric value which we can find
the average of income for each region.

• To do so, we use IF … THEN syntax, however, later you will learn better ways to
do this task.

Page© Department of Statistics STATS 301

Example-NZ Income

48

data chap01.nz_income;
set chap01.nz_income;
retain area_fixed;
if area ne ” then area_fixed=area;

run;
data chap01.nz_income;

set chap01.nz_income;
if income =’Zero income’ then numeric_income=0;
else if income =’$1-$5,000′ then numeric_income=2500;
else if income =’$5,001-$10,000′ then numeric_income=7500;
else if income =’$10,001-$15,000′ then numeric_income=12500;
else if income =’$15,001-$20,000′ then numeric_income=17500;
else if income =’$20,001-$25,000′ then numeric_income=22500;
else if income =’$25,001-$30,000′ then numeric_income=27500;
else if income =’$30,001-$35,000′ then numeric_income=32500;
else if income =’$35,001-$40,000′ then numeric_income=37500;
else if income =’$40,001-$50,000′ then numeric_income=45000;
else if income =’$50,001-$60,000′ then numeric_income=55000;
else if income =’$60,001-$70,000′ then numeric_income=65000;
else if income =’$70,001-$100,000′ then numeric_income=85000;
else if income =’$100,001-$150,000′ then

numeric_income=1250000;
else if income =’$150,001 or more’ then numeric_income=175000;

run;

Page© Department of Statistics STATS 301

Example-NZ Income

49

• We now calculate the average income for male and female across different areas.

• We like to find the mean of numeric_income across different areas.

• “area_fixed” is our category or “CLASS” variable, i.e. the variable that

classifies the calculation.

• Recall from STATS101 that for a tabulated data (i.e. when you have frequency)
you need to take into account the frequency to compute the mean.

• PROC MEANS can help us in this case.

• The CLASS statement assigns the class variables.

• The VAR statement assigns the variables that the calculation should be done

on them.

• The FREQ statement tells SAS which column of data contains the frequency of
observations.

Page© Department of Statistics STATS 301

Example-NZ Income

50

proc means data=chap01.nz_income;
class area_fixed;
var numeric_income;
freq male;

run;

proc means data=chap01.nz_income;

class area_fixed;
var numeric_income;
freq female;

run;

Page© Department of Statistics STATS 301

Revisiting By Processing

51

Page© Department of Statistics STATS 301

Example-NZ income

52

• We like to work on this data a little more

• First Task:

• Is it possible to write one block of PROC (instead of calling PROC MEANS
two times) which calculates the average of salary for female and male?

• There are many reasons for doing this:

• There are some situations where we like to do the same analysis for

different segments (gender defines a segment here).

• It is easier to maintain one PROC block.

• Writing one PROC block is more efficient.

• SAS has a feature that makes this very simple, as long as there is a variable
(variables) which defines the segments in your data and the values for this
column (columns) are sorted.

• We use PROC SORT to sort data sets.

Page© Department of Statistics STATS 301

Example-NZ income

53

Page© Department of Statistics STATS 301

Example-NZ income

54

data chap01.t_nz_income;
set chap01.nz_income;
count=male;
sex=’M’;
output;
count=female;
sex=’F’;
output;

run;

proc sort data=chap01.t_nz_income;

by sex;
run;

Page© Department of Statistics STATS 301

Example-NZ income

55

• When data are in right shape we can use the SAS feature that mentioned previously
BY PROCESSING

• Look at the BY statement in PROC MEANS.

• Notice that the FREQ statement use the new variable COUNT.

proc means data=chap01.t_nz_income;
by sex;
class area_fixed;
var numeric_income;
freq count;

run;

Page© Department of Statistics STATS 301

Example-NZ income

56

• As you can see, we computed the average of income for male and female across
different areas.

• Some times we need to work on the output of an analysis further more.

• In PROC MEANS you may use the OUTPUT statement to create a data set which

contains the results of PROC MEANS.

• You may use NOPRINT options in PROC MEANS to even stop showing the results
(Since you may be only interested in the output)

• In the following code we specify the name of the output data set and also mention
that we are only interested in the mean calculation.

/* when you have option NOPRINT, SAS drops all output printing */
proc means data=chap01.t_nz_income noprint;

by sex;
class area_fixed;
var numeric_income;
freq count;
output out=chap01.nz_income_summary mean=mean;

run;

Page© Department of Statistics STATS 301

Understanding BY groups

57

• The following figure represents the results of using a single BY variable, zipCode, in
a DATA step. The input data set, zip contains street names, cities, states, and ZIP
codes. The groups are created by specifying the variable zipCode in the BY
statement. The DATA step arranges the zipcodes that have the same values into
groups.

Page© Department of Statistics STATS 301

Understanding BY groups

58

• The figure shows three BY groups. The data set is shown with the BY variables
State and City printed on the left for easy reading. The position of the BY variables
in the observations does not affect how the values are grouped and ordered.

Page© Department of Statistics STATS 301

Example-NZ income

59

• Now we want to calculate the ratio of male and female incomes for each area.

• Ratio can be computed easily by ratio=mean_male/mean_female for each region.

• But for each region male and female are in two separate observations.

• Thus we should put them into one observation.

• It is opposite to what we have done

• We are basically reshaping data sets.

• Beside data step that we used in previous slides, recall that we can use PROC

TRANSPOSE to do reshaping.

Page© Department of Statistics STATS 301

Revisiting PROC TRANSPOSE

60

Page© Department of Statistics STATS 301

PROC TRANSPOSE

61

• Look the following example:

• PROC TRANSPOSE transposes only the numeric variables, Test1, Test2, and

Final, because no VAR statement appears and none of the numeric variables
appear in another statement. OUT= puts the result of the transposition in the
data set SCORE_TRANSPOSED.

data score;
input Student $9. +1 StudentID $ Section $ Test1 Test2 Final;
datalines;
Capalleti 0545 1 94 91 87
Dubose 1252 2 51 65 91
Engles 1167 1 95 97 97
Grant 1230 2 63 75 80
Krupski 2527 2 80 76 71
Lundsford 4860 1 92 40 86
McBane 0674 1 75 78 72
;
proc transpose data=score out=score_transposed;
run;
proc print data=score_transposed;
run;

Page© Department of Statistics STATS 301

PROC TRANSPOSE

62

data fishdata;
infile datalines missover;
input Location & $10. Date date7.
Length1 Weight1 Length2 Weight2 Length3 Weight3
Length4 Weight4;
format date date7.;
datalines;
Cole Pond 2JUN95 31 .25 32 .3 32 .25 33 .3
Cole Pond 3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond 4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake 2JUN95 32 .35 32 .25 33 .30
Eagle Lake 3JUL95 30 .20 36 .45
Eagle Lake 4AUG95 33 .30 33 .28 34 .42
;
proc transpose data=fishdata
out=fishlength;
var length1-length4;
by location date;
run;
proc print data=fishlength ;
run;

Page© Department of Statistics STATS 301

PROC TRANSPOSE

63

• Maybe it is a good idea to change the name of variable “col1” to “measurement”.

• We need to modify a data set

• RENAME statement can help us to do this

Page© Department of Statistics STATS 301

PROC TRANSPOSE

64

data fishlength;
set fishlength;
rename col1=Measurement;

run;

• We can do renaming while we are creating the data set!

/* Remember that renaming can be done while we are creating at
the first place
in this case OUT=FISHLENGTH in PROC TRANSPOSE */

proc transpose data=fishdata
out=fishlength(rename=(col1=Measurement));
var length1-length4;
by location date;
run;

Page© Department of Statistics STATS 301

Exercise-04

65

Type Month Sold Repaired Junked

sedan jan 26 41 4

sedan feb 28 48 2

sports jan 16 15 0

sports feb 19 20 1

trucks jan 29 20 3

trucks feb 35 22 4

• Considering the following data:

• What is the shape of data after using PROC TRANSPOSE with Type as BY

variable and Sold, Repaired, and Junked as transpose variables?

Page© Department of Statistics STATS 301

PROC TRANSPOSE

66

Type Month Sold Repaired Junked

sedan jan 26 41 4
sedan feb 28 48 2

Input data
set

sports jan 16 15 0
sports feb 19 20 1
trucks jan 29 20 3
trucks feb 35 22 4

Type _NAME_ COL1 COL2

sedan Sold 26 28
sedan Repaired 41 48
sedan Junked 4 2

Output data
set

sports Sold 16 19
sports Repaired 15 20
sports Junked 0 1
trucks Sold 29 35
trucks Repaired 20 22
trucks Junked 3 4

BY VAR VARS

Page© Department of Statistics STATS 301

Example-NZ income

67

/* since PROC TRANSPOSE use a BY statement we need to sort the
data set by the specific variable */
proc sort data=chap01.nz_income_summary;

by area_fixed;
run;
proc transpose data=chap01.nz_income_summary
out=chap01.t_nz_income_summary;

by area_fixed;
var mean;

run;
proc transpose data=chap01.nz_income_summary
out=chap01.t_nz_income_summary;

by area_fixed;
id sex ; /* the values of a variable can be used to name the

new variables */
var mean;

run;

data chap01.t_nz_income_summary;

set chap01.t_nz_income_summary;
ratio=m/f;

run;

Page© Department of Statistics STATS 301

Example-NZ income

68

• What if, we like to calculate the average of income for each region of NZ rather than
district (area_fixed)

Page© Department of Statistics STATS 301

Revisiting Combining SAS data sets

69

Page© Department of Statistics STATS 301

Concatenating

70

• Concatenating combines two or more SAS data sets, one after the other, into a
single SAS data set. You concatenate data sets by using either the SET statement in
a DATA step or the APPEND procedure.

• The following figure shows the results of concatenating two SAS data sets, and the
DATA step that produces the results.

Data1 Data2 Combined
Year Year Year
2008 2010 2008
2009 2011 2009
2010 + 2012 = 2010
2011 2013 2011
2012 2014 2012

2010
2011
2012
2013
2014

data combined;
set data1 data2;

run;

Page© Department of Statistics STATS 301

Interleaving

71

• Interleaving combines individual, sorted SAS data sets into one sorted SAS data set.
For each observation, the following figure shows the value of the variable by which
the data sets are sorted. You interleave data sets using a SET statement along with
a BY statement.

• In the following example, the data sets are sorted by the variable Year.
Data1 Data2 Combined
Year Year Year
2008 2010 2008
2009 2011 2009
2010 + 2012 = 2010
2011 2013 2010
2012 2014 2011

2011
2012
2012
2013
2014

data combined;
set data1 data2;

by year;
run;

Page© Department of Statistics STATS 301

One-to-one merging

72

• Merging combines observations from two or more SAS data sets into a single
observation in a new data set.

• A one-to-one merge, shown in the following figure, combines observations based on
their position in the data sets. You use the MERGE statement for one-to-one
merging.

Data1 Data2 Combined
VarX VarY VarX VarY
X1 Y1 X1 Y1
X2 Y2 X2 Y2
X3 + Y3 = X3 Y3
X4 Y4 X4 Y4
X5 Y5 X5 Y5

data combined;
merge data1 data2;

run;

Page© Department of Statistics STATS 301

Match-merge

73

• A match-merge, shown in the following figure, combines observations based on the
values of one or more common variables. If you are performing a match-merge, use
the MERGE statement along with a BY statement.

• In the following example, two data sets are match-merged by the value of the
variable Year.

Data1 Data2 Combined
Year VarX Year VarY Year VarX VarY
2008 X1 2010 Y1 2008 X1
2009 X2 2011 Y2 2009 X2
2010 X3 + 2012 Y3 = 2010 X3 Y1
2011 X4 2013 Y4 2011 X4 Y2
2012 X5 2014 Y5 2012 X5 Y3

2013 Y4
2014 Y5

data combined;
merge data1 data2;

by year;
run;

Page© Department of Statistics STATS 301

Exercise-05

74

• The director of a small repertory theatre company, the Little Theatre, maintains
company records in two SAS data sets, COMPANY and FINANCE.

data company;
input Name $ 1-25 Age 27-28 Gender $ 30;
datalines;
Vincent, Martina 34 F
Phillipon, Marie-Odile 28 F
Gunter, Thomas 27 M
Harbinger, Nicholas 36 M
Benito, Gisela 32 F
Rudelich, Herbert 39 M
Sirignano, Emily 12 F
Morrison, Michael 32 M
;
run;

data finance;
input IdNumber $ 1-11 Name $ 13-37 Salary;
datalines;
074-53-9892 Vincent, Martina 35000
776-84-5391 Phillipon, Marie-Odile 29750
929-75-0218 Gunter, Thomas 27500
446-93-2122 Harbinger, Nicholas 33900
228-88-9649 Benito, Gisela 28000
029-46-9261 Rudelich, Herbert 35000
442-21-8075 Sirignano, Emily 5000
;
run;

• To avoid having to maintain two separate data sets, the director wants to merge the
records for each player from both data sets into a new data set that contains all of
the variables.

• Write SAS codes to do the task of combining these two data sets.

Page© Department of Statistics STATS 301

Example-NZ income

75

• Library MAPSGFK (automatically assigned) contains geographical information of all
countries around the world.

• Look at NZ_ATTR data set inside this library.

• We can use this data set to attached the region to our data set.

• Look at the following code for match-merging

• Why we are working on CHAP01.T_NZ_INCOME rather than

CHAP01.T_NZ_INCOME_SUMMARY (the last data set we created and contains
ratio)

proc sort data=chap01.t_nz_income;
by area_fixed;

run;

proc sort data=mapsgfk.nz_attr out=nz_attr;

by idname;
run;
data chap01.t_nz_income;

/* why we use rename here??? */
merge nz_attr chap01.t_nz_income(rename=(area_fixed=idname));
by idname ;

run;

Page© Department of Statistics STATS 301

Example-NZ income

76

proc means data=chap01.t_nz_income noprint;
by sex;
class id1name idname;
var numeric_income;
types id1name idname;
freq count;
output out=chap01.nz_income_summary mean=mean;

run;
proc sort data=chap01.nz_income_summary;

by id1name idname;
run;
proc transpose data=chap01.nz_income_summary
out=chap01.t_nz_income_summary;

by id1name idname;
id sex ;
var mean;

run;

data chap01.t_nz_income_summary;

set chap01.t_nz_income_summary;
ratio=m/f;

run;

ID1NAME is the variable for region and
idname is the new variable for area_fixed

Page© Department of Statistics STATS 301

TYPES statement in PROC MEANS

77

• Recall:

• The TYPES statement Identifies which of the possible combinations of class

variables to generate.

• OR specifies which of the combinations of class variables PROC MEANS
uses to create the types, where is the number of class variables.

2k
k

Page© Department of Statistics STATS 301

Example-03

78

/* Example of using TYPES in PROC MEANS */
data grade;
input Name $ 1-8 Gender $ 11 Status $13 Year $ 15-16
Section $ 18 Score 20-21 FinalGrade 23-24;
datalines;
Abbott F 2 97 A 90 87
Branford M 1 98 A 92 97
Crandell M 2 98 B 81 71
Dennison M 1 97 A 85 72
Edgar F 1 98 B 89 80
Faust M 1 97 B 78 73
Greeley F 2 97 A 82 91
Hart F 1 98 B 84 80
Isley M 2 97 A 88 86
Jasper M 1 97 B 91 93
;
proc means data=grade;
var Score;
class Status Year;
types () status*year;
run;

• Use ( ) to request the overall total

Page© Department of Statistics STATS 301

Match-merge by Hash Table

79

• What is a Hash Table

• An in-memory lookup table accessible from the DATA step. A hash object is

loaded with records and is only available from the DATA step that creates it. A
hash record consists of two parts: a key part and a data part. The key part
consists of one or more character and numeric values. The data part consists of
zero or more character and numeric values.

• Because the hash object entries are held in memory and they have a special
structure, finding the data value that corresponds to a given key happens
amazingly fast!

Page© Department of Statistics STATS 301

Match-merge by Hash Table

80

• Suppose we have the following match-merge task

Main data Lookup
Year VarX Year VarY
2008 X1 2010 Y1
2009 X2 2011 Y2
2010 X3 2012 Y3
2011 X4 2013 Y4
2012 X5 2014 Y5
2013 X6
2009 X7
2010 X8
2011 X9
2012 X10
2018 X11

HASH
TABLE

Merge

Page© Department of Statistics STATS 301

Match-merge by Hash Table

81

• Defining a HASH TABLE based on an existing table.

• To tell SAS that you want to create a hash table, you use the DCL HASH

statement in DATA STEP, and use a name for the hash table (like h):

• The existing data set that you like to be the source of the hash table can be
introduced to the hash table by using DATASET: option inside ( ) right after the
name of the hash table (in the following example we like to use the SASDATASET
data set as the source for the hash table)

• dcl hash h(dataset:’SASDATASET’);

• Next you need to tell SAS which variables in the existing data set should be used

as the lookup keys

• rc=h.definekey(‘VARNAMES’);

• Next you need to tell SAS which variables should be used as the variables you

like to attach to the master data set.

• rc=h.definedata(‘VARNAMES’);

Page© Department of Statistics STATS 301

Example-NZ income

82

• We do the match-merging of the NZ income data by Hash Table

data chap01.nz_income_hash;
/* what is the purpose of DROP in this code */
drop area income male female;
set chap01.nz_income;
count=male;
sex=’M’;
output;
count=female;
sex=’F’;
output;

run;

Page© Department of Statistics STATS 301

Example-NZ income

83

• We do the match-merging of the NZ income data by Hash Table

/* attach the region from mapsgfk.nz_attr directly
no sorting is needed */
data chap01.nz_income_hash;

set chap01.nz_income_hash;
length id id1 $15 idname id1name $55;
if _n_=1 then do;

dcl hash h(dataset:’mapsgfk.nz_attr’);
rc=h.definekey(‘idname’);
rc=h.definedata(‘idname’,’id1name’,’id’,’id1′);
rc=h.definedone();

end;
idname=area_fixed;
rc=h.find();
if rc ne 0 then put _all_;

run;

Page© Department of Statistics STATS 301

Putting the results on a map

841.4 1.6 1.8
ratio

Page© Department of Statistics STATS 301

Putting the results on a map

85

• There are two procedures in SAS for plotting data on a map, PROC GMAP, PROC
SGMAP.

• But, in this example we use the shape of areas and plot them as a polygon. The
shape of areas can be downloaded from internet (shape files). SAS (when
GRAPH component is licensed) has the shape data of many places in the world,
including New Zealand.

• Check MAPSGFK library to see what kind of data are available.

• We used the Longitude and Latitude to plot the map,

• However, in some situation you may need to project the map into X-Y plane. The
projection of maps can be done by PROC GPROJECT in SAS.

• The mapsgfk library in SAS contains both unprojected and projected data.

Page© Department of Statistics STATS 301

Putting the results on a map

86

/* drawing a map in SAS */

data ratio_region;

set mapsgfk.nz_attr;
ratio=.;
if _n_=1 then do;

dcl hash h(dataset:’chap01.t_nz_income_summary’);
rc=h.definekey(‘idname’);
rc=h.definedata(‘ratio’);
rc=h.definedone();

end;
rc=h.find();

run;

Page© Department of Statistics STATS 301

Putting the results on a map

87

data ourmapdata;
set mapsgfk.nz;
/* remote places */
where id ne ‘NZ-067′ ;
by id segment;
if first.segment then polyid+1;
ratio=.;
if _n_=1 then do;

dcl hash h(dataset:’ratio_region’);
rc=h.definekey(‘id’);
rc=h.definedata(‘ratio’);
rc=h.definedone();

end;
rc=h.find();

run;
proc sgplot data=ourmapdata aspect=1.426521 ;

polygon x=long y=lat id=polyid/ fill colorresponse=ratio
outline lineattrs=(color=black);
xaxis display=none;

yaxis display=none;
gradlegend/ position=bottom;
run;

Page© Department of Statistics STATS 301

Example-Europe BMI data

88

Page© Department of Statistics STATS 301

Example-Europe BMI data

89

• In this example we access to two sets of data:

• “hlth_europe.sas7bdat” is a SAS data set which contains the percentage of

“overweight”, “underweight”, and “normal” group (based on BMI) in some
European countries in 2014. The data are available across sex and age group.

• “pop_europe.txt” file is CSV file and contains the population information for
some European countries in 2014.

• What is the over all distribution of BMI in all countries in BMI data?

Page© Department of Statistics STATS 301

Example-Europe BMI data

90

• What we need to do is to calculate the average of percentage across all countries.

• However, since, the population size of each country is different from others, to

calculate the average we can not simply take the mean of all percentages.

• We need to take into account the population size.

• We first transpose the “HLTH_EUROPE” data set in such a way that we have

two variables instead of 31 variables with the name of countries.

• One variable for the name of the country.

• One variable for the value of the percentage.

• Then, we can merge the population data to the transposed “HLTH_EUROPE”
and then we can use PROC SUMMARY to calculated the average of values
across the countries while taking into account the population size of each
country (using FREQ statement in PROC SUMMARY).

Page© Department of Statistics STATS 301

Example-Europe BMI data

91

• You may recall from the “NZ_INCOME” example that the following code can be used
to transpose BMI data.

• Look at the following code

• However, there are 31 countries and following this approach may not be very

elegant.

data temp;
set chap01.hlth_europe;
length country $15.;
country=’Belgium’;
value=belgium;
output;
country=’Bulgaria’;
value=Bulgaria;
output;

run;

Page© Department of Statistics STATS 301

Example-Europe BMI data

92

• ARRAY can be used to simplify the code.

• Some remarks:

• When we don’t know the dimension in advance * can be used in the definition of
an array.

• Belgium – – Turkey is a shortcut for all variables in our data set which are
between variables “belgium” and “turkey”, including both “belgium” and
“turkey”. (in our data set the column definitions are in such a way that the
variables of country names are between variables “belgium” and “turkey”)

• The DIM( ) function can be used to query the dimension of an array.

• The VNAME( ) function can be used to query the actual name of a variable in

ARRAY.
data chap01.t_hlth_europe;

set chap01.hlth_europe;
array cont{*} Belgium–turkey;
length country $15;
do i=1 to dim(cont);

value=cont{i};
country=vname(cont{i});
output;

end;
keep age sex bmi country value;

run;

Page© Department of Statistics STATS 301

Example-Europe BMI data

93

• Now we can read the population data into SAS and merge it to the BMI data

filename eu_pop ‘~/STATS301/Chapter01/pop_europe.txt’;
proc import datafile=eu_pop out=chap01.pop_europe dbms=csv
replace;
getnames=yes;
/* since our data is small we tell SAS to scan all
data to figure out the informats. */
guessingrows=max;
run;

Page© Department of Statistics STATS 301

Example-Europe BMI data

94

• Merging data.

/* merge population to BMI data */
proc sort data=chap01.t_hlth_europe;

by age sex country;
run;
proc sort data=chap01.pop_europe;

by age sex country;
run;
data chap01.BMI_europe;

merge chap01.t_hlth_europe chap01.pop_europe;
by age sex country;

run;

Page© Department of Statistics STATS 301

Example-Europe BMI data

95

• Now data are in a shape that any calculation can be easily done.

proc summary data=chap01.bmi_europe;
by age sex ;
class bmi;
var value;
freq population;
types bmi;
output out=chap01.BMI_overall mean=mean;

run;

Page© Department of Statistics STATS 301

Exercise-06

96

• We used the following code to transpose the bmi data set.

• Can you write SAS code that does the same task, but using PROC TRANSPOSE?

data chap01.t_hlth_europe;
set chap01.hlth_europe;
array cont{*} Belgium–turkey;
length country $15;
do i=1 to dim(cont);

value=cont{i};
country=vname(cont{i});
output;

end;
keep age sex bmi country value;

run;

Page© Department of Statistics STATS 301

Sales project example – Simulating data
values

97

Page© Department of Statistics STATS 301

Example-Sales Project

98

• A sales manager has a certain number of potential sales projects in the sales
pipeline that have the potential to be closed before year end. For financial planning,
the finance department requests regular updates on the expected sales amount per
division from the sales managers. (from Applying Data Science, G. Svolba – 2017)

• For each sales project, the sales manager knows the expected sales amount in
the offer that was made to the client.

• The sales manager estimated the subjective probability that the respective
project is sold before year end, based on his or her sales experience.

• Note that such probabilities are often also automatically calculated by sales
force automation tools based on criteria like:

• Does the customer have a budget for the project?

• Has a detailed offer for the project been presented to the customer?

• Has a verbal commitment been received from the customer?

• For some projects (like Project 10017), sales manager has the chance to try it in
the second half of the year if in the first half he or she was not successful,
however, these projects have different Deal Size and Probability in the second
round.

Page© Department of Statistics STATS 301

Example-Sales Project

99

ProjectID DealSize probability DZ2 prob2
10001 $1,500,000.00 0.1 . .
10002 $10,000.00 0.65 . .
10003 $500,000.00 0.2 . .
10004 $50,000.00 0.5 . .
10005 $100,000.00 0.4 . .
10006 $30,000.00 0.9 . .
10007 $10,000.00 0.6 . .
10008 $150,000.00 0.2 . .
10009 $200,000.00 0.25 . .
10010 $180,000.00 0.1 . .
10011 $900,000.00 0.1 . .
10012 $750,000.00 0.2 . .
10013 $600,000.00 0.1 . .
10014 $320,000.00 0.2 . .
10015 $100,000.00 0.4 . .
10016 $50,000.00 0.8 . .
10017 $2,000,000.00 0.05 $1,990,000.00 0.03
10018 $400,000.00 0.2 . .
10019 $2,500,000.00 0.1 . .
10020 $1,700,000.00 0.15 . .
10021 $100,000.00 0.8 . .

Page© Department of Statistics STATS 301

Example-Sales Project

100

• From upper management, the sales manager experiences pressure to bring in a
substantial sales amount before the end of the fiscal year. After the last meeting with
his or her manager, the sales manager was given the following information:

• If the sales manager manages to exceed $2,400,000 in total, he or she receives a
special bonus.

• If the sales manager manages to exceed $3,000,000 in total, he or she gets
promoted to the position of a regional director.

• If, however, the sales manager achieves less than $1,000,000, he or she will
most likely get fired.

• Otherwise, if the sales manager stays between $1,000,000 and $2,400,000, he or
she remains at his/her current position.

Page© Department of Statistics STATS 301

Example-Sales Project

101

• For a moment, ignore the second chance for projects and let assume that there is
no second chance.

• The sales manager remembers from a statistics class that he or she should
calculate a weighted sales amount. The sales manager achieves this by multiplying
the total sales amount with the probability of the respective sales project. This
considers the fact that some of the larger projects might have a lower probability to
be closed than some of the smaller projects.

proc means data=chap01.sales_project sum n mean;
run;

proc means data=chap01.sales_project sum;

var DealSize;
weight probability;

run;

Page© Department of Statistics STATS 301

Example-Sales Project

102

• The sales manager might now conclude that he or she is quite safe from being
fired, as the expected value of the sales pipeline exceeds the $1,000,000 threshold.

• With some effort and some luck to close one of the large deals, he or she might
even end up at a value larger than $2,400,000 and the sales manager can receive
the special bonus.

• The question arises whether the expected value of $1,661,500 shows the full
picture of possible outcomes. You have to bear in mind that only the average case
is considered here, and no information about the distribution is shown.

• The expected value calculated above, however, is only a point estimate. It does not
provide information on the distribution of the probability of other possible total
sales amounts:

• If the variability of the total sales amounts is low, you can expect that the sales
manager ends up with a value that is close to the expected value.

• A high variability indicates that higher and lower outcome values are also highly
possible.

Page© Department of Statistics STATS 301

Example-Sales Project

103

• To properly analyse these data, we must do a simulation study.

• It means that we should investigate what will happen if any of these projects

closed or not closed, by the end of the year.

• As an example one possible outcome can be

• SUCCESS: 10006, 10007, 10015, 10016, 10021

• FAIL: 10001,10002,10003,10004,10005,10008,10009

10010,10011,10012,10013,10014,10017,10018,10019,10020

• Thus the total deal size would be: $290,000.00

Page© Department of Statistics STATS 301

Example-Sales Project

104

data chap01.sim;
set chap01.sales_project;
do rep=1 to 100000;

selected=rand(‘bernoulli’, probability);
if selected=1 then

outcomevalue=dealsize;
else if prob2 ne . then

do;
selected=rand(‘bernoulli’, prob2);
outcomevalue=selected*dz2;

end;
else

outcomevalue=0;
output;

end;
run;

• The RAND( ) function can be used to generate a random number from a given
distribution.

• For example, RAND(‘BERNOULLI’,p) generates a random number from the
Bernoulli distribution with p as probability of success.

• Note how I used IF … THEN DO;….;END; block to have more statements in an IF
block.

Page© Department of Statistics STATS 301

Example-Sales Project

105

proc sort data=chap01.sim;
by rep;

run;

proc summary data=chap01.sim;

by rep;
var outcomevalue;
output out=chap01.sim_summary sum=sum_dealsize;

run;

Page© Department of Statistics STATS 301

Example-Sales Project

106

proc format;
value sales low – 1000000=’1. Fired’

1000001 – 2400000=’2. OK’
2400001 – 3000000=’3. Bouns’
3000001-high=’4. Promotion’;

run;

proc freq data=chap01.sim_summary;

format sum_dealsize sales.;
table sum_dealsize;

run;

• Using PROC FORMAT and PROC FREQ to summarise the findings

Page© Department of Statistics STATS 301

Example-Sales Project

107

proc sgplot data=chap01.sim_summary;
format sum_dealsize dollar12.;
histogram sum_dealsize;

run;

• Visualising the distribution of total deal size.

$0 $2,000,000 $4,000,000 $6,000,000 $8,000,000

Deal Size(Total)

0

2

4

6

8

P
er

ce
nt

F
ire

d

B
on

us

P
ro

m
ot

io
n

Page© Department of Statistics STATS 301

Not Examinable for STATS 301

Example-Sales Project

108

• In the previous slides we used DO loop to simulate the scenario that a particular
project would be closed by the end of the year.

• Since we simulate all scenarios in such a way that the replication was not sorted we
needed to use a PROC SORT to prepare data for BY processing.

• However,

• Our input data set (sales information) is a small data set.

• Instead of simulating scenarios and then summarising, we can simulate one

replicate for all projects in the data set and summarise it and only keep the
summary, and repeat this for multiple times to have a better understanding of the
distribution of sale.

• If we are going to do so, we need to repeatedly,

• Go through all observations in our input data and simulate one year

scenario.

• Calculate the total deal size for simulated scenario.

• Save the total deal size.

• In the next few slides we are going to learn about a special kind of ARRAY which
can be used to read all observations in a temporary location.

Page© Department of Statistics STATS 301

Not Examinable for STATS 301

TEMPORARY ARRAY

109

• The regular ARRAY which we learnt in previous slides are actually a shortcut to refer
to variables in a data set.

• One key feature of the regular ARRAY was that its elements values depends on
the variables’ values. Thus, for each observation they are empty, unless, their
corresponding variables have some values.

• HOWEVER, the elements of a temporary array are temporary in that they don’t need
to be related to a variable in data set, and they do not appear in any result table.
Temporary data element values are automatically retained across iterations rather
than being reset to missing at the beginning of the next iteration (next observation).
Temporary arrays exist only for the duration of the DATA STEP.

• SYNTAX:

• numeric: ARRAY arrayname{arraydim} _TEMPORARY_;

• characters: ARRAY arrayname{arraydim} $ _TEMPORARY_;

Page© Department of Statistics STATS 301

Not Examinable for STATS 301

Example-Sales Project

110

• In Sales Project example;

• We define a temporary array which holds all data values of the input data set.

• Since, we may not know the exact size of array in advanced, we define a

relatively large array which is big enough to contain all data values.

• We use SET with the POINT option to directly reach each row of data

• WHENEVER, you use SET and POINT together, put a STOP statement at the

end of your program.

• NOBS can be used to query the total number of observation in the data set that
we used in SET.

data _null_;
array vals[999,4] _temporary_;
do ind=1 to totalsize;

set chap01.sales_project point=ind nobs=totalsize;
vals[ind, 1]=probability;
vals[ind, 2]=dealsize;
vals[ind, 3]=prob2;
vals[ind, 4]=dz2;

end;
stop;

run;

Page© Department of Statistics STATS 301

Not Examinable for STATS 301

Example-Sales Project

111

data chap01.sim_array;
keep rep sum_dealsize;
array prob_val[999, 4] _temporary_;
do ind=1 to totalsize;

set chap01.sales_project point=ind nobs=totalsize;
prob_val[ind, 1]=probability;
prob_val[ind, 2]=dealsize;
prob_val[ind, 3]=prob2;
prob_val[ind, 4]=dz2;

end;
do rep=1 to 100000;

sum_dealsize=0;
do ind=1 to totalsize;

selected=rand(‘bernoulli’, prob_val[ind, 1]);
if selected=1 then

sum_dealsize+prob_val[ind, 2];
else if prob_val[ind, 3] ne . then do;

selected=rand(‘bernoulli’, prob_val[ind, 3]);
sum_dealsize+selected*prob_val[ind, 4];

end;
end;
output;

end;
stop;

run;

Page© Department of Statistics STATS 301

Summary

112

• PROC SUMMARY (MEANS) is a simple procedure to summarise the data across
different class variables.

• To create a customised tabular report we can use PROC TABULATE.

• We can use PROC FORMAT to create customised format based on SAS data sets.

• The DO statement can be used to repeat a task automatically.

• To work with multiple variables during a DATA STEP, we should use ARRAY.

• Hash table is an efficient method for merging data without sorting.

• We can use the SET statement with the POINT option to reach a specific

observation in data sets.