Assignment 3

Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21

Assignment 3

1 of 5

Note: For this assignment, you again only need to submit your SAS program file. Use internal comments to
identify the question numbers and respond to questions. Please ensure your SAS program file is executable. Two

points will be rewarded for making your code as readable as possible by effective using indentation, white space,

and some (but not excessive) internal comments – we will discuss this in class.

1. Stet up

a) Start your program with the OPTIONS statement to set the following global options: 1) have the log
display a warning if a MERGE statement is run without a corresponding BY statement, 2) have SAS

provide a slightly more informative log that will note every time a variable is overwritten when merging

data sets, 3) allow SAS to use data sets that contain unrecognized formats by ignoring the formats rather

than the default action of issuing a critical error and not reading the data set.

b) Create a macro variable named path that will define the path containing every file (input or output) used
for this assignment. Have the macro variable include the final forward or backward slash in the path so

that when we mark your code whether we are running Windows or OnDemand (Unix) all we need to

change is the path macro variable and the rest of your code will run without editing.

2. Descriptive summary of an exercise study

The Excel workbook named ExerciseRCT.xlsx contains data from a randomized controlled exercise trial

where participants were randomly assigned in equal numbers to one of four exercise groups: Control, LALI

(low amount at low intensity) HALI (high amount at low intensity), and HAHI (high amount at high intensity).

Measurements including VO2max (a measure of cardio fitness) were taken at baseline and at weeks 4, 8, 16

and 24.

a) Create new variables containing the change in VO2max from baseline (week 0) to weeks 4, 8, 16 and 24.

b) Create a boxplots exactly like shown in the figure on the next page. Note the red dotted reference line, the
labels, the horizontal gridlines and the order of the groups. Also note the look of the boxes and the number

of observations (nobs) listed below each box. Hints: To make multiple boxplots in one panel you can use

the GROUP and/or CATEGORY options. The boxes produced by GROUP and CATEGORY have a

different appearance and different options to order the boxes and to display statistics below the boxes. You

may need to pre-sort the data and use an option of the ROWAXIS statement to tell SAS to present the

boxplots in the order of the data.

c) Use a SAS procedure to create a table summarizing the change scores overall and by sex as shown in the
table on the next page. Without editing the table generated by SAS, make the table have the same layout,

group order, labelling and formatting (including decimals) as shown. Note the text in the top left box of

the table and the use of NA rather than a dot for values not assessed and the footnote. You may need to

pre-sort the data and use an option to get the table to print the groups in the correct group order.

Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21

Assignment 3

2 of 5

VO2max (L/min)
Total Treatment Group

Control LALI HALI HAHI

N N Mean SE N Mean SE N Mean SE N Mean SE

Overall Baseline 299 74 2.69 0.09 73 2.64 0.08 76 2.71 0.08 76 2.71 0.08

Change at 4 weeks 188 0 NA NA 62 0.11 0.03 67 0.20 0.03 59 0.27 0.03

Change at 8 weeks 169 0 NA NA 55 0.25 0.03 63 0.33 0.03 51 0.39 0.03

Change at 16 weeks 182 24 0.01 0.04 52 0.26 0.03 60 0.37 0.04 46 0.47 0.04

Change at 24 weeks 194 43 -0.02 0.04 53 0.24 0.03 56 0.40 0.04 42 0.54 0.05

By Sex
104 26 3.48 0.13 24 3.43 0.11 27 3.41 0.11 27 3.40 0.11

Male Baseline

Change at 4 weeks 64 0 NA NA 19 0.10 0.07 24 0.23 0.05 21 0.37 0.07

Change at 8 weeks 63 0 NA NA 21 0.31 0.06 24 0.40 0.06 18 0.44 0.05

Change at 16 weeks 70 10 0.02 0.07 19 0.37 0.06 24 0.46 0.08 17 0.61 0.08

Change at 24 weeks 69 13 0.04 0.07 18 0.39 0.07 21 0.49 0.09 17 0.68 0.08

Female Baseline 195 48 2.26 0.06 49 2.26 0.05 49 2.32 0.05 49 2.33 0.05

Change at 4 weeks 124 0 NA NA 43 0.11 0.03 43 0.19 0.03 38 0.22 0.03

Change at 8 weeks 106 0 NA NA 34 0.21 0.03 39 0.28 0.03 33 0.36 0.04

Change at 16 weeks 112 14 0.01 0.04 33 0.19 0.04 36 0.32 0.03 29 0.38 0.04

Change at 24 weeks 125 30 -0.04 0.04 35 0.17 0.03 35 0.35 0.04 25 0.43 0.06

SE-Standard Error, LALI-low amount at low intensity, HALI-low amount at low intensity, HAHI-high amount at high intensity, NA-not assessed

Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21

Assignment 3

3 of 5

3. Creating an html report with drill down capability

The SAS data sets clients, projects and time from the assignment 3 folder contain information from a

statistician’s consulting database. The time dataset is related to the project dataset by the ProjectID variable

which is a unique identifier for each project. The projects dataset is related to the clients dataset by the client

name (FirstName and LastName).

1. Create a temporary clients dataset sorted by LastName Firstname and a temporary projects dataset
sorted by ProjectID. While doing this have SAS check for and remove any duplicate records (i.e.

multiple client records with the exact same first and last name and multiple project records with the

same ProjectID). In a comment, tell me how many duplicates were found and removed from each

dataset.

2. Calculate the first date, last date and total hours worked for each project from the time dataset and merge
this information into the projects dataset. Only keep projects with a matching record in the time dataset.

3. Calculate the total number of projects and the total hours across all projects for each client, and merge
this information into the clients dataset. Only keep records for clients who have some data in the projects

dataset.

4. Use data step programming to generate html files named Clients.html and Projectes.html that are
identical to the files of the same name posted onQ. Notice that I report the time worked in days rather

than hours by assuming that one day is 7.5 hours. I also want you to present the days worked with one

decimal and the dates in word format as I have. If you use the exact same html tags your files should

look just like mine in a web browser. Notice that when you click on any part of a client’s name in the

Clients.html you jump to their project info as reported in Projects.html. You can view files’ raw

contents (including tags) in SAS studio by right clicking on them and selecting “View File as Text”; or if

you are viewing in Google Chrome, you can right click and choose to view “view page source”. I suggest

you add the linking tags (i.e. the anchors) after everything else is done properly. The SAS PUT statement

automatically adds a space after variable values. Use the technique we’ve seen several times to remove

these spaces. This is important since an extra space at the end of the hyperlink anchors might stop the

hyperlinks from working properly.

WARNING: You’ll need to put both Clients.html and Projectes.html in the same folder on your computer for

the hyperlinks to work. The hyperlinks won’t work properly directly from onQ or directly in the SAS OnDemand

environment or if they are Zipped, so you will need to download them to your computer (and if necessary unZip

them) to check if the hyperlinks work.

The following html tags are used in the reports.

must start and end the file.

bolds text

– hard return. Actual hard returns in the source file are not recognized by the browser but should be used to

make the source file easier to read.

main title

-Makes main title centered and in header1 style.

LastName, FirstName
-creates a

hyperlink so that when you click on the text Lastname, FirstName you jump to the LastName Firstname

destination anchor in the Projects.html file.

Creates a destination anchor in Projects.html for the previous

hyperlink to jump to.

Lastname, FirstName

– displays Lastname, FirstName in Projects.html as a level 2 header.

Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21

Assignment 3

4 of 5

4. Creating a Macro to Examine Correlation:
a)

If x and x’ are distributed as two independent standard normal variables and y=rho*x+sqrt(1-rho2)*x’, then x and y

have a standard bivariate normal distribution with correlation=rho. From this standard bivariate distribution, you

can scale x any y to have any mean and standard deviation by simply multiplying by the standard deviation and

adding the mean. The result remains bivariate with a correlation between x and y of rho.

Create a macro named binorm that creates a data set with the random variables x and y generated from a bivariate

normal distribution. Give binorm the keyword parameters mux, muy, stdx, stdy, rho and seed that allow you to

set the means, standard deviations, correlation between x and y and the seed for the random number generator.

Have these parameters default to generate x and y from two independent (i.e. rho=0) standard normal distributions

with a seed based on the computers internal clock. Add the parameters n and sims that allow you to specify how

many observations to generate. You will generate n observations for each of sims simulations. Have n and sims

default to 100 and 1 respectively. Add a parameter named outdata that allows you to define the name of the data

set created. outdata should default to binorm. The final data set will have n*sims observations with the variables

simnum, x and y only. simnum will be 1 for the first n records, 2 for the next n records and so on.

Add a final parameter to the macro named plot that defaults to N. When plot is set to Y, have the macro generate a

simple scatter plot of y vs. x. Make the plot parameter insensitive to case and any additional letters after the first

letter so that a plot will be generated if and only if the plot parameter value that starts with y or Y.

Put a main title on the figure saying “Plot of y vs. x” and a second title that would say “n=10000, X~N(0, 1),

Y~N(0, 1), corr(x, y)=0” for the default settings, but would report the actual settings if they were different from

the default. Note that in the title N(0, 1) means normal with a mean of 0 and variance of 1 which is the standard

normal distribution, so 0 and 1 may change depending on the settings of the parameters. Define any new macro

variables created in the macro as local to the macro. Make sure no extra spaces appear in the title. When the

plot=Y, have the macro generate a separate plot for each simulation. The simulation number should appear

automatically as the third title on the plot.

Run a simulation of n= 100,000 and use PROC CORR to validate that the generated estimates are close to your

parameter settings. Try for a couple settings.

Try running this macro with default settings except turn the plot option on and run for correlations 0, 0.2, 0.5 and

0.9 respectively. Next try running five simulations of sample size 50 with plots for correlations 0 and 0.2. With a

sample size of 50 can you tell the difference between a correlation of 0 and 0.2?

b) The remainder of this assignment is for STAT866 only:

The formula given above is an easy way to generate bivariate normal data. However, this approach will not

generate multivariate normal data of more than 2 dimensions. SAS/STAT has a procedure named PROC

SIMNORMAL that is specifically designed to simulate multivariate normal data. Try using PROC SIMORMAL in

your macro instead of the data step method you just used. For this question, we will stick with the bivariate normal

distribution, so using PROC SIMNORMAL is an unnecessary complex approach, but I want to expose you to

PROC SIMNORMAL, the concept of special dataset types, passing macro variable values to SAS variables, and

Rick Wicklin’s excellent SAS blog. I suggest you read this following blog before continuing with this question:

https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html .

https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html

Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21

Assignment 3

5 of 5

The trickiest part of this question is that you need to create a special TYPE=CORR dataset containing the

specifications for PROC SIMNORMAL. Bellow, I show what this dataset would look like for a bivariate normal

distribution where X and Y are correlated with rho=0.5 and X~N(2, 9) and Y~N(3, 4).

Obs _type_ _name_ X Y

1 MEAN 2.0 3.0

2 STD 3.0 2.0

3 CORR X 1.0 0.5

4 CORR Y 0.5 1.0

You may notice the dataset in Rick Whitlam’s blog contained a row where _type_=’N’. This is because his dataset

was generated by PROC CORR which includes a row indicating the original sample size, but PROC

SIMNORMAL does not use the N row. Instead, it uses an option named NUMREAL to determine how many

random observations (i.e. number of realizations) to generate. One complication with creating this special dataset

is that you cannot use instream data, because macro variables do not resolve in DATALINES. So instead, you will

have to use a data step with four OUTPUT statements to create the 4 rows required by the special TYPE=CORR

dataset as shown above.

Another issue you will have to contend with, is that you need a variable indicating the simulation number so you

can do things like produce separate plots by simulation. There are two approaches you can take to get this form

PROC SIMNORMAL. The less efficient approach is to replicate your special TYPE=CORR dataset once for each

of the sims simulations, and then run PROC SIMNOMRAL BY simnum. But a more efficient approach (which

you will need to use for full marks) is to tell PROC SIMNORMAL to generate n*sims observations and then

obtain the simulation number from the overall observation number. For example, if we wanted 5 simulations of

sample size 100, then the simnum variable should be 1 for the first 100 observations, and 2 for the next 100

observations and so on. A simple expression using the int or floor function could do this.

We’ll see in our last week of class that multivariate simulation may be easier to do using SAS PROC IML. This is

also described in Rick Wicklin’s blog at: https://blogs.sas.com/content/iml/2011/01/12/sampling-from-the-

multivariate-normal-distribution.html But for now please try to do this with PROC SIMNORMAL as described

above.

https://blogs.sas.com/content/iml/2011/01/12/sampling-from-the-multivariate-normal-distribution.html
https://blogs.sas.com/content/iml/2011/01/12/sampling-from-the-multivariate-normal-distribution.html

main title

Lastname, FirstName

Related Posts