Assignment 3
Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21
Assignment 3
1 of 5
Note: For this assignment, you again only need to submit your SAS program file. Use internal comments to
identify the question numbers and respond to questions. Please ensure your SAS program file is executable. Two
points will be rewarded for making your code as readable as possible by effective using indentation, white space,
and some (but not excessive) internal comments – we will discuss this in class.
1. Stet up
a) Start your program with the OPTIONS statement to set the following global options: 1) have the log
display a warning if a MERGE statement is run without a corresponding BY statement, 2) have SAS
provide a slightly more informative log that will note every time a variable is overwritten when merging
data sets, 3) allow SAS to use data sets that contain unrecognized formats by ignoring the formats rather
than the default action of issuing a critical error and not reading the data set.
b) Create a macro variable named path that will define the path containing every file (input or output) used
for this assignment. Have the macro variable include the final forward or backward slash in the path so
that when we mark your code whether we are running Windows or OnDemand (Unix) all we need to
change is the path macro variable and the rest of your code will run without editing.
2. Descriptive summary of an exercise study
The Excel workbook named ExerciseRCT.xlsx contains data from a randomized controlled exercise trial
where participants were randomly assigned in equal numbers to one of four exercise groups: Control, LALI
(low amount at low intensity) HALI (high amount at low intensity), and HAHI (high amount at high intensity).
Measurements including VO2max (a measure of cardio fitness) were taken at baseline and at weeks 4, 8, 16
and 24.
a) Create new variables containing the change in VO2max from baseline (week 0) to weeks 4, 8, 16 and 24.
b) Create a boxplots exactly like shown in the figure on the next page. Note the red dotted reference line, the
labels, the horizontal gridlines and the order of the groups. Also note the look of the boxes and the number
of observations (nobs) listed below each box. Hints: To make multiple boxplots in one panel you can use
the GROUP and/or CATEGORY options. The boxes produced by GROUP and CATEGORY have a
different appearance and different options to order the boxes and to display statistics below the boxes. You
may need to pre-sort the data and use an option of the ROWAXIS statement to tell SAS to present the
boxplots in the order of the data.
c) Use a SAS procedure to create a table summarizing the change scores overall and by sex as shown in the
table on the next page. Without editing the table generated by SAS, make the table have the same layout,
group order, labelling and formatting (including decimals) as shown. Note the text in the top left box of
the table and the use of NA rather than a dot for values not assessed and the footnote. You may need to
pre-sort the data and use an option to get the table to print the groups in the correct group order.
Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21
Assignment 3
2 of 5
VO2max (L/min)
Total Treatment Group
Control LALI HALI HAHI
N N Mean SE N Mean SE N Mean SE N Mean SE
Overall Baseline 299 74 2.69 0.09 73 2.64 0.08 76 2.71 0.08 76 2.71 0.08
Change at 4 weeks 188 0 NA NA 62 0.11 0.03 67 0.20 0.03 59 0.27 0.03
Change at 8 weeks 169 0 NA NA 55 0.25 0.03 63 0.33 0.03 51 0.39 0.03
Change at 16 weeks 182 24 0.01 0.04 52 0.26 0.03 60 0.37 0.04 46 0.47 0.04
Change at 24 weeks 194 43 -0.02 0.04 53 0.24 0.03 56 0.40 0.04 42 0.54 0.05
By Sex
104 26 3.48 0.13 24 3.43 0.11 27 3.41 0.11 27 3.40 0.11
Male Baseline
Change at 4 weeks 64 0 NA NA 19 0.10 0.07 24 0.23 0.05 21 0.37 0.07
Change at 8 weeks 63 0 NA NA 21 0.31 0.06 24 0.40 0.06 18 0.44 0.05
Change at 16 weeks 70 10 0.02 0.07 19 0.37 0.06 24 0.46 0.08 17 0.61 0.08
Change at 24 weeks 69 13 0.04 0.07 18 0.39 0.07 21 0.49 0.09 17 0.68 0.08
Female Baseline 195 48 2.26 0.06 49 2.26 0.05 49 2.32 0.05 49 2.33 0.05
Change at 4 weeks 124 0 NA NA 43 0.11 0.03 43 0.19 0.03 38 0.22 0.03
Change at 8 weeks 106 0 NA NA 34 0.21 0.03 39 0.28 0.03 33 0.36 0.04
Change at 16 weeks 112 14 0.01 0.04 33 0.19 0.04 36 0.32 0.03 29 0.38 0.04
Change at 24 weeks 125 30 -0.04 0.04 35 0.17 0.03 35 0.35 0.04 25 0.43 0.06
SE-Standard Error, LALI-low amount at low intensity, HALI-low amount at low intensity, HAHI-high amount at high intensity, NA-not assessed
Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21
Assignment 3
3 of 5
3. Creating an html report with drill down capability
The SAS data sets clients, projects and time from the assignment 3 folder contain information from a
statistician’s consulting database. The time dataset is related to the project dataset by the ProjectID variable
which is a unique identifier for each project. The projects dataset is related to the clients dataset by the client
name (FirstName and LastName).
1. Create a temporary clients dataset sorted by LastName Firstname and a temporary projects dataset
sorted by ProjectID. While doing this have SAS check for and remove any duplicate records (i.e.
multiple client records with the exact same first and last name and multiple project records with the
same ProjectID). In a comment, tell me how many duplicates were found and removed from each
dataset.
2. Calculate the first date, last date and total hours worked for each project from the time dataset and merge
this information into the projects dataset. Only keep projects with a matching record in the time dataset.
3. Calculate the total number of projects and the total hours across all projects for each client, and merge
this information into the clients dataset. Only keep records for clients who have some data in the projects
dataset.
4. Use data step programming to generate html files named Clients.html and Projectes.html that are
identical to the files of the same name posted onQ. Notice that I report the time worked in days rather
than hours by assuming that one day is 7.5 hours. I also want you to present the days worked with one
decimal and the dates in word format as I have. If you use the exact same html tags your files should
look just like mine in a web browser. Notice that when you click on any part of a client’s name in the
Clients.html you jump to their project info as reported in Projects.html. You can view files’ raw
contents (including tags) in SAS studio by right clicking on them and selecting “View File as Text”; or if
you are viewing in Google Chrome, you can right click and choose to view “view page source”. I suggest
you add the linking tags (i.e. the anchors) after everything else is done properly. The SAS PUT statement
automatically adds a space after variable values. Use the technique we’ve seen several times to remove
these spaces. This is important since an extra space at the end of the hyperlink anchors might stop the
hyperlinks from working properly.
WARNING: You’ll need to put both Clients.html and Projectes.html in the same folder on your computer for
the hyperlinks to work. The hyperlinks won’t work properly directly from onQ or directly in the SAS OnDemand
environment or if they are Zipped, so you will need to download them to your computer (and if necessary unZip
them) to check if the hyperlinks work.
The following html tags are used in the reports.
must start and end the file.
bolds text
– hard return. Actual hard returns in the source file are not recognized by the browser but should be used to
make the source file easier to read.
main title
-Makes main title centered and in header1 style.
LastName, FirstName
-creates a
hyperlink so that when you click on the text Lastname, FirstName you jump to the LastName Firstname
destination anchor in the Projects.html file.
Creates a destination anchor in Projects.html for the previous
hyperlink to jump to.
Lastname, FirstName
– displays Lastname, FirstName in Projects.html as a level 2 header.
Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21
Assignment 3
4 of 5
4. Creating a Macro to Examine Correlation:
a)
If x and x’ are distributed as two independent standard normal variables and y=rho*x+sqrt(1-rho2)*x’, then x and y
have a standard bivariate normal distribution with correlation=rho. From this standard bivariate distribution, you
can scale x any y to have any mean and standard deviation by simply multiplying by the standard deviation and
adding the mean. The result remains bivariate with a correlation between x and y of rho.
Create a macro named binorm that creates a data set with the random variables x and y generated from a bivariate
normal distribution. Give binorm the keyword parameters mux, muy, stdx, stdy, rho and seed that allow you to
set the means, standard deviations, correlation between x and y and the seed for the random number generator.
Have these parameters default to generate x and y from two independent (i.e. rho=0) standard normal distributions
with a seed based on the computers internal clock. Add the parameters n and sims that allow you to specify how
many observations to generate. You will generate n observations for each of sims simulations. Have n and sims
default to 100 and 1 respectively. Add a parameter named outdata that allows you to define the name of the data
set created. outdata should default to binorm. The final data set will have n*sims observations with the variables
simnum, x and y only. simnum will be 1 for the first n records, 2 for the next n records and so on.
Add a final parameter to the macro named plot that defaults to N. When plot is set to Y, have the macro generate a
simple scatter plot of y vs. x. Make the plot parameter insensitive to case and any additional letters after the first
letter so that a plot will be generated if and only if the plot parameter value that starts with y or Y.
Put a main title on the figure saying “Plot of y vs. x” and a second title that would say “n=10000, X~N(0, 1),
Y~N(0, 1), corr(x, y)=0” for the default settings, but would report the actual settings if they were different from
the default. Note that in the title N(0, 1) means normal with a mean of 0 and variance of 1 which is the standard
normal distribution, so 0 and 1 may change depending on the settings of the parameters. Define any new macro
variables created in the macro as local to the macro. Make sure no extra spaces appear in the title. When the
plot=Y, have the macro generate a separate plot for each simulation. The simulation number should appear
automatically as the third title on the plot.
Run a simulation of n= 100,000 and use PROC CORR to validate that the generated estimates are close to your
parameter settings. Try for a couple settings.
Try running this macro with default settings except turn the plot option on and run for correlations 0, 0.2, 0.5 and
0.9 respectively. Next try running five simulations of sample size 50 with plots for correlations 0 and 0.2. With a
sample size of 50 can you tell the difference between a correlation of 0 and 0.2?
b) The remainder of this assignment is for STAT866 only:
The formula given above is an easy way to generate bivariate normal data. However, this approach will not
generate multivariate normal data of more than 2 dimensions. SAS/STAT has a procedure named PROC
SIMNORMAL that is specifically designed to simulate multivariate normal data. Try using PROC SIMORMAL in
your macro instead of the data step method you just used. For this question, we will stick with the bivariate normal
distribution, so using PROC SIMNORMAL is an unnecessary complex approach, but I want to expose you to
PROC SIMNORMAL, the concept of special dataset types, passing macro variable values to SAS variables, and
Rick Wicklin’s excellent SAS blog. I suggest you read this following blog before continuing with this question:
https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html .
https://blogs.sas.com/content/iml/2017/09/25/simulate-multivariate-normal-data-sas-simnormal.html
Stat 466/866, Fall 2021 Due: 11:59pm Sunday, November 21
Assignment 3
5 of 5
The trickiest part of this question is that you need to create a special TYPE=CORR dataset containing the
specifications for PROC SIMNORMAL. Bellow, I show what this dataset would look like for a bivariate normal
distribution where X and Y are correlated with rho=0.5 and X~N(2, 9) and Y~N(3, 4).
Obs _type_ _name_ X Y
1 MEAN 2.0 3.0
2 STD 3.0 2.0
3 CORR X 1.0 0.5
4 CORR Y 0.5 1.0
You may notice the dataset in Rick Whitlam’s blog contained a row where _type_=’N’. This is because his dataset
was generated by PROC CORR which includes a row indicating the original sample size, but PROC
SIMNORMAL does not use the N row. Instead, it uses an option named NUMREAL to determine how many
random observations (i.e. number of realizations) to generate. One complication with creating this special dataset
is that you cannot use instream data, because macro variables do not resolve in DATALINES. So instead, you will
have to use a data step with four OUTPUT statements to create the 4 rows required by the special TYPE=CORR
dataset as shown above.
Another issue you will have to contend with, is that you need a variable indicating the simulation number so you
can do things like produce separate plots by simulation. There are two approaches you can take to get this form
PROC SIMNORMAL. The less efficient approach is to replicate your special TYPE=CORR dataset once for each
of the sims simulations, and then run PROC SIMNOMRAL BY simnum. But a more efficient approach (which
you will need to use for full marks) is to tell PROC SIMNORMAL to generate n*sims observations and then
obtain the simulation number from the overall observation number. For example, if we wanted 5 simulations of
sample size 100, then the simnum variable should be 1 for the first 100 observations, and 2 for the next 100
observations and so on. A simple expression using the int or floor function could do this.
We’ll see in our last week of class that multivariate simulation may be easier to do using SAS PROC IML. This is
also described in Rick Wicklin’s blog at: https://blogs.sas.com/content/iml/2011/01/12/sampling-from-the-
multivariate-normal-distribution.html But for now please try to do this with PROC SIMNORMAL as described
above.
https://blogs.sas.com/content/iml/2011/01/12/sampling-from-the-multivariate-normal-distribution.html
https://blogs.sas.com/content/iml/2011/01/12/sampling-from-the-multivariate-normal-distribution.html