BIST 515: Introduction to Statistical Software
Midterm Exam
Due date: 10/22/2021 11:59PM
Complete the problems below. Within each part, include your SAS program code,
all corresponding output, and any additional information needed to explain your answer.
1. (30 total points) This problem is a continuation of problem #2 on homework #2.
(a) (5 points) Construct a scatter plot of the 40-yard dash times vs. the bench press
weight. In your plot, include the following:
• Vary the plotting symbols and their colors by position, using the following
color assignments: DB = black, LB = red, OL = blue, RB = darkgreen,
S = purple, TE = orange, and WO = gray.
• Gridlines
• Y- and X-axis labels of “40-yard dash (seconds)” and “Bench press
repetitions”, respectively.
• The name of the player with the largest bench press value next to its
corresponding plotting point. Identify the value using PROC MEANS and
manually insert it into a data step (see
http://blogs.sas.com/content/iml/2011/11/11/label-only-certain-observations-with-proc-sgplot.html).
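The labeling technique in the last bullet can be sketched as follows. This is only an illustration on the SASHELP.CLASS data set that ships with SAS, not on the homework data. The data set name class_labeled and the variable label are hypothetical, and the value 72 is assumed to be the maximum height that PROC MEANS reports for SASHELP.CLASS.

```sas
/* Step 1: find the largest value and read it from the PROC MEANS output. */
proc means data=sashelp.class max;
   var height;
run;

/* Step 2: copy the name into a label variable only for that observation.
   The 72 is typed in by hand from the PROC MEANS output (assumed value). */
data class_labeled;
   set sashelp.class;
   length label $ 8;
   if height = 72 then label = name;
   else label = ' ';
run;

/* Step 3: DATALABEL= prints the label variable next to each point;
   observations with a blank label get no text. */
proc sgplot data=class_labeled;
   scatter x=weight y=height / group=sex datalabel=label;
   xaxis grid label="Weight (pounds)";
   yaxis grid label="Height (inches)";
run;
```

The GROUP= option varies the marker attributes by the grouping variable, and the GRID option on each axis statement draws the gridlines.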
(b) (10 points) For part (a), reconstruct the same plot without manually
inserting the value. Use a macro variable created from a data step to help you
complete the problem.
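One possible way to pass a computed value into later code through a macro variable is sketched below, again on SASHELP.CLASS rather than the homework data. The names stats, maxheight, and class_labeled are hypothetical.

```sas
/* Store the maximum in a data set instead of reading it off the output. */
proc means data=sashelp.class noprint;
   var height;
   output out=stats max=maxheight;
run;

/* CALL SYMPUTX copies the value into the macro variable &maxheight. */
data _null_;
   set stats;
   call symputx('maxheight', maxheight);
run;

/* The macro variable now replaces the hand-typed value. */
data class_labeled;
   set sashelp.class;
   length label $ 8;
   if height = &maxheight then label = name;
run;
```

Because the value is resolved when the last data step is compiled, the code keeps working if the underlying data change.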
(c) (15 points) Construct a macro function that produces a scatter plot like in part
(a) again, but now for any two numerical variables in the data set. The function
should not need any changes to its code for any plot. Below are further details
regarding this function and its produced plot:
• A call to the macro function should be of this form:
%myplot(xvar, yvar);
where myplot is the name of the macro function, xvar is the variable for the
x-axis, and yvar is the variable for the y-axis.
• Make sure that appropriate title and axis labels are included on the plot
(these can use just the variable names).
• Remove any labeling of points by a corresponding player name.
• Run the macro function for Dash40 vs. BenchPress and Height vs. Weight.
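The general shape of a macro with two positional parameters can be sketched as follows. This is a minimal illustration that assumes the SASHELP.CLASS data set and its variables in place of the homework data; only the macro name and parameter names come from the problem statement.

```sas
%macro myplot(xvar, yvar);
   /* The parameters are substituted wherever &xvar and &yvar appear. */
   title "Scatter plot of &yvar vs. &xvar";
   proc sgplot data=sashelp.class;
      scatter x=&xvar y=&yvar / group=sex;
      xaxis grid label="&xvar";
      yaxis grid label="&yvar";
   run;
   title;   /* clear the title so it does not carry over to later output */
%mend myplot;

%myplot(weight, height);
%myplot(age, weight);
```

Double quotes are needed around the title and labels, since macro variables do not resolve inside single quotes.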
2. (25 total points) “Stability testing” is performed by pharmaceutical companies to
determine the shelf life for drug products. Typically, part of a drug batch (like a number
of pills) is put into storage in a controlled temperature and humidity environment. At
regular time points, an item is taken out of storage and testing is performed on it. A
common response measured on each item is potency. Over time, the potency of a drug
will usually degrade, so the Food and Drug Administration (FDA) has set a 95% lower
limit of the desired potency level which the drug needs to remain above. The exact
time point where the drug goes below this limit is the shelf life. This shelf life (say, 4
months) then is added to the manufacturing date of a drug to find the expiration date,
which is what consumers often see printed on drug packaging.
The shelf life is found with the help of regression models. To show how this is done,
below is a simulated data set where the potency of a drug has been measured over
time in months. Suppose a single pill has been measured at each time point.
Time Potency
3 1.015545
6 0.9835495
9 0.9957994
12 0.9836627
15 0.986323
18 0.9945146
21 0.999571
24 0.9679062
30 0.9690051
36 0.9891509
48 0.9674187
60 0.9557498
For example, the pill taken out of storage at time 3 months had a potency of 101.55%
of the desired potency level. Using this data, complete the following problems.
(a) (5 points) Estimate and state the sample regression model with time as the
explanatory variable and potency as the response variable. Use proc reg to perform
the estimation and make sure that no plots are produced by the procedure.
Interpret the relationship between time and potency as given by the model.
(b) (5 points) Use proc reg again as in part (a), but include the plot with 95%
confidence interval bands for the expected potency. No other plots should be
included in the output! I recommend using the SAS help to determine the correct
coding specification.
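As a hedged illustration of how the PLOTS= option on the PROC REG statement controls graphical output, the sketch below uses SASHELP.CLASS rather than the potency data. The option spellings should be confirmed against the SAS help for the installed release.

```sas
ods graphics on;

/* PLOTS=NONE suppresses every ODS graph from the procedure. */
proc reg data=sashelp.class plots=none;
   model weight = height;
run;
quit;

/* PLOTS(ONLY)=FIT requests just the fit plot. NOCLI drops the prediction
   limits, leaving the confidence band for the mean (95% by default). */
proc reg data=sashelp.class plots(only)=fit(nocli);
   model weight = height;
run;
quit;
```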
(c) (15 points) The FDA has guidelines to determine the shelf life of a drug. The shelf
life is the time point where the 95% confidence interval lower band plot intersects
a horizontal line drawn at a 95% potency level. In a more mathematical context,
this is represented as the X value
\[
\hat{Y} - t_{0.975,\,n-2}\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(X-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\right)} - 0.95 = 0 \qquad (1)
\]
where $\hat{Y}$ is the estimated potency from the regression model at a time point $X$, $n$
is the sample size, $t_{0.975,\,n-2}$ is the 0.975 quantile from a t-distribution with $n-2$
degrees of freedom, and $\hat{\sigma}^2$ is the mean square error defined as
$\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2/(n-2)$.
Solve the above equation for X to one decimal place of precision. There are
a number of ways that this can be done. You can use a simple grid search
(calculate the left-hand side of the equation for a set of values of X and look
for the value of X that brings it closest to zero) or a bisection search.
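The bisection idea can be sketched in a data step as follows. The macro %g is a hypothetical stand-in for the left-hand side of equation (1): a straight line chosen only because its root is easy to check, not the actual confidence band. The starting bracket of 0 to 200 is likewise an assumption.

```sas
/* Hypothetical stand-in for the left-hand side of equation (1):
   a straight line that crosses zero at x = 50. */
%macro g(x);
   ((1 - 0.001*(&x)) - 0.95)
%mend g;

data _null_;
   lo = 0;                      /* g(lo) > 0 is assumed */
   hi = 200;                    /* g(hi) < 0 is assumed */
   do while (hi - lo > 0.01);   /* stop once the bracket is narrower than 0.01 */
      mid = (lo + hi) / 2;
      /* keep the half of the bracket in which the sign change occurs */
      if %g(lo) * %g(mid) <= 0 then hi = mid;
      else lo = mid;
   end;
   root = (lo + hi) / 2;
   put 'Approximate root: ' root 8.1;
run;
```

A grid search would replace the loop with a DO loop over X in steps of 0.1, keeping the value of X where the expression is closest to zero.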
3. (25 total points) Construct a macro function that uses PROC IML to do the following
(you must use IML to do all calculations):
(a) (5 points) Simulate a data set with a list of variables (X1, ..., X5). (Note: X1, ..., X5
should be correlated.)
(b) (5 points) Calculate the correlation matrix. Find the pair of variables that have the
highest correlation and print the result:
VARIABLES XXX AND XXX HAVE THE HIGHEST CORRELATION COEFFICIENT
WHICH IS XXX.
(c) (5 points) Make a histogram for each variable and a scatter plot of the variables
you identified in (b). Add a legend with the mean and standard deviation of each
variable.
(d) (10 points) Fit the regression model for the variables identified in (b). Choose any one
of the variables as the dependent variable and the other as the independent variable.
Your output should include estimates of the intercept and slope, their standard
errors, t-ratios with P-values, and R-square.
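The first two steps, simulating correlated variables and locating the largest off-diagonal correlation, might be sketched in PROC IML as below. The seed, the sample size of 200, and the covariance matrix (unit variances with all covariances equal to 0.5) are hypothetical choices, and the code is shown outside a macro for brevity.

```sas
proc iml;
   call randseed(515);                  /* hypothetical seed */
   mu    = j(1, 5, 0);                  /* zero mean vector (assumption) */
   sigma = j(5, 5, 0.5) + 0.5 * i(5);   /* hypothetical covariance matrix */
   x = randnormal(200, mu, sigma);      /* 200 x 5 multivariate normal sample */

   r = corr(x);                         /* 5 x 5 sample correlation matrix */
   names = "X1":"X5";

   /* scan the upper triangle for the largest correlation */
   best = -2;
   do row = 1 to 4;
      do col = row + 1 to 5;
         if r[row, col] > best then do;
            best = r[row, col];
            bi = row;
            bj = col;
         end;
      end;
   end;

   print (names[bi])[label="Variable 1"]
         (names[bj])[label="Variable 2"]
         best[label="Correlation"];
quit;
```

Whether "highest" should mean the largest signed value, as here, or the largest absolute value is left open by the problem statement.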
4. (20 total points) Suppose that a driver’s license renewal test consists of ten
single-choice questions. Each question has five choices (A-E). Each day, we enter the
test results into a SAS data set shown below. Each observation contains a single
person’s answers (the data set is also available on Canvas).
ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
291192 A C C B D E D B B A
593137 B C C E E D B A A
721311 A C C B D D E B B C
345221 B C C A D B B C A D
193920 A C C B E E D B B A
257672 B C C B D D D B B A
357899 C C C B E E E B B A
564332 A C C B E E D B B A
111033 A C B D D D B B A
445732 C C C C E E D B B B
824610 B B E B E E D B B A
774235 A C C B E E D B B A
943244 C C C B E E D B B A
647893 A C C B E E E B B A
432118 A C C B E E D B B A
The correct answers for the questions are shown below:
Question: 1 2 3 4 5 6 7 8 9 10
Answer: A C C B E E D B B A
Read the data and calculate the number of correct answers and the score for each
person. Note that the first 5 questions are worth 1 point each and the rest are worth
2 points each. Also determine whether each person passed or failed the test: a person
passes with 7 or more correct answers or a score of 10 or higher. You must use a SAS
array, and your program should print a report showing the frequency and percent of
passing, and the average number of correct answers and the average score. Also provide
a bar chart of the percent of correct answers for each question. Provide a notification
for each person in one of the following two forms:
ID : XXXXXXXXXX
Correct : x/xx
Score : xx/xx
Based on the test score, sorry to inform you that you couldn’t pass the test. Passing
score is 7 or more correct answers or 10 or higher raw score.
OR
ID : XXXXXXXXXX
Correct : x/xx
Score : xx/xx
Based on the test score, it is pleasure to inform you that you passed the test.
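The array-based scoring described in this problem can be sketched on a smaller, hypothetical three-question test. The data lines, answer key, point values, and passing rule below are made up for illustration and are not those of the exam.

```sas
data scored;
   input id $ q1 $ q2 $ q3 $;
   array q{3} $ q1-q3;                           /* answers given        */
   array key{3} $ 1 _temporary_ ('A' 'C' 'C');   /* hypothetical key     */
   array pts{3} _temporary_ (1 1 2);             /* hypothetical points  */
   length result $ 4;
   correct = 0;
   score = 0;
   do k = 1 to 3;
      if q{k} = key{k} then do;
         correct = correct + 1;
         score = score + pts{k};
      end;
   end;
   /* hypothetical passing rule: at least 2 correct or a score of 3 or more */
   if correct >= 2 or score >= 3 then result = 'Pass';
   else result = 'Fail';
   drop k;
   datalines;
1001 A C C
1002 B C A
1003 A B C
;
run;

/* frequency and percent of passing */
proc freq data=scored;
   tables result;
run;

/* average number of correct answers and average score */
proc means data=scored mean;
   var correct score;
run;
```

The temporary arrays hold the key and the point values without adding variables to the output data set.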