BIST 515: Introduction to Statistical Software

Midterm Exam

Due date: 10/22/2021 11:59PM

Complete the problems below. Within each part, include your SAS program code,

all corresponding output, and any additional information needed to explain your answer.

1. (30 total points) This problem is a continuation of problem #2 on homework #2.

(a) (5 points) Construct a scatter plot of the 40-yard dash times vs. the bench press

weight. In your plot, include the following:

• Vary the plotting symbols and their color by the position with the following

assignments:

The specific color names are DB = black, LB = red, OL = blue, RB =

darkgreen, S = purple, TE = orange, and WO = gray.

• Gridlines

• Y- and X-axis labels of “40-yard dash (seconds)” and “Bench press repetitions”,

respectively.

• The name of the player with the largest bench press value next to its corresponding

plotting point. Identify the value using PROC MEANS and manually insert it into

a data step; see

http://blogs.sas.com/content/iml/2011/11/11/label-only-certain-observations-with-proc-sgplot.html

(b) (10 points) Now reconstruct the plot from part (a) without the manual insertion.

Use a macro variable created in a data step to help you

complete the problem.
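The general technique of passing a data-step value into later code through a macro variable can be sketched as follows. This is only an illustration under assumptions: the data set name `combine` and the variables `Name` and `Position` are hypothetical, while `BenchPress` and `Dash40` are the variable names used in part (c). The symbol and color assignments from part (a) are not shown.

```sas
/* Hypothetical names: data set COMBINE, variables Name and Position. */
proc means data=combine noprint;
  var BenchPress;
  output out=maxbp max=MaxBench;      /* one-row data set holding the maximum */
run;

data _null_;
  set maxbp;
  call symputx('maxbench', MaxBench); /* store the maximum in macro variable &maxbench */
run;

data combine_lab;
  set combine;
  length PlayerLabel $ 40;
  /* label only the observation(s) that attain the maximum */
  if BenchPress = &maxbench then PlayerLabel = Name;
  else PlayerLabel = ' ';
run;

proc sgplot data=combine_lab;
  scatter x=BenchPress y=Dash40 / group=Position datalabel=PlayerLabel;
  xaxis grid label="Bench press repetitions";
  yaxis grid label="40-yard dash (seconds)";
run;
```

Because the label variable is blank for every other observation, only the point with the largest bench press value is annotated.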

(c) (15 points) Construct a macro function that produces a scatter plot like in part

(a) again, but now for any two numerical variables in the data set. The function

should not need any changes to its code for any plot. Below are further details

regarding this function and its produced plot:

• A call to the macro function should be of this form:

%myplot(xvar, yvar);

where myplot is the name of the macro function, xvar is the variable for the

x-axis, and yvar is the variable for the y-axis.

• Make sure that appropriate title and axis labels are included on the plot

(these can use just the variable names).

• Remove any labeling of points by a corresponding player name.

• Run the macro function for Dash40 vs. BenchPress and Height vs. Weight.
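A macro with two positional parameters, as described in the list above, might take roughly the following shape. This is a sketch, not a complete answer: the data set name `combine` and the grouping variable `Position` are assumptions, and the calls read "Dash40 vs. BenchPress" as y vs. x.

```sas
/* Sketch only: data set COMBINE and variable Position are hypothetical names. */
%macro myplot(xvar, yvar);
  title "Scatter plot of &yvar vs. &xvar";
  proc sgplot data=combine;
    scatter x=&xvar y=&yvar / group=Position;
    xaxis grid label="&xvar";   /* axis labels use the variable names */
    yaxis grid label="&yvar";
  run;
  title;                        /* clear the title so it does not carry over */
%mend myplot;

%myplot(BenchPress, Dash40);
%myplot(Weight, Height);
```

The macro variables `&xvar` and `&yvar` resolve inside double-quoted strings, which is why the title and labels use double rather than single quotes.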

2. (25 total points) “Stability testing” is performed by pharmaceutical companies to determine

the shelf life of drug products. Typically, part of a drug batch (like a number

of pills) is put into storage in a controlled temperature and humidity environment. At

regular time points, an item is taken out of storage and testing is performed on it. A

common response measured on each item is potency. Over time, the potency of a drug

will usually degrade, so the Food and Drug Administration (FDA) has set a 95% lower

limit of the desired potency level which the drug needs to remain above. The exact

time point where the drug goes below this limit is the shelf life. This shelf life (say, 4

months) then is added to the manufacturing date of a drug to find the expiration date,

which is what consumers often see printed on drug packaging.

The shelf life is found with the help of regression models. To show how this is done,

below is a simulated data set where the potency of a drug has been measured over

time in months. Suppose a single pill has been measured at each time point.

Time Potency

3 1.015545

6 0.9835495

9 0.9957994

12 0.9836627

15 0.986323

18 0.9945146

21 0.999571

24 0.9679062

30 0.9690051

36 0.9891509

48 0.9674187

60 0.9557498

For example, the pill taken out of storage at time 3 months had a potency of 101.55%

of the desired potency level. Using this data, complete the following problems.

(a) (5 points) Estimate and state the sample regression model with time as the explanatory

variable and potency as the response variable. Use proc reg to perform

the estimation and make sure that no plots are produced by the procedure. Interpret

the relationship between time and potency as given by the model.

(b) (5 points) Use proc reg again as in part (a), but include the plot with 95%

confidence interval bands for the expected potency. No other plots should be

included in the output! I recommend using the SAS help to determine the correct

coding specification.
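One possible way to enter the data and control which plots PROC REG produces is sketched below. The data set name `stability` is an assumption. The `plots=none` and `plots(only)=fitplot(nocli)` specifications are one reading of the requirements in parts (a) and (b) and should be checked against the SAS documentation.

```sas
/* Data copied from the table above; the data set name STABILITY is hypothetical. */
data stability;
  input Time Potency;
  datalines;
3 1.015545
6 0.9835495
9 0.9957994
12 0.9836627
15 0.986323
18 0.9945146
21 0.999571
24 0.9679062
30 0.9690051
36 0.9891509
48 0.9674187
60 0.9557498
;
run;

/* Fit the model with all ODS graphics suppressed */
proc reg data=stability plots=none;
  model Potency = Time;
run;
quit;

/* Request only the fit plot; NOCLI drops the prediction limits,
   leaving the 95% confidence band for the mean response */
ods graphics on;
proc reg data=stability plots(only)=fitplot(nocli);
  model Potency = Time;
run;
quit;
```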

(c) (15 points) The FDA has guidelines to determine the shelf life of a drug. The shelf

life is the time point where the 95% confidence interval lower band plot intersects

a horizontal line drawn at a 95% potency level. Mathematically, the shelf life

is the value of X that satisfies

$$\hat{Y} - t_{0.975,\,n-2}\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)} - 0.95 = 0 \qquad (1)$$

where $\hat{Y}$ is the estimated potency from the regression model at a time point $X$, $n$

is the sample size, $t_{0.975,\,n-2}$ is the 0.975 quantile of a t-distribution with $n-2$

degrees of freedom, and $\hat{\sigma}^2$ is the mean square error, defined as

$\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2/(n-2)$.

Solve the above equation for X to one decimal place of precision. There are

a number of ways to do this. You can use a simple grid search (evaluate the

left-hand side of the equation over a set of values of X and find the value of X

at which it reaches zero) or a bisection search.
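The grid-search idea can be sketched in PROC IML as follows. This is one illustrative approach, not the required one: the grid limits (0 to 120 months) and the step of 0.1 are assumptions, and the code takes the confidence-band standard error to be the square root of the bracketed variance term in equation (1).

```sas
proc iml;
  /* Time (months) and potency values copied from the table above */
  time    = {3, 6, 9, 12, 15, 18, 21, 24, 30, 36, 48, 60};
  potency = {1.015545, 0.9835495, 0.9957994, 0.9836627, 0.986323, 0.9945146,
             0.999571, 0.9679062, 0.9690051, 0.9891509, 0.9674187, 0.9557498};
  n = nrow(time);

  /* Least-squares fit of potency on time */
  D = j(n, 1, 1) || time;               /* design matrix: intercept and time */
  b = solve(D`*D, D`*potency);          /* b[1] = intercept, b[2] = slope */
  mse   = ssq(potency - D*b) / (n - 2); /* mean square error */
  xbar  = mean(time);
  sxx   = ssq(time - xbar);
  tcrit = quantile("T", 0.975, n - 2);

  /* Left-hand side of equation (1) on a grid of X values (limits are assumptions) */
  grid = T(do(0, 120, 0.1));
  lhs  = b[1] + b[2]*grid - tcrit*sqrt(mse*(1/n + (grid - xbar)##2/sxx)) - 0.95;

  /* First grid point at which the lower band is at or below 0.95 */
  idx = loc(lhs <= 0);
  if ncol(idx) > 0 then do;
    shelfLife = grid[idx[1]];
    print shelfLife[format=8.1];
  end;
  else print "No root found on this grid.";
quit;
```

A bisection search would replace the grid with repeated halving of an interval on which the left-hand side changes sign.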

3. (25 total points) Construct a macro function that uses PROC IML to do the following

(you must use IML for all calculations):

(a) (5 points) Simulate a data set with a list of variables (X1, . . . , X5). (Note: the variables

should be correlated.)

(b) (5 points) Calculate the correlation matrix. Find the pair of variables that have the

highest correlation and print the result:

VARIABLES XXX AND XXX HAVE THE HIGHEST CORRELATION COEFFICIENT

WHICH IS XXX.
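Parts (a) and (b) could be approached in PROC IML roughly as follows. This is a sketch under assumptions: the sample size (200), the seed, the zero mean vector, and the AR(1)-type covariance matrix are all arbitrary illustrative choices, and the macro wrapper required by the problem is omitted.

```sas
proc iml;
  call randseed(515);                  /* arbitrary seed for reproducibility */
  mu    = j(1, 5, 0);                  /* zero means (assumption) */
  Sigma = toeplitz(0.6##(0:4));        /* AR(1)-type covariance (assumption) */
  X     = randnormal(200, mu, Sigma);  /* 200 rows; columns play the role of X1-X5 */
  names = "X1":"X5";

  C = corr(X);                         /* 5 x 5 sample correlation matrix */

  /* Scan the upper triangle so each pair is considered once */
  best = -2;                           /* below any possible correlation */
  do i = 1 to 4;
    do k = i + 1 to 5;
      if C[i, k] > best then do;
        best = C[i, k];  bi = i;  bk = k;
      end;
    end;
  end;

  msg = "VARIABLES " + names[bi] + " AND " + names[bk] +
        " HAVE THE HIGHEST CORRELATION COEFFICIENT WHICH IS " +
        strip(putn(best, "8.4")) + ".";
  print msg;
quit;
```

The loop compares signed correlations; if "highest" is meant in absolute value, the comparison would use `abs(C[i, k])` instead.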

(c) (5 points) Make a histogram for each variable and a scatter plot of the variables

you identified in (b). Add a legend with the mean and standard deviation of each

variable.

(d) (10 points) Fit the regression model for the variables identified in (b). Choose either

variable as the dependent variable and the other as the independent variable.

Your output should include estimates of the intercept and slope, their standard

errors, t-ratios with p-values, and R-square.

4. (20 total points) Suppose that a driver’s license renewal test consists of ten

single-choice questions. Each question has five choices (A-E). Each day, we enter the

test results into a SAS data set shown below. Each observation contains a single

person’s answers (the data set is also available on Canvas).

ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10

291192 A C C B D E D B B A

593137 B C C E E D B A A

721311 A C C B D D E B B C

345221 B C C A D B B C A D

193920 A C C B E E D B B A

257672 B C C B D D D B B A

357899 C C C B E E E B B A

564332 A C C B E E D B B A

111033 A C B D D D B B A

445732 C C C C E E D B B B

824610 B B E B E E D B B A

774235 A C C B E E D B B A

943244 C C C B E E D B B A

647893 A C C B E E E B B A

432118 A C C B E E D B B A

The correct answers for the questions are shown below:

Question: 1 2 3 4 5 6 7 8 9 10

Answer: A C C B E E D B B A

Read the data and calculate the number of correct answers and the score for each person.

The first 5 questions are worth 1 point each and the rest are worth 2 points each. Also determine

whether each person passed or failed the test: a person passes with 7 or more correct

answers or a score of 10 or higher. You must use SAS arrays, and your program should

print a report showing the frequency and percent of passing and the average number of

correct answers and average score. Also provide a bar chart of the percent of correct

answers for each question. Provide a notification for each person in one of the following two forms:

ID : XXXXXXXXXX

Correct : x/xx

Score : xx/xx

Based on the test score, sorry to inform you that you couldn’t pass the test. Passing

score is 7 or more correct answers or 10 or higher raw score.

OR

ID : XXXXXXXXXX

Correct : x/xx

Score : xx/xx

Based on the test score, it is pleasure to inform you that you passed the test.
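The scoring rules above lend themselves to a pair of arrays, one for the responses and one for the answer key. The sketch below illustrates only that array logic on three complete rows taken from the table. The data set and variable names (`answers`, `scored`, `Correct`, `Score`, `Result`) are hypothetical. Reading the full data set, the bar chart, and the notifications are not shown.

```sas
/* Three complete rows from the table above, used only for illustration */
data answers;
  length ID $ 6 Q1-Q10 $ 1;
  input ID Q1-Q10;
  datalines;
291192 A C C B D E D B B A
721311 A C C B D D E B B C
345221 B C C A D B B C A D
;
run;

data scored;
  set answers;
  array q{10} Q1-Q10;                            /* responses */
  array key{10} $ 1 _temporary_
        ('A' 'C' 'C' 'B' 'E' 'E' 'D' 'B' 'B' 'A'); /* answer key */
  Correct = 0;
  Score   = 0;
  do i = 1 to 10;
    if q{i} = key{i} then do;
      Correct = Correct + 1;
      /* questions 1-5 are worth 1 point, questions 6-10 are worth 2 */
      Score = Score + ifn(i <= 5, 1, 2);
    end;
  end;
  length Result $ 4;
  if Correct >= 7 or Score >= 10 then Result = 'Pass';
  else Result = 'Fail';
  drop i;
run;

/* Frequency and percent of passing */
proc freq data=scored;
  tables Result;
run;

/* Average number of correct answers and average score */
proc means data=scored mean;
  var Correct Score;
run;
```

A temporary array keeps the answer key out of the output data set, so only the responses and the computed variables are stored.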
