STATS 785
1 of 9
THE UNIVERSITY OF AUCKLAND
SUMMER SCHOOL, 2020
Campus: City
STATISTICS
Statistical Programming and Modelling using SAS
(Time allowed: THREE hours)
NOTE: Attempt all questions. Show all working. For questions that require calculation the majority of
the marks are awarded for correct working. These marks will not be awarded if the working is not
shown, even if the answer is correct.
A separate Appendices Booklet contains SAS code, analyses and formulae.
Read this very carefully.
Total marks: 100.
1. [10 marks] PROC FREQ was applied to births data collected in the National Women’s Hospital
over 1968-1987, where frequencies were recorded with respect to the race of the mother and the
sex of the new-born baby. Some values have been suppressed, denoted by AAAAA, BBBBB,
CCCCC, and DDDDD.
These tables are shown in APPENDIX A on page 2 of the APPENDICES BOOKLET. Using
these outputs, answer the following:
(a) Compute AAAAA. [1 mark]
(b) Compute BBBBB. [1 mark]
(c) Compute CCCCC. [1 mark]
(d) Compute DDDDD. [1 mark]
(e) State the null hypothesis H0 and the alternative hypothesis H1 for the Chi-square test.
Define any notation you use. [1 mark]
(f) Is it valid to use the Chi-square test result? Explain. [1 mark]
(g) Write down the conclusion (in plain words) that you can draw from the Chi-square result.
[1 mark]
(h) Write a SAS program (including the code creating the SAS dataset for the 2 x 6 table) to
produce the above output. [3 marks]
2. [13 marks] One method of measuring genetic damage to an individual’s DNA is to calculate the mean
number of sister chromatid exchanges (MSCE) observed per cell. MSCE levels for 32 individuals were
classified by race.
The SAS code and output are shown in APPENDIX B on pages 3 to 5 of the APPENDICES
BOOKLET. Use this information to answer the following questions.
(a) Compute the missing value AAAAA. [1 mark]
(b) Compute the missing value BBBBB. [1 mark]
(c) Compute the missing value CCCCC. [1 mark]
(d) Compute the missing value DDDDD. [1 mark]
(e) Write down the null and alternative hypotheses for this model. [1 mark]
(f) Calculate the mean sister chromatid exchange level in each ethnic group. [1 mark]
(g) What do you conclude from the above output? [1 mark]
(h) Write a Methods and Assumptions check section for this data. [3 marks]
(i) Write an Executive Summary. [3 marks]
3. [12 marks] A randomised experiment was conducted to examine the density of Sydney rock
oysters under various conditions. The density of the oysters was measured for different patch sizes
and different predator control regimes. The SAS code and output for this question are shown in
APPENDIX C on pages 6 to 10 of the APPENDICES BOOKLET.
(a) Write a Methods and Assumption Checks section for the analysis of the Oyster data
model in APPENDIX C of the APPENDICES BOOKLET. [5 marks] Remember to
include the equation of the best model in the analysis.
(b) Write an Executive Summary for the analysis of the Oyster data in APPENDIX C of the
APPENDICES BOOKLET. [7 marks]
4. [12 marks] The following questions involve the analysis of the Baseball Data I detailed in
APPENDIX D on pages 11 to 17.
(a) Comment on the scatter plot of the data. [2 marks]
(b) Write a Methods and Assumption Checks section for the analysis of the Baseball Data I
in APPENDIX D on pages 11 to 17 of the APPENDICES BOOKLET. Remember to
include the equation of the best model in the analysis. [4 marks]
(c) Write an Executive Summary for the analysis of the Baseball Data I in APPENDIX D of
the APPENDICES BOOKLET. [5 marks]
(d) Use your model to estimate the predicted salary for a baseball player with 10 years in the
major league (Show all working). Interpret the prediction interval. How useful do you think
this model will be for prediction? Give two reasons justifying your answer. (You do not
need to include this prediction in the Executive Summary.) [1 mark]
5. [15 marks] The following questions involve the analysis of the Baseball Data II detailed in
APPENDIX E on pages 18 to 27.
(a) Comment on the correlations and pair plots of the data. [2 marks]
(b) Write a Methods and Assumption Checks section for the analysis of the Baseball Data II
in APPENDIX E. Remember to include the equation of the best model in the analysis. [5
marks]
(c) Write an Executive Summary for the analysis of the Baseball Data II in APPENDIX E.
[5 marks]
(d) Use your model to estimate the predicted salary for a baseball player with the following
values. (Show all working).
nRBI = 50
nBB = 50
YrMajor = 10
CrHits = 750
(You do not need to include this prediction in the Executive Summary.) [3 marks]
6. [13 marks] The space shuttle solid-fuel rockets have a total of 6 O-rings. It was suspected that
O-ring reliability was influenced by temperature.
The SAS code and output for this question are shown in APPENDIX F on pages 28 to 31 of the
APPENDICES BOOKLET.
We are interested in the proportion (or number) of O-rings that failed and the temperature during
the launch. We wish to quantify the influence of temperature on the probability of
having at least one incident related to the O-rings. Specifically, we want to address the following
questions:
Q1. Is the temperature associated with O-ring incidents?
Q2. In what way did the temperature affect the probability of O-ring incidents?
Q3. What was the predicted probability of an incident in an O-ring at the temperature on the
launch day?
(a) Write a Methods and Assumption Checks section for the analysis of the Challenger data
in APPENDIX F of the APPENDICES BOOKLET. [5 marks]
Remember to include the equation of the model used in the analysis.
(b) Calculate the predicted probability of an incident in an O-ring at -0.6°C (the temperature on
the launch day). How many distressed O-rings does this correspond to? Include all
working for both parts of this question. [2 marks]
(c) Give an Executive Summary for the analysis of the Challenger data in APPENDIX F of
the APPENDICES BOOKLET. [6 marks]
7. [10 marks] The following data are counts of snapper recorded by a baited underwater video camera
in 1998. There are 55 counts, of which 41 were taken inside the Leigh Marine Reserve and 14 were
taken in areas immediately next to the reserve.
Inside reserve:
0 0 4 0 0 3 0 0 0 0 3 8 1 5
4 1 5 2 1 11 12 4 4 3 6 9 6 6
2 2 5 0 7 2 8 2 2 1 9 1 1
Outside reserve:
0 0 0 0 0 0 1 0 1 0 3 2 2 5
It is desired to make inference about the ratio of snapper density inside the reserve versus outside,
that is $\mu_r / \mu_{nr}$, where $\mu_r$ is the mean density of snapper in the reserve and $\mu_{nr}$ is the mean density
of snapper outside the reserve. This ratio will be estimated using $T(X) = \bar{x}_r / \bar{x}_{nr}$, where $\bar{x}_r$ is the
sample mean of the counts inside the reserve and $\bar{x}_{nr}$ is the sample mean of the counts outside the
reserve. A SAS program to calculate this is:
DATA parta(KEEP=tx);
  *Reserve counts;
  ARRAY r {41} (0 0 4 0 0 3 0 0 0 0 3 8 1 5
                4 1 5 2 1 11 12 4 4 3 6 9 6 6
                2 2 5 0 7 2 8 2 2 1 9 1 1);
  *Non-reserve counts;
  ARRAY nr {14} (0 0 0 0 0 0 1 0 1 0 3 2 2 5);
  *Observed ratio of the two sample means;
  tx = MEAN(OF r1-r41)/MEAN(OF nr1-nr14);
RUN;
PROC PRINT NOOBS;
  TITLE 'Observed value of T(X)';
RUN;
DATA snapper(KEEP=txstar);
  *Reserve counts;
  ARRAY r {41} (0 0 4 0 0 3 0 0 0 0 3 8 1 5
                4 1 5 2 1 11 12 4 4 3 6 9 6 6
                2 2 5 0 7 2 8 2 2 1 9 1 1);
  *Non-reserve counts;
  ARRAY nr {14} (0 0 0 0 0 0 1 0 1 0 3 2 2 5);
  ARRAY rstar {41};
  ARRAY nrstar {14};
  DO iter = 1 TO 100000;
    *Resample each group with replacement;
    DO i = 1 TO 41;
      rstar[i] = r[ ROUND(0.5+41*RANUNI(0)) ];
    END;
    DO i = 1 TO 14;
      nrstar[i] = nr[ ROUND(0.5+14*RANUNI(0)) ];
    END;
    txstar = MEAN(OF rstar1-rstar41)/MEAN(OF nrstar1-nrstar14);
    *A zero non-reserve mean gives a missing ratio, recode it as a large value;
    IF txstar = . THEN txstar = 999999;
    OUTPUT;
  END;
RUN;
PROC UNIVARIATE NOPRINT;
  OUTPUT OUT=bootsum N=numsamps PCTLPRE=pctl PCTLPTS=2.5 97.5;
  TITLE '95% percentile confidence interval';
RUN;
PROC PRINT NOOBS LABEL;
  LABEL numsamps = "Number of resamples";
RUN;
Which produces:

Observed value of T(X)
     tx
3.41463

95% percentile confidence interval
Number of    the 2.5000            the 97.5000
resamples    percentile, txstar    percentile, txstar
   100000    1.70732               11.0634
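(a) What is the observed value of T(X)? [1 mark]
(b) Provide an approximate 95% CI for the ratio of mean snapper density inside versus outside the
reserve, $\mu_r / \mu_{nr}$, using the bootstrap percentile method. [1 mark]
(c) Provide an approximate 95% CI for $\mu_r / \mu_{nr}$ using the bootstrap basic method applied to:
$T_2(X) = \log(\bar{x}_r / \bar{x}_{nr})$ as an estimator of $\log(\mu_r / \mu_{nr})$. [8 marks]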
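The basic method on the log scale can be sketched in SAS as below. This is an illustration under stated assumptions, not a model answer: it assumes the data set snapper (variable txstar) created by the program above is still available, it takes the observed value 3.41463 from the printed output, and the names logboot, logtx, logpct, q, basic, t2, lower and upper are hypothetical. The basic interval reflects the bootstrap percentiles about the observed statistic, giving (2*T2 - q97.5, 2*T2 - q2.5) on the log scale, which is then back-transformed with EXP. A resample with all reserve counts equal to zero would give LOG(0), a missing value; this is extremely unlikely here.

```sas
/* Sketch only: assumes WORK.snapper (variable txstar) from the program above. */
/* All other data set and variable names are hypothetical.                     */
DATA logboot;
  SET snapper;
  logtx = LOG(txstar);          /* bootstrap replicates of T2(X) */
RUN;

PROC UNIVARIATE DATA=logboot NOPRINT;
  VAR logtx;
  OUTPUT OUT=logpct PCTLPRE=q PCTLPTS=2.5 97.5;   /* creates q2_5 and q97_5 */
RUN;

DATA basic;
  SET logpct;
  t2 = LOG(3.41463);            /* observed T2(X), from the printed value of tx */
  lower_log = 2*t2 - q97_5;     /* basic interval on the log scale */
  upper_log = 2*t2 - q2_5;
  lower = EXP(lower_log);       /* back-transform to the ratio scale */
  upper = EXP(upper_log);
RUN;

PROC PRINT DATA=basic NOOBS;
  VAR lower_log upper_log lower upper;
RUN;
```

Because LOG is monotone, the percentiles of logtx are simply the logs of the txstar percentiles, so the same interval can also be obtained by hand from the printed 2.5 and 97.5 percentiles.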
8. [5 marks] Suppose, in the context of testing for a difference between two groups using the
Wilcoxon rank-sum test, that the two samples have five observations each, and the rank-sum
statistic is S = 17.
(a) What is the two-sided p-value for testing the null hypothesis of no difference between the
two groups? [3 marks]
(b) Carry out the normal approximation for the Wilcoxon rank-sum test. Tables of the Normal
Distribution are given in APPENDIX G on Page 32. [2 marks]
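For reference, the usual normal approximation for the rank-sum statistic is sketched below. The symbols are assumptions for illustration: $S$ is the sum of the ranks in the sample of size $n_1$, the other sample has size $n_2$, and there are no ties. Some texts also apply a continuity correction of 0.5 to $S$, which is not shown here.

```latex
\[
E(S) = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad
\operatorname{Var}(S) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}, \qquad
z = \frac{S - E(S)}{\sqrt{\operatorname{Var}(S)}}
\]
% With two samples of five observations each:
\[
n_1 = n_2 = 5: \qquad
E(S) = \frac{5 \times 11}{2} = 27.5, \qquad
\operatorname{Var}(S) = \frac{5 \times 5 \times 11}{12} \approx 22.92
\]
```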
9. [10 marks] Twelve subjects were classified into a 2 by 2 contingency table, giving
(a) What are the expected counts under the hypothesis of no association between treatment
and outcome? [2 marks]
(b) List all other possible tables that are consistent with the observed row and column totals.
[2 marks]
(c) For the observed table and each table listed above, calculate the number of ways of
assigning the 12 subjects that will result in that table. [4 marks]
(d) What is Fisher’s exact p-value for the hypothesis of no association between treatment and
outcome? [2 marks]
Some factorials:
0! = 1         10! = 3628800             20! = 2.4329020 × 10^18
1! = 1         11! = 39916800            21! = 5.1090942 × 10^19
2! = 2         12! = 4.7900160 × 10^8    22! = 1.1240007 × 10^21
3! = 6         13! = 6.2270208 × 10^9    23! = 2.5852017 × 10^22
4! = 24        14! = 8.7178291 × 10^10   24! = 6.2044840 × 10^23
5! = 120       15! = 1.3076744 × 10^12   25! = 1.5511210 × 10^25
6! = 720       16! = 2.0922790 × 10^13   26! = 4.0329146 × 10^26
7! = 5040      17! = 3.5568743 × 10^14   27! = 1.0888869 × 10^28
8! = 40320     18! = 6.4023737 × 10^15   28! = 3.0488834 × 10^29
9! = 362880    19! = 1.2164510 × 10^17   29! = 8.8417620 × 10^30
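The factorials enter through the hypergeometric counting that underlies Fisher's exact test. The sketch below uses hypothetical notation, since the observed table is not reproduced here: cells $a, b$ (first row) and $c, d$ (second row), row totals $R_1 = a + b$ and $R_2 = c + d$, column totals $C_1 = a + c$ and $C_2 = b + d$, and $n = a + b + c + d$.

```latex
% Number of ways to choose the R_1 subjects in row 1 so that the table (a, b; c, d) results,
% holding each subject's column fixed:
\[
\#\text{ways} = \binom{C_1}{a} \binom{C_2}{b}
\]
% Dividing by the total number of ways to choose R_1 of the n subjects gives the
% probability of the table under no association:
\[
P(a, b, c, d) = \frac{\binom{C_1}{a} \binom{C_2}{b}}{\binom{n}{R_1}}
             = \frac{R_1!\, R_2!\, C_1!\, C_2!}{n!\, a!\, b!\, c!\, d!}
\]
```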
APPENDIX STATS 785
Page 1 of 33
THE UNIVERSITY OF AUCKLAND
SUMMER SCHOOL, 2020
Campus: City
STATISTICS
Statistical Programming and Modelling using SAS
APPENDICES BOOKLET
CONTENTS
APPENDIX A: Race of mother and sex of baby page 2
APPENDIX B: Genetic Damage data pages 3 to 5
APPENDIX C: Oyster data pages 6 to 10
APPENDIX D: Baseball data I pages 11 to 17
APPENDIX E: Baseball data II pages 18 to 27
APPENDIX F: Space Challenger data pages 28 to 31
APPENDIX G: Standard Normal Distribution page 32
APPENDIX H: Formulae page 33
Appendix A: Race of mother and sex of baby
PROC FREQ was applied to births data collected in the National Women’s
Hospital over 1968-1987, where frequencies were recorded with respect to the race
of the mother and the sex of the newborn baby. Some values have been suppressed,
denoted by AAAAA, BBBBB, CCCCC, and DDDDD.
The variables are:
Race Ethnic group of mother.
sex Gender (Male or Female).
Table of sex by race
sex                race
                   Chinese   European   Indian   Maori    Other    PacificIsland   Total
Females
  Frequency        AAAAA     30848      531      6674     361      5521            44384
  Expected         430.81    BBBBB      494.82   6644.5   361.01   5557.1
  Cell Chi-Square  0.7685    0.0737     CCCCC    0.1309   271E-9   0.2351
Males
  Frequency        446       33338      497      7130     389      6024            47824
  Expected         464.19    33290      533.18   7159.5   388.99   5987.9
  Cell Chi-Square  0.7132    0.0684     2.4545   0.1215   251E-9   0.2182
Total
  Frequency        895       64186      1028     13804    750      11545           92208
Statistic DF Value Prob
Chi-Square 5 DDDDD 0.1907
Likelihood Ratio Chi-Square 5 7.4229 0.1910
Mantel-Haenszel Chi-Square 1 0.0987 0.7534
Phi Coefficient 0.0090
Contingency Coefficient 0.0090
Cramer’s V 0.0090
Appendix B. Genetic Damage data
One method of measuring genetic damage to an individual’s DNA is to calculate
the mean number of sister chromatid exchanges (MSCE) observed per cell.
MSCE levels for 32 individuals were classified by race.
The variables measured were:
Race Asian, Black, Caucasian and Native American
MSCE the mean number of sister chromatid exchanges per cell.
The SAS code and output follow:
PROC FORMAT;
  VALUE racefmt 1 = 'Black'
                2 = 'Caucasian'
                3 = 'Native American'
                4 = 'Asian';
RUN;
PROC GLM PLOTS=DIAGNOSTICS(UNPACK);
  CLASS race;
  MODEL MSCE = race / SOLUTION;
  LSMEANS race / STDERR PDIFF;
  FORMAT race racefmt.;
RUN;
The GLM Procedure
Class Level Information
Class Levels Values
race 4 Asian Black Caucasian Native American
Number of Observations Read 32
Number of Observations Used 32
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   2.13604013       AAAAA         CCCCC     0.1801
Error             28   11.40830675      BBBBB
Corrected Total   31   DDDDD
R-Square Coeff Var Root MSE MSCE Mean
0.157707 7.591866 0.638310 8.407813
Source DF Type I SS Mean Square F Value Pr > F
race 3 2.13604013 0.71201338 1.75 0.1801
Source DF Type III SS Mean Square F Value Pr > F
race 3 2.13604013 0.71201338 1.75 0.1801
Parameter              Estimate         Standard Error   t Value   Pr > |t|
Intercept              8.572857143 B    0.24125846       35.53     <.0001
race Asian             0.185892857 B    0.33035676       0.56      0.5781
race Black             -0.349107143 B   0.33035676       -1.06     0.2997
race Caucasian         -0.441746032 B   0.32167795       -1.37     0.1806
race Native American   0.000000000 B    .                .         .
Note: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates
are followed by the letter 'B' are not uniquely estimable.
race              MSCE LSMEAN        Standard Error   Pr > |t|   LSMEAN Number
Asian             Answered in 2(f)   0.22567663       <.0001     1
Black             Answered in 2(f)   0.22567663       <.0001     2
Caucasian         Answered in 2(f)   0.21276997       <.0001     3
Native American   Answered in 2(f)   0.24125846       <.0001     4

Least Squares Means for effect race
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: MSCE
i/j   1        2        3        4
1              0.1048   0.0527   0.5781
2     0.1048            0.7674   0.2997
3     0.0527   0.7674            0.1806
4     0.5781   0.2997   0.1806
Appendix C. Oyster data
A randomised experiment was conducted to examine the density of Sydney rock oysters
under various conditions. The density of the oysters was measured for different patch
sizes and different predator control regimes.
The variables measured were:
Oysters: the number of oysters per 10 × 10 cm area
Size: the size of the patches: small, medium or large
Cage: the different predator control regimes: full, partial or open.
The SAS code and output follow:
PROC BOXPLOT;
  PLOT oysters*cell;
RUN;
In the box plot, the cell labels are: fulllar = full large, partlar = partial large,
openlar = open large, fullmed = full medium, partmed = partial medium,
openmed = open medium, fullsma = full small, partsma = partial small,
opensma = open small.
PROC GLM PLOTS=ALL;
CLASS cage size;
MODEL oysters=cage|size;
RUN;
The GLM Procedure
Dependent Variable: oysters
Class Level Information
Class Levels Values
cage 3 full open part
size 3 lar med sma
Number of Observations Read 54
Number of Observations Used 54
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              8   2224.634259      278.079282    9.75      <.0001
Error             45   1283.010417      28.511343
Corrected Total   53   3507.644676

R-Square   Coeff Var   Root MSE   oysters Mean
0.634225   36.41787    5.339601   14.66204

Source      DF   Type I SS     Mean Square   F Value   Pr > F
cage         2   1986.516204   993.258102    34.84     <.0001
size         2   14.113426     7.056713      0.25      0.7818
cage*size    4   224.004630    56.001157     1.96      0.1163

PROC GLM PLOTS=ALL;
  CLASS cage size;
  MODEL oysters = cage size;
  LSMEANS cage size / ADJUST=TUKEY PDIFF STDERR CL;
RUN;

The GLM Procedure
Dependent Variable: oysters
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4   2000.629630      500.157407    16.26     <.0001
Error             49   1507.015046      30.755409
Corrected Total   53   3507.644676

R-Square   Coeff Var   Root MSE   oysters Mean
0.570363   37.82391    5.545756   14.66204

Source   DF   Type I SS     Mean Square   F Value   Pr > F
cage      2   1986.516204   993.258102    32.30     <.0001
size      2   14.113426     7.056713      0.23      0.7958

Source   DF   Type III SS   Mean Square   F Value   Pr > F
cage      2   1986.516204   993.258102    32.30     <.0001
size      2   14.113426     7.056713      0.23      0.7958
The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Tukey

cage   oysters LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
full   22.9722222       1.3071472        <.0001     1
open   8.6666667        1.3071472        <.0001     2
part   12.3472222       1.3071472        <.0001     3

Least Squares Means for effect cage
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: oysters
i/j   1        2        3
1              <.0001   <.0001
2     <.0001            0.1252
3     <.0001   0.1252

cage   oysters LSMEAN   95% Confidence Limits
full   22.972222        20.345412   25.599033
open   8.666667         6.039856    11.293477
part   12.347222        9.720412    14.974033

Least Squares Means for Effect cage
i   j   Difference Between Means   Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1   2   14.305556                  9.837738    18.773373
1   3   10.625000                  6.157182    15.092818
2   3   -3.680556                  -8.148373   0.787262

APPENDIX D: Baseball Salaries data I
The data set contains salary and performance information for Major League Baseball players who
played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries are
for the 1987 season and the performance measures are from 1986.
The variables are:
Salary: 1987 Salary in $ Thousands
YrMajor Years in the Major league
The aim of this question is to determine whether the number of years in the major league has any
influence on the salary of the baseball player.

PROC LOESS PLOTS DATA=ass2;
  MODEL Salary = YrMajor / SMOOTH=0.67 CLM;
RUN;
PROC REG DATA=exam PLOTS=DIAGNOSTICS(UNPACK);
  MODEL Salary = yrmajor;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: Salary 1987 Salary in $ Thousands
Number of Observations Read                  322
Number of Observations Used                  263
Number of Observations with Missing Values    59

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               1   10458197         10458197      63.68     <.0001
Error             261   42860916         164218
Corrected Total   262   53319113

Root MSE         405.23828   R-Square   0.1961
Dependent Mean   535.92588   Adj R-Sq   0.1931
Coeff Var        75.61461

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                     1   227.65284            46.00686         4.95      <.0001
YrMajor     Years in the Major Leagues    1   41.70566             5.22609          7.98      <.0001

DATA exam;
  SET exam;
  LogSalary = LOG(Salary);
RUN;
PROC LOESS PLOTS DATA=ass2a;
  MODEL LogSalary = YrMajor / SMOOTH=0.67 CLM;
RUN;
PROC REG DATA=exam PLOTS=DIAGNOSTICS(UNPACK);
  MODEL LogSalary = yrmajor;
  OUTPUT OUT=resid R=residual P=predicted;
RUN;
PROC LOESS PLOTS DATA=resid;
  MODEL residual = predicted / SMOOTH=0.67 CLM;
RUN;

DATA exam;
  SET exam;
  YrMajor2 = YrMajor**2;
RUN;
PROC REG DATA=ass2a PLOTS(LABEL)=DIAGNOSTICS(UNPACK);
  ID name team league;
  MODEL LogSalary = YrMajor YrMajor2;
  OUTPUT OUT=resid R=residual P=predicted;
RUN;
PROC LOESS PLOTS DATA=resid;
  MODEL residual = predicted / SMOOTH=0.67 CLM;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: logSalary Log Salary

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               2   106.41383        53.20692      137.32    <.0001
Error             260   100.73990        0.38746
Corrected Total   262   207.15373

Root MSE         0.62246    R-Square   0.5137
Dependent Mean   5.92722    Adj R-Sq   0.5100
Coeff Var        10.50178

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                     1   4.24333              0.11368          37.33     <.0001
YrMajor     Years in the Major Leagues    1   0.38788              0.02885          13.44     <.0001
YrMajor2                                  1   -0.01527             0.00149          -10.22    <.0001

PROC REG DATA=exam PLOTS=(DIAGNOSTICS(UNPACK) PREDICTIONS(X=YrMajor));
  MODEL LogSalary = YrMajor YrMajor2;
  WHERE name^="Rose, Pete";
RUN;
DATA add;
  INPUT YrMajor YrMajor2 name $;
  CARDS;
10 100 Pred
;
RUN;
DATA both;
  SET exam add;
RUN;

[Figure: Predictions and Residuals for logSalary. Log Salary and Residual plotted against Years in the Major Leagues, showing the fit, 95% confidence limits and 95% prediction limits.]

PROC REG DATA=both;
  MODEL LogSalary = YrMajor YrMajor2;
  OUTPUT OUT=resid R=residual P=predicted ucl=CLUpper lcl = CLLower;
  WHERE name^="Rose, Pete";
RUN;
DATA resid2;
  SET resid;
  IF name = 'Pred';
  PredSal = EXP(Predicted);
  lcl=EXP(CLLower);
  ucl = EXP(CLUpper);
RUN;
PROC PRINT DATA=resid2;
  VAR YrMajor YrMajor2 Predicted CLLower CLUpper PredSal lcl ucl;
  WHERE name = 'Pred';
RUN;

Obs   YrMajor   YrMajor2   predicted   CLLower   CLUpper   PredSal   lcl       ucl
1     10        100        6.63913     5.43438   7.84389   764.432   229.150   2550.11

APPENDIX E: Baseball Salaries Data II
The data set contains salary and performance information for Major League Baseball players who
played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries are
for the 1987 season and the performance measures are from 1986.
The variables are:
Salary: 1987 Salary in $ Thousands
YrMajor Years in the Major league
nhits Hits in 1986
nruns Runs in 1986
nrbi RBIs in 1986 (Runs Batted In)
nbb Walks in 1986
crhits Career hits
The aim of this question is to determine which variables have any influence on the salary of the
baseball player.
PROC CORR DATA=ass2;
  VAR salary nhits nruns nrbi nbb yrmajor crhits;
RUN;
PROC SGSCATTER DATA=ass2;
  MATRIX Salary nhits nruns nrbi nbb yrmajor crhits / DIAGONAL=(HISTOGRAM);
RUN;

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0
                             Salary    nHits     nRuns     nRBI      nBB       YrMajor   CrHits
Salary                       1.00000   0.50136   0.47903   0.51723   0.50462   0.44288   0.59221
1987 Salary in $ Thousands             <.0001    <.0001    <.0001    <.0001    <.0001    <.0001
nHits                        0.50136   1.00000   0.90571   0.77829   0.57116   -0.00097  0.22217
Hits in 1986                 <.0001              <.0001    <.0001    <.0001    0.9875    0.0003
nRuns                        0.47903   0.90571   1.00000   0.77059   0.68775   -0.02846  0.17966
Runs in 1986                 <.0001    <.0001              <.0001    <.0001    0.6460    0.0035
nRBI                         0.51723   0.77829   0.77059   1.00000   0.56181   0.12460   0.29005
RBIs in 1986                 <.0001    <.0001    <.0001              <.0001    0.0435    <.0001
nBB                          0.50462   0.57116   0.68775   0.56181   1.00000   0.12735   0.26615
Walks in 1986                <.0001    <.0001    <.0001    <.0001              0.0390    <.0001
YrMajor                      0.44288   -0.00097  -0.02846  0.12460   0.12735   1.00000   0.89803
Years in the Major Leagues   <.0001    0.9875    0.6460    0.0435    0.0390              <.0001
CrHits                       0.59221   0.22217   0.17966   0.29005   0.26615   0.89803   1.00000
Career Hits                  <.0001    0.0003    0.0035    <.0001    <.0001    <.0001
ODS GRAPHICS ON;
PROC REG DATA=ass2a PLOTS(LABEL)=ALL;
MODEL logsalary = nhits nruns nrbi nbb yrmajor crhits;
ID name team league;
RUN;
ODS GRAPHICS OFF;
[Figure: scatter plot matrix with histograms on the diagonal. Variables: 1987 Salary, Hits in 1986, Runs in 1986, RBIs in 1986, Walks in 1986, Years in the Major Leagues, Career Hits.]
DATA ass2a;
SET ass2(WHERE=(name^="Rose, Pete"));
yrMajor2 = YrMajor**2;
CrHits2 = CrHits**2;
RUN;
ODS GRAPHICS ON;
PROC REG DATA=ass2a PLOTS(LABEL)=(ALL RESIDUALS(SMOOTH));
MODEL logsalary = nhits nruns nrbi nbb yrmajor yrmajor2 crhits
crhits2;
ID name;
RUN;
ODS GRAPHICS OFF;
Root MSE 0.41673 R-Square 0.7874
Dependent Mean 5.92458 Adj R-Sq 0.7807
Coeff Var 7.03393
Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                     1   3.78922              0.11581          32.72     <.0001
nHits       Hits in 1986                  1   -0.00012753          0.00159          -0.08     0.9363
nRuns       Runs in 1986                  1   0.00215              0.00288          0.75      0.4549
nRBI        RBIs in 1986                  1   0.00431              0.00172          2.51      0.0127
nBB         Walks in 1986                 1   0.00501              0.00173          2.90      0.0040
YrMajor     Years in the Major Leagues    1   0.23908              0.03443          6.94      <.0001
yrMajor2                                  1   -0.01440             0.00165          -8.73     <.0001
CrHits      Career Hits                   1   0.00170              0.00027562       6.18      <.0001
CrHits2                                   1   -3.31739E-7          1.001272E-7      -3.31     0.0011

[Figure: Fit Diagnostics for logSalary. Panels: Residual and RStudent against Predicted Value, RStudent against Leverage, Residual against Quantile, Log Salary against Predicted Value, Cook's D against Observation, histogram of Residual (Percent), and Fit–Mean and Residual against Proportion Less. Panel statistics: Observations 262, Parameters 3, Error DF 259, MSE 0.3709, R-Square 0.5352, Adj R-Square 0.5316.]

PROC REG DATA=ass2a;
  MODEL logsalary = nruns nrbi nbb yrmajor yrmajor2 crhits crhits2;
RUN;

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                     1   3.78600              0.10840          34.93     <.0001
nRuns       Runs in 1986                  1   0.00199              0.00205          0.97      0.3325
nRBI        RBIs in 1986                  1   0.00427              0.00163          2.62      0.0092
nBB         Walks in 1986                 1   0.00504              0.00168          3.00      0.0030
YrMajor     Years in the Major Leagues    1   0.23940              0.03413          7.01      <.0001
yrMajor2                                  1   -0.01440             0.00165          -8.74     <.0001
CrHits      Career Hits                   1   0.00170              0.00026220       6.47      <.0001
CrHits2                                   1   -3.30148E-7          9.794245E-8      -3.37     0.0009

PROC REG DATA=ass2a;
  MODEL logsalary = nrbi nbb yrmajor yrmajor2 crhits crhits2;
RUN;

Root MSE         0.41587   R-Square   0.7866
Dependent Mean   5.92458   Adj R-Sq   0.7816
Coeff Var        7.01937

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                     1   3.82206              0.10183          37.53     <.0001
nRBI        RBIs in 1986                  1   0.00522              0.00130          4.03      <.0001
nBB         Walks in 1986                 1   0.00582              0.00148          3.92      0.0001
YrMajor     Years in the Major Leagues    1   0.23188              0.03324          6.98      <.0001
yrMajor2                                  1   -0.01431             0.00164          -8.70     <.0001
CrHits      Career Hits                   1   0.00178              0.00024700       7.21      <.0001
CrHits2                                   1   -3.53159E-7          9.50212E-8       -3.72     0.0002
PROC REG DATA=ass2a;
  MODEL logsalary = nrbi nbb yrmajor crhits / VIF;
RUN;

Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   Intercept                     1   4.54564              0.12004          37.87     <.0001     0
nRBI        RBIs in 1986                  1   0.00748              0.00184          4.07      <.0001     1.64693
nBB         Walks in 1986                 1   0.00869              0.00211          4.12      <.0001     1.51646
YrMajor     Years in the Major Leagues    1   0.02663              0.01944          1.37      0.1720     6.11568
CrHits      Career Hits                   1   0.00059538           0.00015734       3.78      0.0002     6.83256

%LET ind = nrbi nbb yrmajor yrmajor2 crhits crhits2;
PROC REG DATA=ass2a PLOTS=NONE;
  MODEL logsalary = &ind;
  OUTPUT OUT=pred P=p;
RUN;
%marginal(dependent=logSalary, predicted=p, independents=&ind)

[Figure: Marginal Models for logSalary. Legend: Model, Data. Axis label: Predicted Values. Panels: RBIs in 1986, Walks in 1986, Years in the Major Leagues, yrMajor2, Career Hits, CrHits2.]

DATA ADD;
  INPUT nrbi nbb yrmajor crhits;
  yrmajor2 = yrmajor**2;
  crhits2 = crhits**2;
  CARDS;
50 50 10 750
;
RUN;
DATA both;
  SET ass2a add;
RUN;
PROC REG DATA=both;
  MODEL logsalary = nrbi nbb yrmajor yrmajor2 crhits crhits2;
  OUTPUT OUT=resid R=residual P=predicted ucl=CLUpper lcl=CLLower;
RUN;
DATA resid1;
  SET resid;
  est= EXP(predicted);
  lcl = EXP(CLLower);
  ucl = EXP(CLUpper);
RUN;
PROC PRINT;
  VAR nrbi nbb yrmajor crhits Salary est lcl ucl;
  WHERE nhits = .;
RUN;

Obs   nRBI   nBB   YrMajor   CrHits   Salary   est       lcl       ucl
322   50     50    10        750      .        601.539   263.690   1372.25

Appendix F: Challenger Data
The space shuttle solid-fuel rockets have a total of 6 O-rings. It was suspected that O-ring
reliability was influenced by temperature.
The variables of interest were:
• fail_field: the number of O-rings that failed.
• total: the total number of O-rings (= 6 here).
The SAS code and output follow:
SYMBOL1 VALUE=CIRCLE;
PROC GPLOT;
  PLOT propn*temp;
RUN;
PROC GENMOD DATA=challenger PLOTS=RESDEV;
  MODEL fail_field/total = temp / DIST=B;
RUN;

The GENMOD Procedure
Model Information
Data Set                     WORK.CHALLENGER
Distribution                 Binomial
Link Function                Logit
Response Variable (Events)   fail_field
Response Variable (Trials)   total

Number of Observations Read   23
Number of Observations Used   23
Number of Events              10
Number of Trials              138

Response Profile
Ordered Value   Binary Outcome   Total Frequency
1               Event            10
2               Nonevent         128

Criteria For Assessing Goodness Of Fit
Criterion                  DF   Value      Value/DF
Deviance                   21   19.1362    0.9112
Scaled Deviance            21   19.1362    0.9112
Pearson Chi-Square         21   34.3574    1.6361
Scaled Pearson X2          21   34.3574    1.6361
Log Likelihood                  -31.0629
Full Log Likelihood             -16.4003
AIC (smaller is better)         36.8007
AICC (smaller is better)        37.4007
BIC (smaller is better)         39.0717
Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept    1   2.2782     1.5160           -0.6931   5.2494             2.26              0.1329
temp         1   -0.2514    0.0838           -0.4157   -0.0871            8.99              0.0027
Scale        0   1.0000     0.0000           1.0000    1.0000
DATA temp;
p = 1 - PROBCHI(19.1632, 21);
RUN;
PROC PRINT;
RUN;
Obs p
1 0.57467
DATA add;
INPUT temp;
CARDS;
-0.6
;
DATA both;
SET challenger add;
RUN;
PROC GENMOD DATA=both;
MODEL fail_field/total = temp / DIST=B;
OUTPUT OUT=pred P = pred lower=lower upper=upper;
RUN;
PROC PRINT DATA=pred;
VAR pred lower upper;
WHERE propn = .;
RUN;
Obs pred lower upper
24 0.91901 0.34562 0.99592
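The value of pred can be checked by hand from the parameter estimates using the inverse logit. The DATA step below is a minimal sketch under stated assumptions: it uses the rounded estimates 2.2782 and -0.2514 from the Analysis Of Maximum Likelihood Parameter Estimates table, and the data set name check and the variable names b0, b1, eta, p and expected_failed are hypothetical. Because the estimates are rounded, p should agree with pred only to about three or four decimal places.

```sas
/* Hand check of the GENMOD prediction; a sketch, not part of the original output. */
/* The data set name check and all variable names are hypothetical.                */
DATA check;
  b0 = 2.2782;                       /* Intercept estimate */
  b1 = -0.2514;                      /* temp estimate */
  temp = -0.6;                       /* launch-day temperature */
  eta = b0 + b1*temp;                /* linear predictor on the logit scale */
  p = EXP(eta) / (1 + EXP(eta));     /* inverse logit gives the probability */
  expected_failed = 6 * p;           /* expected number out of the 6 O-rings */
RUN;

PROC PRINT DATA=check NOOBS;
RUN;
```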
STANDARD NORMAL DISTRIBUTION: Table Values Represent AREA to the LEFT of the Z score.
Appendix G: Standard Normal Distribution
Z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
-3.9 .00005 .00005 .00004 .00004 .00004 .00004 .00004 .00004 .00003 .00003
-3.8 .00007 .00007 .00007 .00006 .00006 .00006 .00006 .00005 .00005 .00005
-3.7 .00011 .00010 .00010 .00010 .00009 .00009 .00008 .00008 .00008 .00008
-3.6 .00016 .00015 .00015 .00014 .00014 .00013 .00013 .00012 .00012 .00011
-3.5 .00023 .00022 .00022 .00021 .00020 .00019 .00019 .00018 .00017 .00017
-3.4 .00034 .00032 .00031 .00030 .00029 .00028 .00027 .00026 .00025 .00024
-3.3 .00048 .00047 .00045 .00043 .00042 .00040 .00039 .00038 .00036 .00035
-3.2 .00069 .00066 .00064 .00062 .00060 .00058 .00056 .00054 .00052 .00050
-3.1 .00097 .00094 .00090 .00087 .00084 .00082 .00079 .00076 .00074 .00071
-3.0 .00135 .00131 .00126 .00122 .00118 .00114 .00111 .00107 .00104 .00100
-2.9 .00187 .00181 .00175 .00169 .00164 .00159 .00154 .00149 .00144 .00139
-2.8 .00256 .00248 .00240 .00233 .00226 .00219 .00212 .00205 .00199 .00193
-2.7 .00347 .00336 .00326 .00317 .00307 .00298 .00289 .00280 .00272 .00264
-2.6 .00466 .00453 .00440 .00427 .00415 .00402 .00391 .00379 .00368 .00357
-2.5 .00621 .00604 .00587 .00570 .00554 .00539 .00523 .00508 .00494 .00480
-2.4 .00820 .00798 .00776 .00755 .00734 .00714 .00695 .00676 .00657 .00639
-2.3 .01072 .01044 .01017 .00990 .00964 .00939 .00914 .00889 .00866 .00842
-2.2 .01390 .01355 .01321 .01287 .01255 .01222 .01191 .01160 .01130 .01101
-2.1 .01786 .01743 .01700 .01659 .01618 .01578 .01539 .01500 .01463 .01426
-2.0 .02275 .02222 .02169 .02118 .02068 .02018 .01970 .01923 .01876 .01831
-1.9 .02872 .02807 .02743 .02680 .02619 .02559 .02500 .02442 .02385 .02330
-1.8 .03593 .03515 .03438 .03362 .03288 .03216 .03144 .03074 .03005 .02938
-1.7 .04457 .04363 .04272 .04182 .04093 .04006 .03920 .03836 .03754 .03673
-1.6 .05480 .05370 .05262 .05155 .05050 .04947 .04846 .04746 .04648 .04551
-1.5 .06681 .06552 .06426 .06301 .06178 .06057 .05938 .05821 .05705 .05592
-1.4 .08076 .07927 .07780 .07636 .07493 .07353 .07215 .07078 .06944 .06811
-1.3 .09680 .09510 .09342 .09176 .09012 .08851 .08691 .08534 .08379 .08226
-1.2 .11507 .11314 .11123 .10935 .10749 .10565 .10383 .10204 .10027 .09853
-1.1 .13567 .13350 .13136 .12924 .12714 .12507 .12302 .12100 .11900 .11702
-1.0 .15866 .15625 .15386 .15151 .14917 .14686 .14457 .14231 .14007 .13786
-0.9 .18406 .18141 .17879 .17619 .17361 .17106 .16853 .16602 .16354 .16109
-0.8 .21186 .20897 .20611 .20327 .20045 .19766 .19489 .19215 .18943 .18673
-0.7 .24196 .23885 .23576 .23270 .22965 .22663 .22363 .22065 .21770 .21476
-0.6 .27425 .27093 .26763 .26435 .26109 .25785 .25463 .25143 .24825 .24510
-0.5 .30854 .30503 .30153 .29806 .29460 .29116 .28774 .28434 .28096 .27760
-0.4 .34458 .34090 .33724 .33360 .32997 .32636 .32276 .31918 .31561 .31207
-0.3 .38209 .37828 .37448 .37070 .36693 .36317 .35942 .35569 .35197 .34827
-0.2 .42074 .41683 .41294 .40905 .40517 .40129 .39743 .39358 .38974 .38591
-0.1 .46017 .45620 .45224 .44828 .44433 .44038 .43644 .43251 .42858 .42465
-0.0 .50000 .49601 .49202 .48803 .48405 .48006 .47608 .47210 .46812 .46414
Appendix H: FORMULAE
One-way ANOVA Table
$n_{tot}$ = total number of observations, g = number of groups
Source           df              SS    MS                          F
Between groups   g – 1           SSB   $MSB = SSB/(g-1)$           $F_{g-1,\,n_{tot}-g} = MSB/MSW$
Within groups    $n_{tot}$ – g   SSW   $MSW = SSW/(n_{tot}-g)$
Total            $n_{tot}$ – 1   SST
Two-way ANOVA Table
a = levels of Factor A, b = levels of Factor B, n = number of replications
Source        df               SS     MS                           F
Factor A      a – 1            SSA    $MSA = SSA/(a-1)$            $F_{a-1,\,ab(n-1)} = MSA/MSR$
Factor B      b – 1            SSB    $MSB = SSB/(b-1)$            $F_{b-1,\,ab(n-1)} = MSB/MSR$
Interaction   (a – 1)(b – 1)   SSAB   $MSAB = SSAB/((a-1)(b-1))$   $F_{(a-1)(b-1),\,ab(n-1)} = MSAB/MSR$
Residual      ab(n – 1)        SSR    $MSR = SSR/(ab(n-1))$
Total         abn – 1          SST
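As an illustration of the two-way table, the interaction line of the first Oyster model in Appendix C can be reproduced from its sums of squares. This sketch assumes a balanced design with $a = 3$ cage levels, $b = 3$ size levels and $n = 6$ replicates per cell; $n = 6$ is inferred from the 54 observations and is not stated in the output.

```latex
\[
\text{df}_{AB} = (a-1)(b-1) = 2 \times 2 = 4, \qquad
\text{df}_{R} = ab(n-1) = 9 \times 5 = 45
\]
\[
MSAB = \frac{SSAB}{(a-1)(b-1)} = \frac{224.004630}{4} = 56.001157, \qquad
MSR = \frac{SSR}{ab(n-1)} = \frac{1283.010417}{45} \approx 28.511343
\]
\[
F_{4,\,45} = \frac{MSAB}{MSR} = \frac{56.001157}{28.511343} \approx 1.96
\]
```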
ANOVA for Regression
n = total number of cases, k = number of explanatory variables
Source       df          SS      MS                         F
Regression   k           RegSS   $RegMS = RegSS/k$          $F_{k,\,n-k-1} = RegMS/ResMS$
Residual     n – k – 1   ResSS   $ResMS = ResSS/(n-k-1)$
Total        n – 1       TSS
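The same arithmetic can be traced through the first regression in Appendix D (Salary on YrMajor), where $n = 263$ observations were used and $k = 1$. The block below is only a check of the printed Analysis of Variance table against the formulae above.

```latex
\[
RegMS = \frac{RegSS}{k} = \frac{10458197}{1} = 10458197, \qquad
ResMS = \frac{ResSS}{n-k-1} = \frac{42860916}{261} \approx 164218
\]
\[
F_{1,\,261} = \frac{RegMS}{ResMS} = \frac{10458197}{164218} \approx 63.68
\]
```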
The Chi-square Test
$X^2 = \sum_{\text{all cells in the table}} \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$
For one-way tables: df = J – 1
For two-way tables:
Expected count in cell (i,j) = $\dfrac{R_i C_j}{n}$
df = (I – 1)(J – 1)
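As a worked illustration, the Males/Chinese cell of the table in Appendix A can be checked against these formulae. Here $R_i$ is read as the row total and $C_j$ as the column total, which is the usual meaning but is an assumption about the notation.

```latex
% Row total (Males) = 47824, column total (Chinese) = 895, n = 92208, observed = 446
\[
\text{expected} = \frac{R_i C_j}{n} = \frac{47824 \times 895}{92208} \approx 464.19
\]
% Contribution of this cell to the statistic, using the unrounded expected count 464.195
\[
\frac{(\text{observed} - \text{expected})^2}{\text{expected}}
  = \frac{(446 - 464.195)^2}{464.195} \approx 0.7132
\]
% Degrees of freedom for the 2 x 6 table
\[
\text{df} = (I - 1)(J - 1) = (2 - 1)(6 - 1) = 5
\]
```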