STATS 785

1 of 9

THE UNIVERSITY OF AUCKLAND

SUMMER SCHOOL, 2020

Campus: City

STATISTICS

Statistical Programming and Modelling using SAS

(Time allowed: THREE hours)

NOTE: Attempt all questions. Show all working. For questions that require calculation the majority of

the marks are awarded for correct working. These marks will not be awarded if the working is not

shown, even if the answer is correct.

A separate Appendices Booklet contains SAS code, analyses and formulae.

Read this very carefully.

Total marks: 100.

1. [10 marks] PROC FREQ was applied to births data collected in the National Women’s Hospital

over 1968-1987, where frequencies were recorded with respect to the race of the mother and the

sex of the new-born baby. Some values have been suppressed, denoted by AAAAA, BBBBB,

CCCCC, and DDDDD.

These Tables are shown in APPENDIX A on page 2 of the APPENDICES BOOKLET. Using

these outputs, answer the following:

(a) Compute AAAAA. [1 mark]

(b) Compute BBBBB. [1 mark]

(c) Compute CCCCC. [1 mark]

(d) Compute DDDDD. [1 mark]

(e) State the null hypothesis H0 and the alternative hypothesis H1 for the Chi-squared test.

Define any notation you use. [1 mark]

(f) Is it valid to use the Chi-square test result? Explain. [1 mark]

(g) Write down the conclusion (in plain words) that you can draw from the Chi-square result. [1

mark]

(h) Write a SAS program (including the code creating the SAS dataset for the 2 x 6 table) to

produce the above output. [3 marks]

2. [13 marks] One method of measuring genetic damage to an individual’s DNA is to calculate the mean

number of sister chromatid exchanges (MSCE) observed per cell. MSCE levels for 32 individuals were

classified by race.

The SAS code and output are shown in APPENDIX B on pages 3 to 5 of the APPENDICES

BOOKLET. Use this information to answer the following questions.

(a) Compute the missing value AAAAA. [1 mark]

(b) Compute the missing value BBBBB. [1 mark]

(c) Compute the missing value CCCCC. [1 mark]

(d) Compute the missing value DDDDD. [1 mark]

(e) Write down the null and alternative hypotheses for this model. [1 mark]

(f) Calculate the mean sister chromatid exchange level in each ethnic group. [1 mark]

(g) What do you conclude from the above output? [1 mark]

(h) Write a Methods and Assumptions check section for this data. [3 marks]

(i) Write an Executive Summary. [3 marks]

3. [12 marks] A randomised experiment was conducted to examine the density of Sydney rock

oysters under various conditions. The density of the oysters was measured for different patch sizes

and different predator control regimes. The SAS code and output for this question are shown in

APPENDIX C on pages 6 to 10 of the APPENDICES BOOKLET.

(a) Write a Methods and Assumption Checks section for the analysis of the Oyster data

model in APPENDIX C of the APPENDICES BOOKLET. [5 marks] Remember to

include the equation of the best model in the analysis.

(b) Write an Executive Summary for the analysis of the Oyster data in APPENDIX C of the

APPENDICES BOOKLET. [7 marks]

4. [12 marks] The following questions involve the analysis of the Baseball Data I detailed in

APPENDIX D on pages 11 to 17.

(a) Comment on the scatter plot of the data. [2 marks]

(b) Write a Methods and Assumption Checks section for the analysis of the Baseball Data I

in APPENDIX D on pages 11 to 17 of the APPENDICES BOOKLET. Remember to

include the equation of the best model in the analysis. [4 marks]

the APPENDICES BOOKLET. [5 marks]

(d) Use your model to estimate the predicted salary for a baseball player with 10 years in the

major league (Show all working). Interpret the prediction interval. How useful do you think

this model will be for prediction? Give two reasons justifying your answer. (You do not

need to include this prediction in the Executive Summary.) [1 mark]

5. [15 marks] The following questions involve the analysis of the Baseball Data II detailed in

APPENDIX E on pages 18 to 27.

(a) Comment on the correlations and pair plots of the data. [2 marks]

(b) Write a Methods and Assumption Checks section for the analysis of the Baseball Data II

in APPENDIX E. Remember to include the equation of the best model in the analysis. [5

marks]

[5 marks]

(d) Use your model to estimate the predicted salary for a baseball player with the following

values. (Show all working).

nRBIs = 50

nRuns = 50

YrMajor = 10

CrHits = 750

(You do not need to include this prediction in the Executive Summary.) [3 marks]

6. [13 marks] The space shuttle solid-fuel rockets have a total of 6 O-rings. It was suspected that O-

ring reliability was influenced by temperature.

The SAS code and output for this question are shown in APPENDIX F on pages 29 to 31 of the

APPENDICES BOOKLET.

We are interested in the proportion (or number of) O-rings that failed and the temperature during

the launch. We wish to quantify the influence of temperature on the probability of

having at least one incident related to the O-rings. Specifically, we want to address the following

questions:

Q1. Is the temperature associated with O-ring incidents?

Q2. In what way did temperature affect the probability of O-ring incidents?

Q3. What was the predicted probability of an incident in an O-ring for the temperature of the

launch day?

(a) Write a Methods and Assumption Checks section for the analysis of the Challenger data

model in APPENDIX F of the APPENDICES BOOKLET. [5 marks]

Remember to include the equation of the model used in the analysis.

(b) Calculate the predicted probability of an incident in an O-ring at -0.6°C (the temperature on

the launch day). How many distressed O-rings does this correspond to? Include all

working for both parts of this question. [2 marks]

the APPENDICES BOOKLET. [6 marks]

7. [10 marks] The following data are counts of snapper recorded by a baited underwater video camera

in 1998. There are 55 counts, of which 41 were taken inside the Leigh Marine Reserve and 14 were

taken in areas immediately next to the reserve.

Inside reserve:
0 0 4 0 0 3 0 0 0 0 3 8 1 5
4 1 5 2 1 11 12 4 4 3 6 9 6 6
2 2 5 0 7 2 8 2 2 1 9 1 1

Outside reserve:
0 0 0 0 0 0 1 0 1 0 3 2 2 5

It is desired to make inference about the ratio of snapper density inside the reserve versus outside,

that is $\theta = \mu_R / \mu_N$, where $\mu_R$ is the mean density of snapper in the reserve and $\mu_N$ is the mean density

of snapper outside the reserve. This ratio will be estimated using $T(X) = \bar{X}_R / \bar{X}_N$, where $\bar{X}_R$ is the

sample mean of the counts inside the reserve and $\bar{X}_N$ is the sample mean of the counts outside the

reserve. A SAS program to calculate this is:

DATA parta(KEEP=tx);
*Reserve counts;
ARRAY r {41} (0 0 4 0 0 3 0 0 0 0 3 8 1 5
4 1 5 2 1 11 12 4 4 3 6 9 6 6
2 2 5 0 7 2 8 2 2 1 9 1 1);
*Non-reserve counts;
ARRAY nr {14} (0 0 0 0 0 0 1 0 1 0 3 2 2 5);
tx = MEAN(OF r1-r41)/MEAN(OF nr1-nr14);
RUN;

PROC PRINT NOOBS;
TITLE 'Observed value of T(X)';
RUN;

DATA snapper(KEEP=txstar);
*Reserve counts;
ARRAY r {41} (0 0 4 0 0 3 0 0 0 0 3 8 1 5
4 1 5 2 1 11 12 4 4 3 6 9 6 6
2 2 5 0 7 2 8 2 2 1 9 1 1);
*Non-reserve counts;
ARRAY nr {14} (0 0 0 0 0 0 1 0 1 0 3 2 2 5);
ARRAY rstar {41};
ARRAY nrstar {14};
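DO iter = 1 TO 100000;
*Resample the reserve counts with replacement;
DO i = 1 TO 41;
rstar[i] = r[ ROUND(0.5+41*RANUNI(0)) ];
END;
*Resample the non-reserve counts with replacement;
DO i = 1 TO 14;
nrstar[i] = nr[ ROUND(0.5+14*RANUNI(0)) ];
END;
*Ratio of the resampled means for this bootstrap replicate;
txstar = MEAN(OF rstar1-rstar41)/MEAN(OF nrstar1-nrstar14);
*A zero non-reserve mean gives a missing ratio, so record it as a very large value;
IF txstar = . THEN txstar = 999999;
OUTPUT;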

END;

PROC UNIVARIATE NOPRINT;
OUTPUT OUT=bootsum N=numsamps PCTLPRE=pctl PCTLPTS=2.5 97.5;
TITLE '95% percentile confidence interval';
RUN;

PROC PRINT NOOBS LABEL;
LABEL numsamps = 'Number of resamples';
RUN;

Which produces

Observed value of T(X)

3.41463

95% percentile confidence interval

Number of    the 2.5000 percentile,   the 97.5000 percentile,
resamples    txstar                   txstar

100000       1.70732                  11.0634

(a) What is the observed value of T(X)? [1 mark]

(b) Provide an approximate 95% CI for the ratio of mean snapper densities (inside versus outside the reserve) using the bootstrap percentile method. [1 mark]

T2(X) = log( ) as an estimator of log( ). [8 marks]

8. [5 marks] Suppose, in the context of testing for a difference between two groups using the

Wilcoxon rank-sum test, that the two samples have five observations each, and the rank-sum

statistic is S = 17.

(a) What is the two-sided p-value for testing the null hypothesis of no difference between the

two groups? [3 marks]

(b) Carry out the normal approximation for the Wilcoxon rank-sum test. Tables of the Normal

Distribution are given in APPENDIX G on Page 32. [2 marks]

9. [10 marks] Twelve subjects were classified into a 2 by 2 contingency table, giving

(a) What are the expected counts under the hypothesis of no association between treatment

and outcome? [2 marks]

(b) List all other possible tables that are consistent with the observed row and column totals.

[2 marks]

assigning the 12 subjects that will result in that table. [4 marks]

(d) What is Fisher’s exact p-value for the hypothesis of no association between treatment and

outcome? [2 marks]

Some factorials:

0! = 1        10! = 3628800             20! = 2.4329020 × 10^18
1! = 1        11! = 39916800            21! = 5.1090942 × 10^19
2! = 2        12! = 4.7900160 × 10^8    22! = 1.1240007 × 10^21
3! = 6        13! = 6.2270208 × 10^9    23! = 2.5852017 × 10^22
4! = 24       14! = 8.7178291 × 10^10   24! = 6.2044840 × 10^23
5! = 120      15! = 1.3076744 × 10^12   25! = 1.5511210 × 10^25
6! = 720      16! = 2.0922790 × 10^13   26! = 4.0329146 × 10^26
7! = 5040     17! = 3.5568743 × 10^14   27! = 1.0888869 × 10^28
8! = 40320    18! = 6.4023737 × 10^15   28! = 3.0488834 × 10^29
9! = 362880   19! = 1.2164510 × 10^17   29! = 8.8417620 × 10^30

APPENDIX STATS 785

Page 1 of 33

THE UNIVERSITY OF AUCKLAND

SUMMER SCHOOL, 2020

Campus: City

STATISTICS

Statistical Programming and Modelling using SAS

APPENDICES BOOKLET

CONTENTS

• APPENDIX A: Race of mother and sex of baby    page 2
• APPENDIX B: Genetic Damage data               pages 3 to 5
• APPENDIX C: Oyster data                       pages 6 to 10
• APPENDIX D: Baseball data I                   pages 11 to 17
• APPENDIX E: Baseball data II                  pages 18 to 27
• APPENDIX F: Space Challenger data             pages 28 to 31
• APPENDIX G: Standard Normal Distribution      page 32
• APPENDIX H: Formulae                          page 33

APPENDIX STATS 785

Page 2 of 33

Appendix A: Race of mother and sex of baby

PROC FREQ was applied to births data collected in the National Women’s

Hospital over 1968-1987, where frequencies were recorded with respect to the race

of the mother and the sex of the newborn baby. Some values have been suppressed,

denoted by AAAAA, BBBBB, CCCCC, and DDDDD.

The variables are:

Race Ethnic group of mother.
sex Gender (Male or Female).

Table of sex by race

sex       Statistic         Chinese   European   Indian   Maori    Other    PacificIsland   Total
Females   Frequency         AAAAA     30848      531      6674     361      5521            44384
          Expected          430.81    BBBBB      494.82   6644.5   361.01   5557.1
          Cell Chi-Square   0.7685    0.0737     CCCCC    0.1309   271E-9   0.2351
Males     Frequency         446       33338      497      7130     389      6024            47824
          Expected          464.19    33290      533.18   7159.5   388.99   5987.9
          Cell Chi-Square   0.7132    0.0684     2.4545   0.1215   251E-9   0.2182
Total     Frequency         895       64186      1028     13804    750      11545           92208

Statistic DF Value Prob

Chi-Square 5 DDDDD 0.1907

Likelihood Ratio Chi-Square 5 7.4229 0.1910

Mantel-Haenszel Chi-Square 1 0.0987 0.7534

Phi Coefficient 0.0090

Contingency Coefficient 0.0090

Cramer’s V 0.0090

APPENDIX STATS 785

Page 3 of 33

Appendix B. Genetic Damage data

One method of measuring genetic damage to an individual’s DNA is to calculate

the mean number of sister chromatid exchanges (MSCE) observed per cell.

MSCE levels for 32 individuals were classified by race.

The variables measured were:

Race Asian, Black, Caucasian and Native American
MSCE the mean number of sister chromatid exchanges per cell.

The SAS code and output follow:

PROC FORMAT;
VALUE racefmt 1 = 'Black'
2 = 'Caucasian'
3 = 'Native American'
4 = 'Asian';
RUN;

PROC GLM PLOTS=DIAGNOSTICS(UNPACK);
CLASS race;
MODEL MSCE = race / SOLUTION;
LSMEANS race / STDERR PDIFF;
FORMAT race racefmt.;
RUN;

The GLM Procedure

Class Level Information

Class Levels Values

race 4 Asian Black Caucasian Native American

Number of Observations Read 32

Number of Observations Used 32

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             3    2.13604013       AAAAA         CCCCC     0.1801
Error             28   11.40830675      BBBBB
Corrected Total   31   DDDDD

APPENDIX STATS 785

Page 4 of 33

R-Square Coeff Var Root MSE MSCE Mean

0.157707 7.591866 0.638310 8.407813

Source DF Type I SS Mean Square F Value Pr > F

race 3 2.13604013 0.71201338 1.75 0.1801

Source DF Type III SS Mean Square F Value Pr > F

race 3 2.13604013 0.71201338 1.75 0.1801

Parameter              Estimate         Standard Error   t Value   Pr > |t|
Intercept              8.572857143 B    0.24125846       35.53     <.0001
race Asian             0.185892857 B    0.33035676       0.56      0.5781
race Black             -0.349107143 B   0.33035676       -1.06     0.2997
race Caucasian         -0.441746032 B   0.32167795       -1.37     0.1806
race Native American   0.000000000 B    .                .         .

Note: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

APPENDIX STATS 785

Page 5 of 33

race              MSCE LSMEAN      Standard Error   Pr > |t|   LSMEAN Number
Asian             Answered in 4f   0.22567663       <.0001     1
Black             Answered in 4f   0.22567663       <.0001     2
Caucasian         Answered in 4f   0.21276997       <.0001     3
Native American   Answered in 4f   0.24125846       <.0001     4

Least Squares Means for effect race
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: MSCE

i/j   1        2        3        4
1              0.1048   0.0527   0.5781
2     0.1048            0.7674   0.2997
3     0.0527   0.7674            0.1806
4     0.5781   0.2997   0.1806

APPENDIX STATS 785

Page 6 of 33

Appendix C. Oyster data

A randomised experiment was conducted to examine the density of Sydney rock oysters

under various conditions. The density of the oysters was measured for different patch

sizes and different predator control regimes.

The variables measured were:

• Oysters: the number of oysters per 10 × 10 cm area

• Size: the size of the patches: small, medium or large

• Cage: the different predator control regimes: full, partial or open.

The SAS code and output follow:

PROC BOXPLOT;
PLOT oysters*cell;
RUN;

Where fulllar = fulllarge, partlar = partiallarge, openlar =

openlarge, fullmed = fullmedium, partmed = partialmedium, openmed =

openmedium, fullsma = fullsmall, partsma = partialsmall, opensma =

opensmall.

APPENDIX STATS 785

Page 7 of 33

PROC GLM PLOTS=ALL;
CLASS cage size;
MODEL oysters=cage|size;
RUN;

The GLM Procedure
Dependent Variable: oysters

Class Level Information

Class Levels Values

cage 3 full open part

size 3 lar med sma

Number of Observations Read 54

Number of Observations Used 54

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             8    2224.634259      278.079282    9.75      <.0001
Error             45   1283.010417      28.511343
Corrected Total   53   3507.644676

R-Square   Coeff Var   Root MSE   oysters Mean
0.634225   36.41787    5.339601   14.66204

Source      DF   Type I SS     Mean Square   F Value   Pr > F
cage        2    1986.516204   993.258102    34.84     <.0001
size        2    14.113426     7.056713      0.25      0.7818
cage*size   4    224.004630    56.001157     1.96      0.1163

APPENDIX STATS 785

Page 8 of 33

PROC GLM PLOTS=ALL;
CLASS cage size;
MODEL oysters = cage size;
LSMEANS cage size / ADJUST=TUKEY PDIFF STDERR CL;
RUN;

The GLM Procedure
Dependent Variable: oysters

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             4    2000.629630      500.157407    16.26     <.0001
Error             49   1507.015046      30.755409
Corrected Total   53   3507.644676

APPENDIX STATS 785

Page 9 of 33

R-Square   Coeff Var   Root MSE   oysters Mean
0.570363   37.82391    5.545756   14.66204

Source   DF   Type I SS     Mean Square   F Value   Pr > F
cage     2    1986.516204   993.258102    32.30     <.0001
size     2    14.113426     7.056713      0.23      0.7958

Source   DF   Type III SS   Mean Square   F Value   Pr > F
cage     2    1986.516204   993.258102    32.30     <.0001
size     2    14.113426     7.056713      0.23      0.7958

The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Tukey

cage   oysters LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
full   22.9722222       1.3071472        <.0001     1
open   8.6666667        1.3071472        <.0001     2
part   12.3472222       1.3071472        <.0001     3

APPENDIX STATS 785

Page 10 of 33

Least Squares Means for effect cage
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: oysters

i/j   1        2        3
1              <.0001   <.0001
2     <.0001            0.1252
3     <.0001   0.1252

cage   oysters LSMEAN   95% Confidence Limits
full   22.972222        20.345412   25.599033
open   8.666667         6.039856    11.293477
part   12.347222        9.720412    14.974033

Least Squares Means for Effect cage

i   j   Difference Between Means   Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1   2   14.305556                  9.837738    18.773373
1   3   10.625000                  6.157182    15.092818
2   3   -3.680556                  -8.148373   0.787262

APPENDIX STATS 785

Page 11 of 33

APPENDIX D: Baseball Salaries data I

The data set contains salary and performance information for Major League Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries are for the 1987 season and the performance measures are from 1986.

The variables are:

Salary:   1987 Salary in $ Thousands
YrMajor   Years in the Major league

The aim of this question is to determine whether the number of years in the major league has any influence on the salary of the baseball player.

PROC LOESS PLOTS DATA=ass2;
MODEL Salary = YrMajor / SMOOTH=0.67 CLM;
RUN;

PROC REG DATA=exam PLOTS=DIAGNOSTICS(UNPACK);
MODEL Salary = yrmajor;
RUN;

APPENDIX STATS 785

Page 12 of 33

The REG Procedure
Model: MODEL1
Dependent Variable: Salary 1987 Salary in $ Thousands

Number of Observations Read                  322
Number of Observations Used                  263
Number of Observations with Missing Values   59

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F

Model             1     10458197         10458197      63.68     <.0001
Error             261   42860916         164218
Corrected Total   262   53319113

APPENDIX STATS 785

Page 13 of 33

Root MSE         405.23828   R-Square   0.1961
Dependent Mean   535.92588   Adj R-Sq   0.1931
Coeff Var        75.61461

Parameter Estimates

Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                    1    227.65284            46.00686         4.95      <.0001
YrMajor     Years in the Major Leagues   1    41.70566             5.22609          7.98      <.0001

DATA exam;
SET exam;
LogSalary = LOG(Salary);
RUN;

PROC LOESS PLOTS DATA=ass2a;
MODEL LogSalary = YrMajor / SMOOTH=0.67 CLM;
RUN;

PROC REG DATA=exam PLOTS=DIAGNOSTICS(UNPACK);
MODEL LogSalary = yrmajor;
OUTPUT OUT=resid R=residual P=predicted;
RUN;

PROC LOESS PLOTS DATA=resid;
MODEL residual = predicted / SMOOTH=0.67 CLM;
RUN;

APPENDIX STATS 785

Page 14 of 33

DATA exam;
SET exam;
YrMajor2 = YrMajor**2;
RUN;

PROC REG DATA=ass2a PLOTS(LABEL)=DIAGNOSTICS(UNPACK);
ID name team league;
MODEL LogSalary = YrMajor YrMajor2;
OUTPUT OUT=resid R=residual P=predicted;
RUN;

PROC LOESS PLOTS DATA=resid;
MODEL residual = predicted / SMOOTH=0.67 CLM;
RUN;

APPENDIX STATS 785

Page 15 of 33

The REG Procedure
Model: MODEL1
Dependent Variable: logSalary Log Salary

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             2     106.41383        53.20692      137.32    <.0001
Error             260   100.73990        0.38746
Corrected Total   262   207.15373

Root MSE         0.62246    R-Square   0.5137
Dependent Mean   5.92722    Adj R-Sq   0.5100
Coeff Var        10.50178

APPENDIX STATS 785

Page 16 of 33

Parameter Estimates

Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                    1    4.24333              0.11368          37.33     <.0001
YrMajor     Years in the Major Leagues   1    0.38788              0.02885          13.44     <.0001
YrMajor2                                 1    -0.01527             0.00149          -10.22    <.0001

PROC REG DATA=exam PLOTS=(DIAGNOSTICS(UNPACK) PREDICTIONS(X=YrMajor));
MODEL LogSalary = YrMajor YrMajor2;
WHERE name^="Rose, Pete";
RUN;

DATA add;
INPUT YrMajor YrMajor2 name $;
CARDS;
10 100 Pred
;
RUN;

DATA both;
SET exam add;
RUN;

[Figure: Predictions and Residuals for logSalary. Log Salary and Residual plotted against Years in the Major Leagues, showing the fit with 95% confidence limits and 95% prediction limits.]

APPENDIX STATS 785

Page 17 of 33

PROC REG DATA=both;
MODEL LogSalary = YrMajor YrMajor2;
OUTPUT OUT=resid R=residual P=predicted ucl=CLUpper lcl=CLLower;
WHERE name^="Rose, Pete";
RUN;

DATA resid2;
SET resid;
IF name = 'Pred';
PredSal = EXP(Predicted);
lcl = EXP(CLLower);
ucl = EXP(CLUpper);
RUN;

PROC PRINT DATA=resid2;
VAR YrMajor YrMajor2 Predicted CLLower CLUpper PredSal lcl ucl;
WHERE name = 'Pred';
RUN;

Obs   YrMajor   YrMajor2   predicted   CLLower   CLUpper   PredSal   lcl       ucl
1     10        100        6.63913     5.43438   7.84389   764.432   229.150   2550.11

APPENDIX STATS 785

Page 18 of 33

APPENDIX E: Baseball Salaries Data II

The data set contains salary and performance information for Major League Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries are for the 1987 season and the performance measures are from 1986.

The variables are:

Salary:   1987 Salary in $ Thousands
YrMajor   Years in the Major league
nhits     Hits in 1986
nruns     Runs in 1986
nrbi      RBIs in 1986 (Runs Batted In)
nbb       Walks in 1986
crhits    Career hits

The aim of this question is to determine which variables have any influence on the salary of the baseball player.
PROC CORR DATA=ass2;
VAR salary nhits nruns nrbi nbb yrmajor crhits;
RUN;

PROC SGSCATTER DATA=ass2;
MATRIX Salary nhits nruns nrbi nbb yrmajor crhits / DIAGONAL=(HISTOGRAM);
RUN;

APPENDIX STATS 785

Page 19 of 33

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0

Variable (label)                       Salary    nHits      nRuns      nRBI      nBB       YrMajor    CrHits

Salary (1987 Salary in $ Thousands)    1.00000   0.50136    0.47903    0.51723   0.50462   0.44288    0.59221
                                                 <.0001     <.0001     <.0001    <.0001    <.0001     <.0001

nHits (Hits in 1986)                   0.50136   1.00000    0.90571    0.77829   0.57116   -0.00097   0.22217
                                       <.0001               <.0001     <.0001    <.0001    0.9875     0.0003

nRuns (Runs in 1986)                   0.47903   0.90571    1.00000    0.77059   0.68775   -0.02846   0.17966
                                       <.0001    <.0001                <.0001    <.0001    0.6460     0.0035

nRBI (RBIs in 1986)                    0.51723   0.77829    0.77059    1.00000   0.56181   0.12460    0.29005
                                       <.0001    <.0001     <.0001               <.0001    0.0435     <.0001

nBB (Walks in 1986)                    0.50462   0.57116    0.68775    0.56181   1.00000   0.12735    0.26615
                                       <.0001    <.0001     <.0001     <.0001              0.0390     <.0001

YrMajor (Years in the Major Leagues)   0.44288   -0.00097   -0.02846   0.12460   0.12735   1.00000    0.89803
                                       <.0001    0.9875     0.6460     0.0435    0.0390               <.0001

CrHits (Career Hits)                   0.59221   0.22217    0.17966    0.29005   0.26615   0.89803    1.00000
                                       <.0001    0.0003     0.0035     <.0001    <.0001    <.0001

APPENDIX STATS 785

Page 20 of 33

ODS GRAPHICS ON;
PROC REG DATA=ass2a PLOTS(LABEL)=ALL;
MODEL logsalary = nhits nruns nrbi nbb yrmajor crhits;
ID name team league;
RUN;
ODS GRAPHICS OFF;

[Figure: scatter plot matrix of 1987 Salary, Hits in 1986, Runs in 1986, RBIs in 1986, Walks in 1986, Years in the Major Leagues and Career Hits.]

APPENDIX STATS 785

Page 21 of 33

APPENDIX STATS 785

Page 22 of 33

DATA ass2a;
SET ass2(WHERE=(name^="Rose, Pete"));
yrMajor2 = YrMajor**2;
CrHits2 = CrHits**2;
RUN;

ODS GRAPHICS ON;
PROC REG DATA=ass2a PLOTS(LABEL)=(ALL RESIDUALS(SMOOTH));
MODEL logsalary = nhits nruns nrbi nbb yrmajor yrmajor2 crhits crhits2;
ID name;
RUN;
ODS GRAPHICS OFF;

Root MSE         0.41673   R-Square   0.7874
Dependent Mean   5.92458   Adj R-Sq   0.7807
Coeff Var        7.03393

APPENDIX STATS 785

Page 23 of 33

Parameter Estimates

Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|

Intercept   Intercept                    1    3.78922              0.11581          32.72     <.0001
nHits       Hits in 1986                 1    -0.00012753          0.00159          -0.08     0.9363
nRuns       Runs in 1986                 1    0.00215              0.00288          0.75      0.4549
nRBI        RBIs in 1986                 1    0.00431              0.00172          2.51      0.0127
nBB         Walks in 1986                1    0.00501              0.00173          2.90      0.0040
YrMajor     Years in the Major Leagues   1    0.23908              0.03443          6.94      <.0001
yrMajor2                                 1    -0.01440             0.00165          -8.73     <.0001
CrHits      Career Hits                  1    0.00170              0.00027562       6.18      <.0001
CrHits2                                  1    -3.31739E-7          1.001272E-7      -3.31     0.0011

[Figure: Fit Diagnostics for logSalary. Panels: Residual and RStudent against Predicted Value, RStudent against Leverage, Residual against Quantile, Log Salary against Predicted Value, Cook's D against Observation, histogram of Residual (Percent), and Fit–Mean and Residual against Proportion Less. Inset statistics: Observations 262, Parameters 3, Error DF 259, MSE 0.3709, R-Square 0.5352, Adj R-Square 0.5316.]

APPENDIX STATS 785

Page 24 of 33

PROC REG DATA=ass2a;
MODEL logsalary = nruns nrbi nbb yrmajor yrmajor2 crhits crhits2;
RUN;

Parameter Estimates

Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                    1    3.78600              0.10840          34.93     <.0001
nRuns       Runs in 1986                 1    0.00199              0.00205          0.97      0.3325
nRBI        RBIs in 1986                 1    0.00427              0.00163          2.62      0.0092
nBB         Walks in 1986                1    0.00504              0.00168          3.00      0.0030
YrMajor     Years in the Major Leagues   1    0.23940              0.03413          7.01      <.0001
yrMajor2                                 1    -0.01440             0.00165          -8.74     <.0001
CrHits      Career Hits                  1    0.00170              0.00026220       6.47      <.0001
CrHits2                                  1    -3.30148E-7          9.794245E-8      -3.37     0.0009

PROC REG DATA=ass2a;
MODEL logsalary = nrbi nbb yrmajor yrmajor2 crhits crhits2;
RUN;

Root MSE         0.41587   R-Square   0.7866
Dependent Mean   5.92458   Adj R-Sq   0.7816
Coeff Var        7.01937

APPENDIX STATS 785

Page 25 of 33

Parameter Estimates

Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                    1    3.82206              0.10183          37.53     <.0001
nRBI        RBIs in 1986                 1    0.00522              0.00130          4.03      <.0001
nBB         Walks in 1986                1    0.00582              0.00148          3.92      0.0001
YrMajor     Years in the Major Leagues   1    0.23188              0.03324          6.98      <.0001
yrMajor2                                 1    -0.01431             0.00164          -8.70     <.0001
CrHits      Career Hits                  1    0.00178              0.00024700       7.21      <.0001
CrHits2                                  1    -3.53159E-7          9.50212E-8       -3.72     0.0002

PROC REG DATA=ass2a;
MODEL logsalary = nrbi nbb yrmajor crhits / VIF;
RUN;

Parameter Estimates

Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   Intercept                    1    4.54564              0.12004          37.87     <.0001     0
nRBI        RBIs in 1986                 1    0.00748              0.00184          4.07      <.0001     1.64693
nBB         Walks in 1986                1    0.00869              0.00211          4.12      <.0001     1.51646
YrMajor     Years in the Major Leagues   1    0.02663              0.01944          1.37      0.1720     6.11568
CrHits      Career Hits                  1    0.00059538           0.00015734       3.78      0.0002     6.83256

%LET ind = nrbi nbb yrmajor yrmajor2 crhits crhits2;

PROC REG DATA=ass2a PLOTS=NONE;
MODEL logsalary = &ind;
OUTPUT OUT=pred P=p;
RUN;

%marginal(dependent=logSalary, predicted=p, independents=&ind)

APPENDIX STATS 785

Page 26 of 33

[Figure: Marginal Models for logSalary. Panels: RBIs in 1986, Walks in 1986, Years in the Major Leagues, yrMajor2, Career Hits, CrHits2. Other labels: Model, Data, Predicted Values.]

APPENDIX STATS 785

Page 27 of 33

DATA ADD;
INPUT nrbi nbb yrmajor crhits;
yrmajor2 = yrmajor**2;
crhits2 = crhits**2;
CARDS;
50 50 10 750
;
RUN;

DATA both;
SET ass2a add;
RUN;

PROC REG DATA=both;
MODEL logsalary = nrbi nbb yrmajor yrmajor2 crhits crhits2;
OUTPUT OUT=resid R=residual P=predicted ucl=CLUpper lcl=CLLower;
RUN;

DATA resid1;
SET resid;
est = EXP(predicted);
lcl = EXP(CLLower);
ucl = EXP(CLUpper);
RUN;

PROC PRINT;
VAR nrbi nbb yrmajor crhits Salary est lcl ucl;
WHERE nhits = .;
RUN;

Obs   nRBI   nBB   YrMajor   CrHits   Salary   est       lcl       ucl
322   50     50    10        750      .        601.539   263.690   1372.25

APPENDIX STATS 785

Page 28 of 33

Appendix F: Challenger Data

The space shuttle solid-fuel rockets have a total of 6 O-rings. It was suspected that O-ring reliability was influenced by temperature.

The variables of interest were:

• fail_field: the number of O-rings that failed.
• total: the total number of O-rings (= 6 here).

The SAS code and output follow:

SYMBOL1 VALUE=CIRCLE;
PROC GPLOT;
PLOT propn*temp;
RUN;

PROC GENMOD DATA=challenger PLOTS=RESDEV;
MODEL fail_field/total = temp / DIST=B;
RUN;

APPENDIX STATS 785

Page 29 of 33

The GENMOD Procedure

Model Information

Data Set                     WORK.CHALLENGER
Distribution                 Binomial
Link Function                Logit
Response Variable (Events)   fail_field
Response Variable (Trials)   total

Number of Observations Read   23
Number of Observations Used   23
Number of Events              10
Number of Trials              138

Response Profile

Ordered Value   Binary Outcome   Total Frequency
1               Event            10
2               Nonevent         128

Criteria For Assessing Goodness Of Fit

Criterion                  DF   Value      Value/DF
Deviance                   21   19.1362    0.9112
Scaled Deviance            21   19.1362    0.9112
Pearson Chi-Square         21   34.3574    1.6361
Scaled Pearson X2          21   34.3574    1.6361
Log Likelihood                  -31.0629
Full Log Likelihood             -16.4003
AIC (smaller is better)         36.8007
AICC (smaller is better)        37.4007
BIC (smaller is better)         39.0717

Algorithm converged.

APPENDIX STATS 785

Page 30 of 33

Analysis Of Maximum Likelihood Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept   1    2.2782     1.5160           -0.6931   5.2494             2.26              0.1329
temp        1    -0.2514    0.0838           -0.4157   -0.0871            8.99              0.0027
Scale       0    1.0000     0.0000           1.0000    1.0000

DATA temp;
p = 1 - PROBCHI(19.1632, 21);
RUN;
PROC PRINT;
RUN;

Obs p

1 0.57467

APPENDIX STATS 785

Page 31 of 33

DATA add;
INPUT temp;
CARDS;
-0.6
;
DATA both;
SET challenger add;
RUN;
PROC GENMOD DATA=both;
MODEL fail_field/total = temp / DIST=B;
OUTPUT OUT=pred P = pred lower=lower upper=upper;
RUN;
PROC PRINT DATA=pred;
VAR pred lower upper;
WHERE propn = .;
RUN;

Obs pred lower upper

24 0.91901 0.34562 0.99592

APPENDIX STATS 785

Page 32 of 33

STANDARD NORMAL DISTRIBUTION: Table Values Represent AREA to the LEFT of the Z score.

Appendix G: Standard Normal Distribution

Z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

-3.9 .00005 .00005 .00004 .00004 .00004 .00004 .00004 .00004 .00003 .00003
-3.8 .00007 .00007 .00007 .00006 .00006 .00006 .00006 .00005 .00005 .00005
-3.7 .00011 .00010 .00010 .00010 .00009 .00009 .00008 .00008 .00008 .00008
-3.6 .00016 .00015 .00015 .00014 .00014 .00013 .00013 .00012 .00012 .00011
-3.5 .00023 .00022 .00022 .00021 .00020 .00019 .00019 .00018 .00017 .00017

-3.4 .00034 .00032 .00031 .00030 .00029 .00028 .00027 .00026 .00025 .00024
-3.3 .00048 .00047 .00045 .00043 .00042 .00040 .00039 .00038 .00036 .00035
-3.2 .00069 .00066 .00064 .00062 .00060 .00058 .00056 .00054 .00052 .00050
-3.1 .00097 .00094 .00090 .00087 .00084 .00082 .00079 .00076 .00074 .00071
-3.0 .00135 .00131 .00126 .00122 .00118 .00114 .00111 .00107 .00104 .00100

-2.9 .00187 .00181 .00175 .00169 .00164 .00159 .00154 .00149 .00144 .00139
-2.8 .00256 .00248 .00240 .00233 .00226 .00219 .00212 .00205 .00199 .00193
-2.7 .00347 .00336 .00326 .00317 .00307 .00298 .00289 .00280 .00272 .00264
-2.6 .00466 .00453 .00440 .00427 .00415 .00402 .00391 .00379 .00368 .00357
-2.5 .00621 .00604 .00587 .00570 .00554 .00539 .00523 .00508 .00494 .00480

-2.4 .00820 .00798 .00776 .00755 .00734 .00714 .00695 .00676 .00657 .00639
-2.3 .01072 .01044 .01017 .00990 .00964 .00939 .00914 .00889 .00866 .00842
-2.2 .01390 .01355 .01321 .01287 .01255 .01222 .01191 .01160 .01130 .01101
-2.1 .01786 .01743 .01700 .01659 .01618 .01578 .01539 .01500 .01463 .01426
-2.0 .02275 .02222 .02169 .02118 .02068 .02018 .01970 .01923 .01876 .01831

-1.9 .02872 .02807 .02743 .02680 .02619 .02559 .02500 .02442 .02385 .02330
-1.8 .03593 .03515 .03438 .03362 .03288 .03216 .03144 .03074 .03005 .02938
-1.7 .04457 .04363 .04272 .04182 .04093 .04006 .03920 .03836 .03754 .03673
-1.6 .05480 .05370 .05262 .05155 .05050 .04947 .04846 .04746 .04648 .04551
-1.5 .06681 .06552 .06426 .06301 .06178 .06057 .05938 .05821 .05705 .05592

-1.4 .08076 .07927 .07780 .07636 .07493 .07353 .07215 .07078 .06944 .06811
-1.3 .09680 .09510 .09342 .09176 .09012 .08851 .08691 .08534 .08379 .08226
-1.2 .11507 .11314 .11123 .10935 .10749 .10565 .10383 .10204 .10027 .09853
-1.1 .13567 .13350 .13136 .12924 .12714 .12507 .12302 .12100 .11900 .11702
-1.0 .15866 .15625 .15386 .15151 .14917 .14686 .14457 .14231 .14007 .13786

-0.9 .18406 .18141 .17879 .17619 .17361 .17106 .16853 .16602 .16354 .16109
-0.8 .21186 .20897 .20611 .20327 .20045 .19766 .19489 .19215 .18943 .18673
-0.7 .24196 .23885 .23576 .23270 .22965 .22663 .22363 .22065 .21770 .21476
-0.6 .27425 .27093 .26763 .26435 .26109 .25785 .25463 .25143 .24825 .24510
-0.5 .30854 .30503 .30153 .29806 .29460 .29116 .28774 .28434 .28096 .27760

-0.4 .34458 .34090 .33724 .33360 .32997 .32636 .32276 .31918 .31561 .31207
-0.3 .38209 .37828 .37448 .37070 .36693 .36317 .35942 .35569 .35197 .34827
-0.2 .42074 .41683 .41294 .40905 .40517 .40129 .39743 .39358 .38974 .38591
-0.1 .46017 .45620 .45224 .44828 .44433 .44038 .43644 .43251 .42858 .42465
-0.0 .50000 .49601 .49202 .48803 .48405 .48006 .47608 .47210 .46812 .46414

APPENDIX STATS 785

Page 33 of 33

Appendix H: FORMULAE

One-way ANOVA Table

ntot = total number of observations, g = number of groups

Source           df         SS    MS                       F
Between groups   g – 1      SSB   MSB = SSB / (g – 1)      $F_{g-1,\,n_{tot}-g} = \mathrm{MSB}/\mathrm{MSW}$
Within groups    ntot – g   SSW   MSW = SSW / (ntot – g)
Total            ntot – 1   SST

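Two-way ANOVA Table

a = levels of Factor A, b = levels of Factor B, n = number of replications

Source        df                SS     MS                                 F
Factor A      a – 1             SSA    MSA = SSA / (a – 1)                $F_{a-1,\,ab(n-1)} = \mathrm{MSA}/\mathrm{MSR}$
Factor B      b – 1             SSB    MSB = SSB / (b – 1)                $F_{b-1,\,ab(n-1)} = \mathrm{MSB}/\mathrm{MSR}$
Interaction   (a – 1)(b – 1)    SSAB   MSAB = SSAB / ((a – 1)(b – 1))     $F_{(a-1)(b-1),\,ab(n-1)} = \mathrm{MSAB}/\mathrm{MSR}$
Residual      ab(n – 1)         SSR    MSR = SSR / (ab(n – 1))
Total         abn – 1           SST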
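
As an illustrative check of the two-way formulas (a worked example added for clarity, not part of the original formula sheet), the interaction model for the oyster data in Appendix C can be used, taking cage as Factor A so that a = 3, b = 3 and n = 6 (54 observations). The sums of squares are copied from that output; the labelling of cage as Factor A is an assumption made only for this illustration.

$$
\mathrm{MSA} = \frac{\mathrm{SSA}}{a-1} = \frac{1986.516204}{2} = 993.258102,
\qquad
\mathrm{MSR} = \frac{\mathrm{SSR}}{ab(n-1)} = \frac{1283.010417}{45} \approx 28.511343
$$

$$
F_{2,\,45} = \frac{\mathrm{MSA}}{\mathrm{MSR}} = \frac{993.258102}{28.511343} \approx 34.84,
\qquad
F_{4,\,45} = \frac{\mathrm{MSAB}}{\mathrm{MSR}} = \frac{224.004630/4}{28.511343} \approx 1.96
$$

Both values agree with the F Value column printed by PROC GLM for cage and cage*size.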

ANOVA for Regression

n = total number of cases, k = number of explanatory variables

Source       df          SS      MS                            F
Regression   k           RegSS   RegMS = RegSS / k             $F_{k,\,n-k-1} = \mathrm{RegMS}/\mathrm{ResMS}$
Residual     n – k – 1   ResSS   ResMS = ResSS / (n – k – 1)
Total        n – 1       TSS
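
A short worked check of the regression formulas (added as an illustration, not part of the original formula sheet) uses the first model in Appendix D, Salary regressed on YrMajor, where n = 263 and k = 1. The sums of squares are taken from that Analysis of Variance table.

$$
\mathrm{RegMS} = \frac{\mathrm{RegSS}}{k} = \frac{10458197}{1} = 10458197,
\qquad
\mathrm{ResMS} = \frac{\mathrm{ResSS}}{n-k-1} = \frac{42860916}{261} \approx 164218
$$

$$
F_{1,\,261} = \frac{\mathrm{RegMS}}{\mathrm{ResMS}} = \frac{10458197}{164218} \approx 63.68
$$

This matches the F Value reported by PROC REG for that model.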

The Chi-square Test

$$
\chi^2 = \sum_{\text{all cells in the table}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
$$

For one-way tables: df = J – 1

For two-way tables:

Expected count in cell (i, j) = $R_i C_j / n$

df = (I – 1)(J – 1)
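
To illustrate how these formulas are applied (a worked example added for clarity, not part of the original formula sheet), consider the Males/Chinese cell of the table in Appendix A, whose values are all printed. Here $R_i$ is taken to be the row total, $C_j$ the column total and $n$ the grand total, which is an assumed reading of the notation.

$$
\text{expected} = \frac{R_i C_j}{n} = \frac{47824 \times 895}{92208} \approx 464.19
$$

$$
\text{cell contribution} = \frac{(446 - 464.19)^2}{464.19} \approx 0.7132,
\qquad
\text{df} = (2-1)(6-1) = 5
$$

These agree with the Expected and Cell Chi-Square entries for that cell and with the DF reported for the Chi-Square statistic.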