Take-Home Final Exam for ISyE 7406
This is an open-book take-home final exam. You are free to use any resources, including textbooks,
notes, computers, and the internet, but no collaboration is allowed; in particular, you cannot
communicate, online or orally, with any other people about this exam (except the TAs or instructor via
Piazza if you have any questions or concerns). This must be individual work. See Canvas for the
data sets and some useful R code.
Overview: In probability and statistics, it is important to understand the mean and variance of
a random variable. In many applications, it is straightforward to simulate realizations of a random
variable Y, but it is often highly non-trivial to characterize the exact distribution of Y = Y(X1, X2),
including deriving explicit formulas for the mean and variance of Y = Y(X1, X2) as functions of
X1 and X2.
Objective: In this exam, suppose that Y = Y(X1, X2) is a random variable whose distribution
depends on two independent variables X1 and X2, and the objective is to estimate two deterministic
functions of X1 and X2: one is the mean µ(X1, X2) = E(Y), and the other is the variance V(X1, X2) = Var(Y).
For that purpose, you are provided with 200 observed realizations of Y for each of some given
pairs (X1, X2). You are asked to use data mining or machine learning methods that allow us to
conveniently predict or approximate the mean and variance of Y = Y(X1, X2) as functions of X1
and X2. That is, your task is to predict or approximate two values for each pair (X1, X2)
in the testing data set: one for the mean µ(X1, X2) = E(Y(X1, X2)) and the other for the variance
V(X1, X2) = Var(Y(X1, X2)).
Training data set: To help you develop a reasonable estimate of the mean and
variance of Y = Y(X1, X2) as deterministic functions of X1 and X2, we provide a training data
set that is generated as follows. We first choose uniform design points with 0 ≤ X1 ≤ 1 and
0 ≤ X2 ≤ 1, namely x1i = 0.01·i for i = 0, 1, 2, . . . , 99, and x2j = 0.01·j for j = 0, 1, 2, . . . , 99. Thus
there are a total of 100 × 100 = 10,000 combinations of (x1i, x2j), and for each of these 10,000 combinations,
we generate 200 independent realizations of the Y variable, denoted by Yijk for k = 1, . . . , 200.
The corresponding training data, 2022Fall7406train.csv, is available from Canvas. Note that
this training data set is a 10,000 × 202 table. Each row corresponds to one of the 100 × 100 = 10,000 combinations
of (X1, X2). The first and second columns are the X1 and X2 values, respectively, whereas the
remaining 200 columns are the corresponding 200 independent realizations of Y.
Based on the training data, you are asked to develop an accurate estimate of the functions
µ(X1, X2) = E(Y) and V(X1, X2) = Var(Y), as deterministic functions of X1 and X2 when 0 ≤ X1 ≤ 1
and 0 ≤ X2 ≤ 1.
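As an illustration of one possible (not required) approach, the row-wise empirical means and variances could each be smoothed over the (X1, X2) grid, e.g., with loess. The sketch below simulates a stand-in for the data frame data0 built in the appendix, since the real training data are only on Canvas; the mean and variance surfaces used here are made up.

```r
## Hedged sketch: smooth the empirical mean and variance surfaces with loess.
## "data0" here is a simulated stand-in for the appendix data frame
## (X1, X2, muhat, Vhat); the surfaces below are hypothetical, and loess
## (with an arbitrarily chosen span) is just one of many possible methods.
set.seed(1)
grid  <- expand.grid(X1 = seq(0, 0.99, by = 0.01), X2 = seq(0, 0.99, by = 0.01))
data0 <- data.frame(grid,
                    muhat = sin(2 * pi * grid$X1) + grid$X2 + rnorm(nrow(grid), sd = 0.05),
                    Vhat  = 1 + grid$X1 * grid$X2 + rnorm(nrow(grid), sd = 0.05))

## Fit two separate models: one for the mean surface, one for the variance surface
fit.mu <- loess(muhat ~ X1 + X2, data = data0, span = 0.1, degree = 2)
fit.V  <- loess(Vhat  ~ X1 + X2, data = data0, span = 0.1, degree = 2)

## Predict at new (X1, X2) points, as one would for the testing design points
newX    <- data.frame(X1 = c(0.25, 0.50), X2 = c(0.75, 0.50))
pred.mu <- predict(fit.mu, newdata = newX)
pred.V  <- predict(fit.V,  newdata = newX)
```

The span, and indeed the choice of loess itself, should of course be tuned, e.g., by cross-validation on data0.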
To assist you, a limited exploratory data analysis (EDA) of the training data, written in R, is provided in the
appendix. Please feel free to adapt it to other languages such as Python, Matlab, etc.
Testing data set: For the purpose of evaluating your proposed estimation models and methods,
we choose 50 random design points for X1 and 50 random design points for X2. Thus there are a total
of 50 × 50 = 2500 combinations of (X1, X2) in the testing data set.
The exact values of the (X1, X2) pairs in the testing data set are included in the file 2022Fall7406test.csv,
which is available from Canvas. You are asked to use your formula to predict µ(X1, X2) = E(Y) and
V(X1, X2) = Var(Y) for each of the 50 × 50 = 2500 combinations of (X1, X2) in the testing data (please keep
at least six digits in your answers).
Estimation Evaluation Criterion: To evaluate your estimation or prediction, we obtain
"true" values µ(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for each combination of (X1, X2) in the
testing data set, based on the following Monte Carlo simulations (we will not release these true values!).
We first generated 200 independent realizations of Y for each combination of (X1, X2) in the testing
data set, but we will not release these realizations. Then, for each given combination of (X1, X2),
with the 200 realizations of Y denoted by Y1, . . . , Y200, we compute the "true" values as
µ∗true = Ȳ = (Y1 + · · · + Y200)/200   and   V∗true = V̂ar(Y) = (1/199) Σ_{i=1}^{200} (Yi − Ȳ)².
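To make the Monte Carlo computation above concrete, here is a small sketch for a single (X1, X2) pair; the mean function f and variance function g below are hypothetical placeholders, since the exam's true data-generating mechanism is not released.

```r
## Hedged sketch of the Monte Carlo "true" values for one (X1, X2) pair.
## The functions f and g are made up purely for illustration.
set.seed(7406)
f <- function(x1, x2) sin(2 * pi * x1) + x2    # hypothetical mean function
g <- function(x1, x2) 1 + x1 * x2              # hypothetical variance function
x1 <- 0.3; x2 <- 0.6
Y <- rnorm(200, mean = f(x1, x2), sd = sqrt(g(x1, x2)))
mu.true <- mean(Y)    # Ybar = (Y1 + ... + Y200) / 200
V.true  <- var(Y)     # = (1/199) * sum((Yi - Ybar)^2)
```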
Your predicted mean and variance functions, say µ̂(X1, X2) and V̂(X1, X2), will then be evaluated
against these true values, µ∗true(X1, X2) and V∗true(X1, X2), via the mean squared errors

MSEµ = (1/(I·J)) Σ_{i=1}^{I} Σ_{j=1}^{J} (µ̂(x1i, x2j) − µ∗true(x1i, x2j))²,
MSEV = (1/(I·J)) Σ_{i=1}^{I} Σ_{j=1}^{J} (V̂(x1i, x2j) − V∗true(x1i, x2j))²,    (1)

where (I, J) = (50, 50) for the testing data.
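The evaluation in equation (1) amounts to averaging squared errors over all I·J = 2500 testing points; a minimal sketch with made-up placeholder numbers:

```r
## Hedged sketch of equation (1) with simulated placeholder values:
## mu.true stands in for the (unreleased) Monte Carlo "true" means.
set.seed(1)
mu.true <- rnorm(50 * 50)                        # placeholder true values, one per (X1, X2)
mu.hat  <- mu.true + rnorm(50 * 50, sd = 0.1)    # placeholder predictions
MSE.mu  <- mean((mu.hat - mu.true)^2)            # average over all I*J = 2500 points
```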
Your tasks: As your solution to this exam, you are required to submit two files to Canvas
before the deadline:
(a) A .csv file with the required predictions, i.e., your predicted values of µ(X1, X2) = E(Y)
and V(X1, X2) = Var(Y) for the testing data (in 6 digits). Please name your file
"1.YourLastName.YourFirstName.csv", e.g., "1.Mei.Yajun.csv" for the name of the instructor.
Students in our class should each have a unique combination of last/first name, so there is no
need to include the middle name.
• The submitted .csv file must be a 2500 × 4 matrix, and the first two columns must
be exactly the same as in the provided testing data file "2022Fall7406test.csv". The third
column should be your estimated mean µ̂(X1, X2), and the fourth column your estimated
variance V̂(X1, X2).
• If you want, you may round your numerical answers to six decimal places, e.g., report your
estimates in the form 30.xxxxxx, but this is optional: in our evaluation process we will
use the round function to round your answers to six decimal places before computing the MSE.
• Please save your predictions as a 2500 × 4 data matrix in this .csv file, i.e., without headers
or row/column labels/names. We will use the computer to auto-read your .csv file and
then auto-compute the MSE values in equation (1) for all students, based on the alphabetical
order of last/first names. It is therefore important to follow this guideline, i.e., no
headers or extra columns/rows in the .csv file, and name your .csv file as described above.
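Before uploading, it may be worth re-reading your own .csv file to confirm it has the required 2500 × 4, header-free layout. The sketch below uses a temporary file and random placeholder predictions; substitute your actual file name and predictions.

```r
## Hedged sanity check: write a dummy 2500 x 4 prediction table without
## headers or row names, re-read it, and confirm the expected shape.
set.seed(1)
testdata <- data.frame(X1 = runif(2500), X2 = runif(2500),
                       muhat = rnorm(2500), Vhat = rexp(2500))
fname <- tempfile(fileext = ".csv")   # stand-in for "1.LastName.FirstName.csv"
write.table(testdata, file = fname, sep = ",", col.names = FALSE, row.names = FALSE)
pred <- read.table(file = fname, sep = ",")
stopifnot(nrow(pred) == 2500, ncol(pred) == 4)
```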
(b) A (pdf or docx) file that explains the methods used for the prediction. Please name your file as
“2.YourLastName.YourFirstName”, e.g., “2.Mei.Yajun.pdf” or “2.Mei.Yajun.docx” for the name
of the instructor.
Your written report should read like a good journal paper: concise, clearly explaining and justifying
your proposed models and methods; also see the guidelines on the final report of our course
project. Please feel free to use any methods: this is an open-ended problem, and you can either
use any standard methods you learned in class or develop your estimates with a completely
new approach.
Remark: If you upload your files multiple times to Canvas, the file names might be automatically
renamed by Canvas to "1.YourLastName.YourFirstName.csv-1" or similar. If this occurs, please
do not worry; we will take this into account and correct it for you.
Grading Policies: The total score of this take-home final exam is 25 points, which will be graded
by the TAs and instructor. There are three components:
• Prediction accuracy on the mean: 10 points. The smaller MSEµ in (1), the better. We expect
that most students will have values in the range [1.0, 1.5]. Thus, tentatively: 10 if
MSEµ ≤ 1.20, 9 if in (1.20, 1.40], 8 if in (1.40, 1.60], 7 if in (1.60, 1.80], 6 if in (1.80, 2.00], 5 if in
(2.00, 3], 4 if in (3, 10], 3 if in (10, 20], 2 if in (20, 30], etc. We reserve the right to adjust the
grading scale to be more generous if needed.
• Prediction accuracy on the variance: 10 points. The smaller MSEV in (1), the better. We
expect that most students will have values in the range [500, 600]. Thus, tentatively:
10 if MSEV ≤ 550, 9 if in (550, 570], 8 if in (570, 590], 7 if in (590, 610], 6 if in (610, 630], 5 if in
(630, 650], 4 if in (650, 700], 3 if in (700, 1000], 2 if in (1000, 5000], etc. We reserve the right
to adjust the grading scale to be more generous if needed.
• Written report: 5 points. There are no specific guidelines for the written report; please
feel free to use common sense. That said, we will look at the following aspects. Is the
report well-written and easy to read? Is it easy to find the final chosen model or method? Does the
report clearly explain how and why the final method was chosen? Does the report discuss
how to suitably tune the parameters in the final chosen model? We plan to assign the grades for this
component as follows: "A" = 5, "B" = 4, "C" = 3, "D" = 2, "F" = 1, "Not submitted" = 0.
The TAs and instructor will try their best to give fair technical grades to all reasonable answers
(e.g., even if your prediction accuracy is not as good as other students', we reserve the right to increase
your prediction accuracy scores if your written report justifies well that your proposed method is simple
but useful). However, we acknowledge that ultimately this is a subjective decision.
If needed, please feel free to leave a public or private message on Piazza. Good luck on your final exam!
Appendix: Some useful R code for (A) the training data set, (B) the testing data set, and (C) our
auto-grading program.
(A) Exploratory data analysis of the training data set, which might be useful to inspire you to develop
suitable methods for prediction:
### Read Training Data
## Assume you save the training data in the folder "C:/temp" on your local laptop
traindata <- read.table(file = "C:/temp/2022Fall7406train.csv", sep = ",");
dim(traindata);   ## dim = 10000*202
## The first two columns are the X1 and X2 values, and the last 200 columns are the Y values

### Some example plots for exploratory data analysis
### please feel free to add more exploratory analysis
X1 <- traindata[,1];
X2 <- traindata[,2];
## compute the empirical estimates muhat = E(Y) and Vhat = Var(Y)
muhat <- apply(traindata[, 3:202], 1, mean);
Vhat  <- apply(traindata[, 3:202], 1, var);

## You can construct a data frame in R that includes all crucial
## information for our exam
data0 <- data.frame(X1 = X1, X2 = X2, muhat = muhat, Vhat = Vhat);

## we can plot 4 graphs in a single plot
par(mfrow = c(2, 2));
plot(X1, muhat);
plot(X2, muhat);
plot(X1, Vhat);
plot(X2, Vhat);

## Or you can first create an initial plot of one line
## and then iteratively add the lines.
## Below is an example plotting X1 vs. muhat for different X2 values.
## let us reset the plot
par(mfrow = c(1, 1));
## now plot the lines one by one for each fixed X2
flag <- which(data0$X2 == 0);
plot(data0[flag, 1], data0[flag, 3], type = "l",
     xlim = range(data0$X1), ylim = range(data0$muhat),
     xlab = "X1", ylab = "muhat");
for (j in 1:99){
  flag <- which(data0$X2 == 0.01*j);
  lines(data0[flag, 1], data0[flag, 3]);
}
## You can also plot figures for each fixed X1, or for Vhat

### You are essentially asked to build two models based on "data0":
### one is to predict muhat based on (X1, X2); and
### the other is to predict Vhat based on (X1, X2).

(B) Read the testing data and write your prediction on the testing data:

## Testing Data: first read the testing X variables
testX <- read.table(file = "C:/temp/2022Fall7406test.csv", sep = ",");
dim(testX);   ## This should be a 2500*2 matrix
## Next, based on your models, predict muhat and Vhat for each (X1, X2) in testX.
## Suppose that leads you to a new data.frame "testdata"
## with 4 columns: "X1", "X2", "muhat", "Vhat".
## Then you can write them to the .csv file as follows
## (please use your own Last Name and First Name):
write.table(testdata, file = "C:/temp/1.LastName.FirstName.csv",
            sep = ",", col.names = F, row.names = F);
## Then you can upload the .csv file to Canvas.
## Note that in your final answers, you essentially add two columns,
## your estimates of mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y),
## to the testing X data file "2022Fall7406test.csv".
## Please save your predictions as a 2500*4 data matrix
## in a .csv file "without" headers or extra columns/rows.

(C) Our auto-grading program on your prediction (this does not affect your prediction; it is only
for those interested students):

##### In the auto-grading, we run loops, one loop for each student.
##### In each loop, we first generate the filename as name1 = "1.LastName.FirstName.csv".
##### Next, we compare your answers with the Monte Carlo based values
##### "muhatestMC" and "VhatestMC", which were computed as mentioned in the exam.
resulttemp <- read.table(file = name1, sep = ",");
muhatmp <- round(resulttemp[,3], 6);   ## your predicted values for mu in 6 digits
Vhatmp  <- round(resulttemp[,4], 6);   ## your predicted values for V in 6 digits
MSEmu <- mean((muhatestMC - muhatmp)^2);
MSEV  <- mean((VhatestMC - Vhatmp)^2);
##### Your technical scores will be based on the MSEmu and MSEV values.
##### In general, the smaller the MSEs, the better.
##### However, there is no universal answer on how small is small.
##### Also, it is more difficult to predict the Variance accurately than the Mean.
##### END #####