Third homework
The datasets for the third homework are available in xlsx (Excel) format on Moodle
under Week 11, together with a short summary; the file is named Homework3_data.
Each dataset is on a separate sheet of the Excel file. Everyone should carry out the
analysis on the dataset listed next to their Neptun code in the table below.
You can get at most 11 points for the third homework. Upload to Moodle a briefly
commented, clearly structured R script and a written analysis of at least 3000 characters
(Times New Roman, 12pt, 1.5 line spacing) in Word or PDF format. In the analysis,
summarize the statistical methods applied, interpret the results and draw logical
conclusions. The Word/PDF file should have a title page with your name, and it must
also state the name of the dataset.
Submissions that are very similar to each other will receive 0 points.
Deadline: 2021.12.13. Monday 11:55 pm, Moodle
Tasks:
1. Import the data into an R data frame. Determine the type of each variable and convert
the nominal and logical variables accordingly. Handle possible errors in the data
(e.g. exclude implausible values). Pay attention to factor variables.
2. Build a binary logistic regression model with all the possible explanatory
variables (the dependent variable is highlighted in yellow in the Excel file).
Interpret the significant beta coefficients.
3. Perform model selection based on significance levels and information criteria,
where this is reasonable.
4. Analyze the classification matrix of the best model with a 50% cut-off. Interpret
all 5 related measures of explanatory power (accuracy, the two recalls and the two precisions).
5. Select the positive category of the dependent variable, create a ROC curve, and evaluate
the model based on the area under the ROC curve (AUC).
6. Change the cut-off value, and justify the new cut-off using one of the methods
learnt in the course. Present how the classification matrix and the related
measures of explanatory power change under the new cut-off.
7. Evaluate the best model based on McFadden's pseudo R^2 and the global Chi^2
test (independence test).
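The tasks above can be sketched in base R as follows. This is only a minimal illustration under stated assumptions, not a model solution: the data are simulated, all variable names (age, income, group, y) are hypothetical, backward selection by AIC is just one possible selection strategy, and Youden's index is assumed as the cut-off method, which may differ from the methods covered in the course.

```r
# Minimal sketch on simulated data; every variable name here is hypothetical.
set.seed(1)
n <- 500
df <- data.frame(
  age    = round(rnorm(n, mean = 40, sd = 10)),
  income = rnorm(n, mean = 50, sd = 15),
  group  = sample(c("a", "b", "c"), n, replace = TRUE)
)
lin  <- -4 + 0.06 * df$age + 0.03 * df$income + 0.5 * (df$group == "b")
df$y <- rbinom(n, size = 1, prob = plogis(lin))

# Task 1: store nominal variables as factors. For a factor response, glm()
# models the probability of the second level ("yes"), the positive category.
df$group <- factor(df$group)
df$y     <- factor(df$y, levels = c(0, 1), labels = c("no", "yes"))

# Task 2: full model; exp(beta) is the multiplicative change in the odds.
full <- glm(y ~ ., data = df, family = binomial)
odds_ratios <- exp(coef(full))

# Task 3: backward selection by AIC (an assumed choice of criterion).
best <- step(full, direction = "backward", trace = 0)

# Task 4: classification matrix and the 5 related measures.
p_hat <- predict(best, type = "response")
classify <- function(p, cutoff) {
  factor(ifelse(p > cutoff, "yes", "no"), levels = c("no", "yes"))
}
measures <- function(actual, predicted) {
  cm <- table(actual = actual, predicted = predicted)
  list(
    matrix        = cm,
    accuracy      = sum(diag(cm)) / sum(cm),
    recall_yes    = cm["yes", "yes"] / sum(cm["yes", ]),
    recall_no     = cm["no", "no"]   / sum(cm["no", ]),
    precision_yes = cm["yes", "yes"] / sum(cm[, "yes"]),
    precision_no  = cm["no", "no"]   / sum(cm[, "no"])
  )
}
m50 <- measures(df$y, classify(p_hat, 0.5))

# Task 5: ROC curve points and AUC (rank-based Mann-Whitney formula).
pos  <- df$y == "yes"
cuts <- sort(unique(p_hat))
tpr  <- sapply(cuts, function(k) mean(p_hat[pos] > k))
fpr  <- sapply(cuts, function(k) mean(p_hat[!pos] > k))
if (interactive()) {
  plot(c(1, fpr), c(1, tpr), type = "l", xlab = "1 - specificity",
       ylab = "Sensitivity", main = "ROC curve")
  abline(0, 1, lty = 2)
}
n1  <- sum(pos)
n0  <- sum(!pos)
auc <- (sum(rank(p_hat)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Task 6: new cut-off maximizing Youden's index (TPR - FPR), an assumed method.
new_cut <- cuts[which.max(tpr - fpr)]
m_new   <- measures(df$y, classify(p_hat, new_cut))

# Task 7: McFadden's pseudo R^2 and the global Chi^2 (likelihood-ratio) test
# of the best model against the intercept-only model.
mcfadden <- 1 - best$deviance / best$null.deviance
lr_stat  <- best$null.deviance - best$deviance
lr_df    <- best$df.null - best$df.residual
p_value  <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(m50$matrix)
print(m_new$matrix)
print(c(auc = auc, new_cut = new_cut, mcfadden = mcfadden, p_value = p_value))
```

On a real sheet, the simulated data frame would be replaced by the imported data, and the factor levels of the dependent variable would have to be ordered so that the chosen positive category is the second level.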
An adequate R script (code) is worth at most 5 points, and the analysis at most 6 points.
You have to interpret the parameters and statistical measures, but the analysis has
to go further than that. You have to draw conclusions from the results. For instance: Why
should some explanatory variables be excluded from the model? Is there anything surprising
in the signs of the coefficients? Evaluate the classification matrix properly! What is the
reasoning behind the chosen cut-off value? Etc.
If you have any questions, do not hesitate to send me an e-mail
(zsombor.szadoczki@uni-corvinus.hu) or message me on Teams. Good luck! 🙂
Neptun code Dataset
bhxc4h Women
c49ja2 MBA
d341y8 MBA
dp7pwa Default
dszh0c MBA
eeghro Employees
fiiz02 Women
foyx5z MBA
gyoyqp Employees
he1mfo MBA
hgv9nd Women
i7xfpl Employees
icw7i0 Default
iinq0p MBA
jfvcsl MBA
jhxyhn Employees
jmv2fp Women
kb9ovg Women
lukyuy Default
ndkj74 Women
ospyt6 Employees
pewimw Employees
rkmzti Default
sen7m8 Women
um8chw Default
urghwh MBA
ve3tdh Women
vg8b92 Women
w9g4w6 MBA
wu6brx Employees
xfa90n Women