Coursework 1 MN-M535
The Boston Housing SPSS file contains information on 506 census tracts in the Boston area, Massachusetts (USA). The data were obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston).
Each case in the dataset has 9 attributes:
CRIM per capita crime rate by town
INDUS proportion of non-retail business acres per town
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
PTRATIO pupil-teacher ratio by town
B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
TAX tax band levied by the government
The goal is to predict the tax band levied by the government (TAX) from the other eight attributes, which serve as the predictors.
Why should the data be partitioned into training, validation and test sets? What will the training and test sets be used for in this task?
• Partition the data into the training and test sets in the proportion of 70/30. Perform the subsequent tasks on the training set only.
• Explore the data set using descriptive statistics, boxplots and histograms. Based on these, describe the data and draw relevant conclusions. Do the data need cleaning? Why? How would you clean them?
• Compute the correlation table for the predictors and look for highly correlated pairs. Highly correlated predictors are potentially redundant and can cause multicollinearity. Discuss the results (e.g. which variables could or should be removed from the data set, and why).
• Fit a binary logistic regression to predict TAX using the ‘Enter’ attribute selection method. Interpret the SPSS outputs.
• Write the logistic regression model and interpret it.
• Re-run the binary logistic regression model using the ‘Stepwise’ attribute selection method. Explain the outputs and how the attribute selection was performed using this method.
• How does the model classify housing in a town with a per capita crime rate of CRIM = 1?
• What is the minimum per capita crime rate by town (CRIM) at which housing would be classified into tax band 2?
• Discuss the logistic regression model performance by calculating accuracy metrics such as Precision, Recall and F-measure.
• Now, using the final trained model, predict the tax band on the test data set and estimate the accuracy rate. Discuss the results.
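For the 70/30 partition task, the idea can be sketched outside SPSS as a seeded random shuffle followed by a cut. This is only an illustrative sketch: the helper name `split_cases`, the seed value and the use of case IDs 0 to 505 are assumptions, not part of the coursework file.

```python
import random

def split_cases(cases, train_fraction=0.7, seed=42):
    """Shuffle a copy of the cases and split it into (train, test)."""
    shuffled = list(cases)
    # A fixed seed (assumed value) makes the partition reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in IDs for the 506 cases; the real file would be read from SPSS.
case_ids = range(506)
train_ids, test_ids = split_cases(case_ids)
print(len(train_ids), len(test_ids))  # 354 152
```

Fixing the seed means the same cases land in the training set on every run, so results from later tasks remain comparable.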
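For the correlation task, one possible way to screen for highly correlated predictor pairs is to compute Pearson's r for every pair and flag those whose absolute value exceeds a cutoff. The sketch below assumes a cutoff of 0.7, which is a common rule of thumb rather than a value given in the brief, and the numbers in `toy` are made up for illustration and are not taken from the Boston data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Made-up toy values, NOT real Boston Housing observations.
toy = {
    "INDUS": [2.0, 4.0, 6.0, 8.0, 10.0],
    "AGE": [21.0, 39.0, 62.0, 80.0, 99.0],
    "RM": [6.0, 5.0, 7.0, 5.5, 6.5],
}
flagged = highly_correlated_pairs(toy)
print(flagged)
```

In this toy example only the INDUS/AGE pair is flagged, which would make one of the two a candidate for removal before fitting the model.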
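The two CRIM questions can be answered directly from the fitted equation. As a minimal sketch, assume a model with CRIM as the only predictor, TAX coded as band 1 or band 2 with band 2 as the predicted event, and a classification cutoff of 0.5. The coefficients `B0` and `B1` below are hypothetical placeholders and must be replaced with the values from your own SPSS output.

```python
from math import exp, log

# Hypothetical coefficients for illustration only; substitute the constant
# and the CRIM coefficient (column B) from your own SPSS output.
B0 = -1.2  # assumed intercept
B1 = 0.8   # assumed CRIM coefficient

def prob_band2(crim, b0=B0, b1=B1):
    """P(TAX = band 2 | CRIM) under a single-predictor logistic model."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * crim)))

def classify(crim, cutoff=0.5):
    """Assign band 2 when the predicted probability reaches the cutoff."""
    return 2 if prob_band2(crim) >= cutoff else 1

def crim_boundary(cutoff=0.5, b0=B0, b1=B1):
    """CRIM value at which the predicted probability equals the cutoff."""
    return (log(cutoff / (1.0 - cutoff)) - b0) / b1

print(classify(1.0))     # predicted band for CRIM = 1
print(crim_boundary())   # CRIM at which the prediction switches band
```

With a positive CRIM coefficient the boundary is the minimum CRIM classified as band 2. If the fitted coefficient turns out to be negative, the same value is instead the maximum CRIM classified as band 2.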
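For the performance task, Precision, Recall and F-measure all follow from the counts in the classification table. The sketch below assumes band 2 is treated as the positive class. The label lists are made-up examples, not model output.

```python
def classification_metrics(actual, predicted, positive=2):
    """Compute precision, recall, F-measure and accuracy for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}

# Made-up labels for illustration (1 = band 1, 2 = band 2).
actual = [2, 2, 2, 1, 1, 1, 1, 2]
predicted = [2, 2, 1, 1, 1, 2, 1, 2]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

The same function can be applied to the test-set predictions, so training and test performance are measured in the same way.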