2019/10/14 Module 3 Assignment
Module 3 Assignment
Due: No due date. Points: 20. Submitting: a website URL or a file upload.
Topics for Assignment 3:
Visualizations in Python with Matplotlib and Plotly
Decision Trees, Creating Testing and Training Data, and Confusion Matrices
Steps:
1) Get a Plotly Account and explore it a bit.
– Make sure you can log in, access the dashboard, and get keys, etc.
– Look at the tutorials and example code.
https://plot.ly/
https://plot.ly/python/
2) Go to Kaggle to download the Titanic Training Dataset. This dataset will look like this:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 |  | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
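As a minimal sketch of a first look at the data, the snippet below reads the four sample rows from this handout with pandas and checks the shape, column types, and missing values. The rows are embedded as CSV text only so that the example runs without the Kaggle download; the file name `train.csv` in the comment is an assumption about how the downloaded file is named.

```python
import io

import pandas as pd

# The four sample rows from this handout, written as CSV text.
# With the real download, pd.read_csv("train.csv") would be used instead
# ("train.csv" is an assumed file name).
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
'''

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # column types inferred by pandas
print(df.isna().sum())  # missing values per column (Cabin is blank in two rows)
```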
3) Clean and prepare the data. You should already have much of this done from previous Assignments. You can always repurpose your own work. When you clean, REMOVE variables that you will not use for analysis, such as Name. YOU determine which variables are useful and which are not.
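One possible cleaning pass is sketched below, under stated assumptions. The columns to drop, the median fill for missing Age, and the 0/1 encoding of Sex are illustrative choices, not the required ones. The first four rows follow the handout's sample rows; the fifth row is made up to show a missing Age.

```python
import pandas as pd

# Small stand-in frame using the handout's column names.
# Row 5 is hypothetical and exists only to demonstrate a missing Age.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
             "Hypothetical, Mr. Example"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22, 38, 26, 35, None],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "000000"],
    "Cabin": [None, "C85", None, "C123", None],
})

# Assumed choice of variables to remove; the assignment leaves this to you.
drop_cols = ["PassengerId", "Name", "Ticket", "Cabin"]
clean = df.drop(columns=drop_cols)

# Fill missing ages with the median age (one common strategy among several).
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Decision trees in scikit-learn need numeric inputs, so encode Sex as 0/1.
clean["Sex"] = clean["Sex"].map({"male": 0, "female": 1})

print(clean)
```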
4) A very important part of evaluating, understanding, cleaning, and preparing data is visualization. Choose (smartly) 6 of the most important variables from the dataset. Write code in Python to visualize each variable. Use matplotlib for three of the visualizations and use plotly for the other three.
– Use color
– Use proper labels and titles
– IMPORTANT: choose the right plot for the variable you are visualizing. For example, you would not choose a pie graph for Age or a histogram for Sex.
– Under each vis, write exactly three sentences (no more and no less) that describe the *information* that the vis is showing. So, do not describe the vis itself. For example, do not say, “this is a bar graph…” as that part is clear from looking at it 🙂 Instead, ask yourself what the vis *tells* us about the variable. Is the variable balanced? Are there more, fewer, or a similar number of males and females? Are the ages normally distributed? Are the ages normally distributed just for those who survived? etc. These are just thoughts and suggestions to help you start thinking about using vis to LEARN about data.
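The point about matching the plot to the variable type can be sketched with matplotlib as follows: a histogram for the numeric Age variable and a bar chart of counts for the categorical Sex variable. The data values, colors, and titles are made up for illustration; substitute your cleaned Titanic DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22, 38, 26, 35, 54, 2],
})

fig, (ax_age, ax_sex) = plt.subplots(1, 2, figsize=(10, 4))

# Age is numeric, so a histogram shows how the values are distributed.
ax_age.hist(df["Age"], bins=5, color="steelblue", edgecolor="black")
ax_age.set_title("Distribution of passenger age")
ax_age.set_xlabel("Age (years)")
ax_age.set_ylabel("Number of passengers")

# Sex is categorical, so a bar chart of counts is the better fit.
sex_counts = df["Sex"].value_counts()
ax_sex.bar(list(sex_counts.index), list(sex_counts.values),
           color=["darkorange", "seagreen"])
ax_sex.set_title("Passengers by sex")
ax_sex.set_xlabel("Sex")
ax_sex.set_ylabel("Number of passengers")

fig.tight_layout()
# Call fig.savefig("some_name.png") or plt.show() to export or view the figure.
```

For the Plotly half of the task, Plotly Express offers the analogous `px.histogram` and `px.bar` functions.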
5) Once you have done the above, you will add Decision Tree code to your Python code. In other words, you will build several decision trees that are “built” with training data that you create and then tested with test data that you create.
HINTS:
– You will need to take this Titanic Training Dataset from Kaggle and split it into two datasets (using Python, not by hand). The two datasets are Training and Testing. Do not worry that the name of the dataset in Kaggle is “Titanic Training Dataset”. The name of the dataset does not matter. You can take any labeled data, pull out a subset as the Testing set, and use the rest as the Training set. Starting from this Kaggle dataset, write Python code to create a Testing and a Training dataset.
– Notice that this dataset is a labeled dataset. The labels are called “Survived” and the possible values are 0 (did not survive) and 1 (did survive).
– Because the Testing data must be unlabeled when it is given to the model, remove BUT SAVE the labels IN ANOTHER DF (DataFrame). You will need the actual testing set labels to see if your model did well in its predictions.
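The hints above can be sketched with pandas alone. The 20% test fraction, the random seed, and the ten made-up rows are assumptions for illustration; the real input would be your cleaned Titanic DataFrame.

```python
import pandas as pd

# Stand-in for the cleaned dataset: ten made-up rows with a Survived label.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 2, 2],
    "Age": [22, 38, 26, 35, 35, 30, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Hold out a random 20% of the rows as the Testing set (fraction and seed
# are assumed values); the remaining rows form the Training set.
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

# Remove the labels from the Testing set BUT SAVE them in another DataFrame.
test_labels = test_df[["Survived"]].copy()
test_df = test_df.drop(columns="Survived")

print(len(train_df), "training rows;", len(test_df), "testing rows")
```

Because `test_labels` keeps the original row index, the saved labels can be lined up with the model's predictions later.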
6) Once you have a Testing set and a Training set, use Python to code, train, and test a Decision Tree.
7) Create a Confusion Matrix to show the results.
– What is the accuracy?
– What is the error rate?
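A minimal sketch of steps 6 and 7 with scikit-learn follows. The features here are synthetic (generated with `make_classification`) and only stand in for the cleaned Titanic variables and the Survived labels; `max_depth=3` is likewise an assumed setting. Accuracy is the share of correct predictions (the diagonal of the confusion matrix divided by the total), and the error rate is 1 minus the accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features (X) and Survived labels (y).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the Training set, then predict the unlabeled Testing set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are the actual classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = cm.trace() / cm.sum()  # correct predictions / all predictions
error_rate = 1 - accuracy         # misclassified predictions / all predictions

print(cm)
print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
```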
8) Next, update your code to perform 5-fold cross validation (for the DT). This means that you will use 1/5 of the data for the test set and 4/5 for the training set. You will train the model. You will test the model. You will build a confusion matrix for the results. Then, you will repeat these steps 4 more times with a DIFFERENT test set each time. You will end up with 5 confusion matrices. What is the average accuracy of your model?
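The 5-fold procedure can be sketched with scikit-learn's `KFold`, which yields five disjoint test folds so that every row is tested exactly once. As before, the data is synthetic and the tree depth and random seeds are assumed values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features (X) and Survived labels (y).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
matrices = []
accuracies = []

for train_idx, test_idx in kfold.split(X):
    # A fresh tree per fold: train on 4/5 of the rows, test on the other 1/5.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    y_pred = tree.predict(X[test_idx])

    cm = confusion_matrix(y[test_idx], y_pred, labels=[0, 1])
    matrices.append(cm)
    accuracies.append(cm.trace() / cm.sum())

average_accuracy = sum(accuracies) / len(accuracies)

for fold_number, cm in enumerate(matrices, start=1):
    print(f"Fold {fold_number}:\n{cm}")
print(f"average accuracy = {average_accuracy:.3f}")
```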
9) Finally, update your dataset so that:
– You create a new variable that is the discretization of the Age variable. In other words, place the ages into 5 categories. You can choose the groups (wisely).
– In this new dataset, you will have a NEW variable that is called AgeBins and you will remove the original Age variable.
– Next, create a new variable that is the sum of SibSp and Parch. Once you have this new variable, remove SibSp and Parch.
– Once you do the above, you will have a “new” looking dataset that contains qualitative Age data, and fewer dimensions. Think about why that is good – especially when training a model.
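The dataset update in step 9 can be sketched with `pd.cut`. The bin edges, the bin labels, and the column name `FamilySize` are assumptions chosen for illustration (the handout only names AgeBins). The first four rows follow the handout's sample rows; the remaining rows are made up.

```python
import pandas as pd

# First four rows follow the handout's sample; the last four are hypothetical.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 2, 14, 54, 70],
    "SibSp": [1, 1, 0, 1, 3, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 1, 0, 0, 0],
})

# Five age categories; each bin includes its right edge, e.g. (18, 35].
bin_edges = [0, 12, 18, 35, 60, 120]
bin_labels = ["Child", "Teen", "YoungAdult", "Adult", "Senior"]
df["AgeBins"] = pd.cut(df["Age"], bins=bin_edges, labels=bin_labels)

# Combine siblings/spouses and parents/children into one family-size count.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Remove the original variables now that the derived ones exist.
df = df.drop(columns=["Age", "SibSp", "Parch"])

print(df)
```

AgeBins is categorical text, so it would still need a numeric encoding (for example, ordinal codes) before being passed to a scikit-learn decision tree.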
10) Using this new dataset, create one test set (1/5) and one training set (4/5). Create a new DT and its confusion matrix. How did this updated dataset do compared to the other?
11) ** For every DT model you build, create one confusion matrix AND one decision tree visualization.
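One way to produce a decision tree visualization is scikit-learn's `plot_tree`, sketched below. The data is synthetic, and the feature names and class names are placeholders attached to the synthetic columns purely so the plot has readable labels.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data; the names below are placeholders, not real columns.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["Pclass", "Sex", "AgeBins", "FamilySize"]
class_names = ["Died", "Survived"]  # class 0, class 1

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
artists = plot_tree(
    tree,
    feature_names=feature_names,
    class_names=class_names,
    filled=True,  # color each node by its majority class
    ax=ax,
)
ax.set_title("Decision tree (max depth 3)")
# Call fig.savefig("some_name.png") to export the image for the Word document.
```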
HOW TO SUBMIT:
1) Create a Word Document that shows and discusses all of your results.
2) Create a zip that contains the Word doc and the .py code.
3) Submit the zip to Canvas. Do not use any other formats, such as rar or pdf.
NOTE: All of your visualizations and results will go into the Doc. Think about how to write up this doc so that a hiring manager can understand it (and your skill) – because I might share it!
https://georgetown.instructure.com/courses/81420/assignments/229235