Logistic Regression¶
In this notebook we explore prediction tasks where the response variable is categorical rather than numeric, using a common classification technique known as logistic regression. We apply this technique to a dataset containing survival data for the passengers of the Titanic.
As part of the analysis, we will be doing the following:
Data extraction: we’ll load the dataset and have a look at it.
Cleaning: we’ll fill in some of the missing values.
Plotting: we’ll create several charts that will (hopefully) help identify correlations and other insights.
The dataset and more information regarding this dataset, including more tutorials can be found here: https://www.kaggle.com/c/titanic
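Before diving in, it helps to recall the mechanic at the heart of logistic regression: the model passes a linear score through the logistic (sigmoid) function to turn it into a probability. A minimal sketch (not part of the original notebook):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))  # 0.5
```

Large positive scores map close to 1, large negative scores close to 0; scikit-learn's `LogisticRegression` fits the weights of the linear score for us.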
Install Relevant Libraries¶
#!pip install numpy
#!pip install pandas
#!pip install scikit-learn
#!pip install matplotlib
#!pip install seaborn
Import Relevant Libraries¶
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
# The following line is needed to show plots inline in notebooks
%matplotlib inline
Read in Data¶
trainDF = pd.read_csv('Titanic_train.csv')
testDF = pd.read_csv('Titanic_test.csv')
General Information of Imported Data¶
First, let’s have a look at our data:
Training Data¶
trainDF.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. ( Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. ( Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. male 35.0 0 0 373450 8.0500 NaN S
trainDF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
trainDF.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Testing Data¶
testDF.head()
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James ( ) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
testDF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
testDF.describe()
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200
Above, the Survived column from the training set is the target/dependent/response variable. A value of 1 means the passenger survived, and a value of 0 means the passenger died.
There are also various features (variables) that describe each passenger:
PassengerID: ID assigned to traveller on boat
Pclass: The class of the passenger, either 1, 2, or 3
Name: Name of the passenger
Sex: Sex of the passenger
Age: Age of the passenger
SibSp: Number of siblings/spouses travelling with the passenger
Parch: Number of parents/children travelling with the passenger
Ticket: The ticket number of the passenger
Fare: The ticket fare of the passenger
Cabin: The cabin number of the passenger
Embarked: The port of embarkation, either S, C, or Q (C = Cherbourg, Q = Queenstown, and S = Southampton)
Data Exploration¶
First, let’s look at the number of people that survived.
sns.countplot(x='Survived', data=trainDF)
Most people did not survive.
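The countplot’s message can be double-checked numerically with value_counts; a small sketch on toy data (not the notebook’s actual column):

```python
import pandas as pd

# Toy stand-in for trainDF['Survived'] (0 = died, 1 = survived)
survived = pd.Series([0, 0, 0, 1, 1])
counts = survived.value_counts()
print(counts[0], counts[1])  # 3 2
```

On the real training set this would show 549 deaths against 342 survivors, matching the 0.3838 mean from describe() above.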
Let’s break this down further by looking at the number of people that survived by sex.
sns.countplot(x='Survived', hue='Sex', data=trainDF)
Here we can see that more males died than females, and that most of the females survived.
Now let’s look at the survival count by class.
sns.countplot(x='Survived', hue='Pclass', data=trainDF)
Here we can see that the majority of first class survived while the majority of third class perished.
Let’s look at the fare distribution.
plt.hist(x='Fare', data=trainDF, bins=40)
Here we can see that most people paid under $50, though there are some outliers around the $500 mark. This is explained by the class sizes: third class, which paid the lowest fares, has the most passengers, while first class has the fewest, so low fares dominate the histogram.
Finally, let’s look at the missing data using a heatmap.
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(trainDF.isnull(), cmap='coolwarm', yticklabels=False, cbar=False, ax=ax)
Let’s look at all the samples that have NaN values in the Pandas dataframe.
trainDF[(trainDF['Age'].isnull()) | (trainDF['Cabin'].isnull()) | (trainDF['Embarked'].isnull())]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
… … … … … … … … … … … … …
884 885 0 3 Sutehall, Mr. male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William ( ) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 889 0 3 Johnston, Miss. “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
708 rows × 12 columns
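A per-column count of missing values gives the same information as the heatmap numerically; a small sketch on a toy frame mirroring the Age/Cabin gaps:

```python
import pandas as pd

# Toy frame mirroring the missing Age and Cabin values in trainDF
df = pd.DataFrame({'Age': [22.0, None, 35.0], 'Cabin': [None, 'C85', None]})
missing = df.isnull().sum()
print(missing['Age'], missing['Cabin'])  # 1 2
```

On the real training set, `trainDF.isnull().sum()` would report 177 missing Age values, 687 missing Cabin values, and 2 missing Embarked values, consistent with the info() output above.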
Let’s do the same thing with the unlabelled set.
fig, ax = plt.subplots(figsize=(12,5))
sns.heatmap(testDF.isnull(), cmap='coolwarm', yticklabels=False, cbar=False, ax=ax)
testDF[(testDF['Age'].isnull()) | (testDF['Cabin'].isnull()) | (testDF['Embarked'].isnull())]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James ( ) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
… … … … … … … … … … … …
412 1304 3 Henriksson, Miss. female 28.0 0 0 347086 7.7750 NaN S
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
415 1307 3 Saether, Mr. male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
417 1309 3 Peter, Master. male NaN 1 1 2668 22.3583 NaN C
331 rows × 11 columns
Data Cleaning¶
Now let’s clean up the data so that it can be used with a scikit-learn model.
Missing Data¶
Embarked Nulls¶
First, let’s deal with the NaNs in our dataset, starting with those in the Embarked feature.
We have two passengers with no port of embarkation recorded. Both survived, share the same ticket number, and travelled in first class.
trainDF[trainDF.Embarked.isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. ( ) female 62.0 0 0 113572 80.0 B28 NaN
Let’s look at the survival chances depending on the port of embarkation.
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(15,5))
# Plot the number of occurrences for each embarked location
sns.countplot(x='Embarked', data=trainDF, ax=ax1)
# Plot the number of people that survived by embarked location
sns.countplot(x='Survived', hue='Embarked', data=trainDF, ax=ax2, order=[1,0])
# Group by Embarked, and get the mean survival rate for each
# embarked location
embark_pct = trainDF[['Embarked','Survived']].groupby(['Embarked'], as_index=False).mean()
# Plot the above mean
sns.barplot(x='Embarked', y='Survived', data=embark_pct, order=['S','C','Q'], ax=ax3)
Here we can see that most people embarked from S, so S also accounts for the most survivors in absolute terms. However, when we look at the survival rate by port of embarkation, S had the lowest.
This is not definitive enough to conclude which port the two passengers above boarded from, so let’s look at other variables that may indicate where they embarked.
Let’s check whether anyone else shared their ticket number.
trainDF.loc[trainDF['Ticket'] == '113572']
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. ( ) female 62.0 0 0 113572 80.0 B28 NaN
No other passengers share that ticket number. Let’s look for people of the same class that paid similar fares.
trainDF[(trainDF['Pclass'] == 1) & (trainDF['Fare'] > 75) & (trainDF['Fare'] < 85)].groupby('Embarked')['PassengerId'].count()
Embarked
C    16
S    13
Name: PassengerId, dtype: int64
Of the people in the same class that paid similar fares, 16 embarked from C and 13 embarked from S.
Since most of these similar passengers come from C, and passengers that embarked from C have the highest survival rate, we conclude that these two passengers most likely embarked from C. We will now change their Embarked value to C.
# Set value (use .loc, which accepts a boolean mask; .at is for single labels only)
trainDF.loc[trainDF['Embarked'].isnull(), 'Embarked'] = 'C'
trainDF[trainDF['Embarked'].isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Fare nulls¶
Now let's deal with the missing values in the Fare column.
testDF[testDF['Fare'].isnull()]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
152 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S
Let’s visualize a histogram of the fares paid by the 3rd-class passengers who embarked from Southampton.
fig,ax = plt.subplots(figsize=(8,5))
testDF[(testDF['Pclass'] == 3) & (testDF['Embarked'] == 'S')]['Fare'].hist(bins=100, ax=ax)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Histogram of Fare for Pclass = 3, Embarked = S')
Text(0.5, 1.0, 'Histogram of Fare for Pclass = 3, Embarked = S')
print ("The top 5 most common fares:")
testDF[(testDF['Pclass'] == 3) & (testDF['Embarked'] == 'S')]['Fare'].value_counts().head()
The top 5 most common fares:
8.0500 17
7.7750 10
7.8958 10
8.6625 8
7.8542 8
Name: Fare, dtype: int64
Let’s fill the missing value with the most common fare, $8.05.
# Fill value (use .loc, which accepts a boolean mask; .at is for single labels only)
testDF.loc[testDF['Fare'].isnull(), 'Fare'] = 8.05
testDF[testDF['Fare'].isnull()]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
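Rather than hard-coding 8.05, the modal fare could also be computed and filled programmatically; a small sketch on toy data (not the notebook’s code):

```python
import pandas as pd

# Toy fares with one missing value
fares = pd.Series([8.05, 8.05, 7.75, None])
mode_fare = fares.mode().iloc[0]   # the most common fare
fares = fares.fillna(mode_fare)
print(fares.tolist())  # [8.05, 8.05, 7.75, 8.05]
```

This keeps the fill value in sync with the data if the dataset changes.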
Age nulls¶
Now let’s fill the missing age data in both the training and test sets. One simple approach is to fill the NaNs with the mean of the column; this is known as imputation. We can make the fill more intelligent by using the mean age per class.
trainDF[trainDF['Age'].isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
17 18 1 2 Williams, Mr. male NaN 0 0 244373 13.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. male NaN 0 0 2631 7.2250 NaN C
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
... ... ... ... ... ... ... ... ... ... ... ... ...
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
863 864 0 3 Sage, Miss. "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
888 889 0 3 Johnston, Miss. "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
177 rows × 12 columns
plt.figure(figsize=(12,7))
sns.boxplot(x='Pclass', y='Age', data=trainDF)
trainDF.groupby('Pclass')['Age'].mean()
Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64
We see that the higher the class, the higher the average age, which makes sense. We can now write a function that fills the NaN age values using the class means above.
def fixNaNAge(age, pclass):
    if age == age:  # NaN != NaN, so non-missing ages pass through unchanged
        return age
    # Fill with the (rounded) mean age of the passenger's class
    if pclass == 1:
        return 38
    elif pclass == 2:
        return 30
    else:
        return 25
Now we will fill the age NaNs in both the training and testing dataframe and verify that they were filled correctly.
trainDF['Age'] = trainDF.apply(lambda row: fixNaNAge(row['Age'], row['Pclass']), axis=1)
testDF['Age'] = testDF.apply(lambda row: fixNaNAge(row['Age'], row['Pclass']), axis=1)
trainDF[trainDF['Age'].isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
testDF[testDF['Age'].isnull()]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
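Equivalently, the per-class mean fill can be done in one step with a groupby-transform; a small sketch on toy data (not the notebook’s code):

```python
import pandas as pd

# Toy frame: one missing Age in class 1 and one in class 3
df = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3],
                   'Age':    [40.0, None, 30.0, None, 20.0]})
# Replace each NaN with the mean Age of its own Pclass group
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.mean()))
print(df['Age'].tolist())  # [40.0, 40.0, 30.0, 20.0, 20.0]
```

This avoids hard-coding the class means, at the cost of being a little less explicit than the fixNaNAge function above.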
fig = plt.figure(figsize=(15,5))
trainDF['Age'].astype(int).hist(bins=70)
testDF['Age'].astype(int).hist(bins=70)
facet = sns.FacetGrid(trainDF, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, trainDF['Age'].max()))
facet.add_legend()
fig, ax = plt.subplots(1,1,figsize=(18,4))
age_mean = trainDF[['Age','Survived']].groupby(['Age'], as_index=False).mean()
sns.barplot(x='Age', y='Survived', data=age_mean)
Cabin nulls¶
Finally, for the cabin column, we are missing too much information to fill it properly so we can drop the feature entirely.
trainDF.drop('Cabin', axis=1, inplace=True)
testDF.drop('Cabin', axis=1, inplace=True)
Adding features¶
The names have a prefix that, in some cases, is indicative of social status, which may have been an important factor in surviving the accident.
For example: Braund, Mr.; Heikkinen, Miss.; y Ocana, Dona.; Master.
Let’s extract the passenger titles and store them in an additional column called Title.
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Noble",
    "Don": "Noble",
    "Sir": "Noble",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Noble",
    "Dona": "Noble",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Noble"
}
trainDF['Title'] = trainDF['Name'].apply(lambda x: Title_Dictionary[x.split(',')[1].split('.')[0].strip()])
testDF['Title'] = testDF['Name'].apply(lambda x: Title_Dictionary[x.split(',')[1].split('.')[0].strip()])
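The imports at the top of the notebook (LogisticRegression, train_test_split, confusion_matrix) suggest the remaining steps: encode the categorical features numerically, split the data, fit the model, and inspect the confusion matrix. A hedged sketch on synthetic data stands in for that continuation; the feature names below are assumptions, not the notebook’s actual columns:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic features
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Age': rng.uniform(1, 80, n),
    'Fare': rng.uniform(5, 100, n),
    'Sex_male': rng.integers(0, 2, n),  # e.g. from pd.get_dummies on Sex
})
y = (X['Sex_male'] == 0).astype(int)    # toy target: all females survive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))
```

On the real data, the categorical Sex, Embarked, and Title columns would first be one-hot encoded (for example with pd.get_dummies) before fitting.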