Human Resources Analytics: Exploratory Data Analysis and Modeling
17/08/2017
I really enjoyed writing this notebook. If you like it or it helps you, you can upvote and/or leave a comment :).
1 Introduction
2 Load and check data
2.1 Load data
2.2 Check for missing values
3 Global data exploration
4 Detailed data exploration
4.1 Normalisation and dimensionality reduction
4.2 Global Radar Chart
4.3 Left and other features
4.4 Clustering analysis
5 Modeling
5.1 Decision Tree
5.2 Random Forest
1. Introduction
The Human Resources Analytics dataset provides information on the work situation of nearly fifteen thousand employees.
In this kernel I'll focus on one very important question: why are employees leaving the company?
To tackle this question, this notebook combines exploratory data analysis and modeling.
This dataset is perfect for this kind of detailed exploration because it contains a small number of features and a large number of individuals, so we can compute robust statistics. Firstly, I'll explore the dataset globally, then I'll focus on a detailed analysis of the employees who stayed and those who left, and I'll end with the data modeling.
This script follows three main parts:
Global data exploration
Detailed data exploration
Data modeling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
import pydotplus
%matplotlib inline
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap
from sklearn.cluster import KMeans
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed in modern scikit-learn
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import HoverTool
output_notebook()
import warnings
#import pydot
warnings.filterwarnings('ignore')
sns.set(style='white', context='notebook', palette='deep')
np.random.seed(seed=2)
2. Load and check data
2.1 Load the data
# Load the data
dataset = pd.read_csv("./kaggle_hr_analytics.csv")
dataset.shape
(14999, 10)
# Look at the first rows of the dataset
dataset.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.80 0.86 5 262 6 0 1 0 sales medium
1 0.11 0.88 7 272 4 0 1 0 sales medium
2 0.85 0.91 5 226 5 0 1 0 management medium
3 0.11 0.93 7 308 4 0 1 0 IT medium
4 0.10 0.95 6 244 5 0 1 0 IT medium
This dataset contains 14999 rows described by 10 features.
There are 8 numerical features and 2 categorical features:
'sales' is nominal (it actually holds the employee's department, as the head above shows),
'salary' is ordinal.
The feature of interest is 'left'. It is encoded in {0, 1}: 0 for employees who stayed and 1 for those who left.
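As a quick sanity check (a minimal sketch, assuming the dataset is loaded as above), we can confirm the feature types and the target encoding:
# The two object columns are 'sales' and 'salary'; the rest are numeric
print(dataset.dtypes)
# The binary target: 0 = stayed, 1 = left
print(dataset["left"].value_counts())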
2.2 Check for missing values
# Check for missing values
dataset.isnull().any()
satisfaction_level False
last_evaluation False
number_project False
average_montly_hours False
time_spend_company False
Work_accident False
left False
promotion_last_5years False
sales False
salary False
dtype: bool
The dataset is already clean: there are no missing values at all. Great!
3. Global data exploration
Here, I display histograms of the 10 features for a global analysis.
fig, axs = plt.subplots(ncols=2,figsize=(12,6))
g = sns.countplot(dataset["sales"], ax=axs[0])
plt.setp(g.get_xticklabels(), rotation=45)
g = sns.countplot(dataset["salary"], ax=axs[1])
plt.tight_layout()
plt.show()
plt.gcf().clear()
fig, axs = plt.subplots(ncols=3,figsize=(12,6))
sns.countplot(dataset["Work_accident"], ax=axs[0])
sns.countplot(dataset["promotion_last_5years"], ax=axs[1])
sns.countplot(dataset["left"], ax=axs[2])
plt.tight_layout()
plt.show()
plt.gcf().clear()
Our target variable ('left') is imbalanced, but since the majority class is less than 10x the minority it is still reasonable.
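To put a number on this imbalance, a quick check (sketch):
# Share of each class in the target; roughly three quarters of employees stayed
print(dataset["left"].value_counts(normalize=True))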
fig, axs = plt.subplots(ncols=3,figsize=(12,6))
sns.distplot(dataset["satisfaction_level"], ax=axs[0])
sns.distplot(dataset["last_evaluation"], ax=axs[1])
sns.distplot(dataset["average_montly_hours"], ax=axs[2])
plt.tight_layout()
plt.show()
plt.gcf().clear()
These distplots show something very interesting: there seem to be two distributions mixed together in satisfaction_level, last_evaluation and average_montly_hours.
Do they correspond to employees who stayed and employees who left?
fig, axs = plt.subplots(ncols=2,figsize=(12,6))
axs[0].hist(dataset["number_project"], bins=6)
axs[0].set_xlabel("number_project")
axs[0].set_ylabel("Count")
axs[1].hist(dataset["time_spend_company"], bins=10, color="r", range=(1,10))
axs[1].set_xlabel("time_spend_company")
axs[1].set_ylabel("Count")
plt.tight_layout()
plt.show()
plt.gcf().clear()
The number of projects and the time spent in the company seem to follow an extreme value distribution (Gumbel distribution).
time_spend_company is strongly positively skewed (right-skewed).
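We can quantify these skewness claims directly with pandas' built-in skew (a small sketch):
# Positive values confirm the right (positive) skew noted above
print(dataset["number_project"].skew())
print(dataset["time_spend_company"].skew())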
g = sns.heatmap(dataset.corr(), annot=True, cmap="RdYlGn")
It seems that employees who work hard and have many projects get better evaluations (corr(number_project, last_evaluation) = 0.35, corr(average_montly_hours, last_evaluation) = 0.34).
The most important thing in this correlation matrix is the negative correlation between 'left' and 'satisfaction_level' (-0.39): do employees leave because they are not happy at work?
Is that the only main reason?
Are there employee patterns that can explain it?
To address these questions, I perform a detailed analysis.
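A convenient starting point is to rank every numeric feature by its correlation with the target (a sketch; numeric_only needs a recent pandas, older versions simply drop the non-numeric columns):
# Correlation of each numeric feature with 'left', sorted ascending
print(dataset.corr(numeric_only=True)["left"].sort_values())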
4. Detailed data exploration
Firstly, I will perform a dimensionality reduction in order to identify groups.
dataset = dataset.drop(labels=["sales"], axis=1)
# Encode the ordinal salary as integer codes; CategoricalDtype replaces the
# deprecated astype("category", ordered=..., categories=...) signature
from pandas.api.types import CategoricalDtype
salary_dtype = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
dataset["salary"] = dataset["salary"].astype(salary_dtype).cat.codes
4.1 Normalisation and dimensionality reduction
# pca/isomap analysis
N = StandardScaler()
N.fit(dataset)
dataset_norm = N.transform(dataset)
Don't forget to normalize the data before the dimensionality reduction.
index = np.random.randint(0,dataset_norm.shape[0],size=10000)
dataset.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary
0 0.80 0.86 5 262 6 0 1 0 1
1 0.11 0.88 7 272 4 0 1 0 1
2 0.85 0.91 5 226 5 0 1 0 1
3 0.11 0.93 7 308 4 0 1 0 1
4 0.10 0.95 6 244 5 0 1 0 1
Because of the size of the dataset, the Isomap algorithm is very memory-hungry, so I randomly chose 10,000 points from the dataset.
The Isomap and PCA maps are very similar to the ones obtained from the full dataset and are much faster to compute.
pca = PCA(n_components=2)
pca_representation = pca.fit_transform(dataset_norm[index])
iso = Isomap(n_components=2, n_neighbors=40)
iso_representation = iso.fit_transform(dataset_norm[index])
left_colors = dataset["left"].map(lambda s: "g" if s == 0 else "r")
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
axes[0].scatter(pca_representation[:, 0], pca_representation[:, 1],
                c=left_colors[index], alpha=0.5, s=20)
axes[0].set_title("Dimensionality reduction with PCA")
axes[0].legend(["Left employee"])
axes[1].scatter(iso_representation[:, 0], iso_representation[:, 1],
                c=left_colors[index], alpha=0.5, s=20)
axes[1].set_title("Dimensionality reduction with Isomap")
axes[1].legend(["Left employee"])
The red points correspond to employees who left. Here the PCA doesn't show a clear separation between employees who left and who stayed. PCA performs a linear dimensionality reduction: the components it produces are linear combinations of the existing features, so it works well when the structure in the data is linear.
Here it seems that we need a non-linear reduction like the one Isomap performs. We can see a clear separation between the red and green points. An interesting fact is that there are two groups of employees who stayed (green points).
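One way to back this up (sketch) is to check how much variance the two linear components actually capture:
# If the two PCA components explain only a modest share of the total variance,
# a non-linear embedding such as Isomap is a sensible next step
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())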
Let's represent this with an interactive plot.
# Plot only the first 2000 sampled points to keep the interactive figure light;
# all ColumnDataSource columns must have the same length
sub_index = index[:2000]
source_dataset = ColumnDataSource(
    data=dict(
        x=iso_representation[:2000, 0],
        y=iso_representation[:2000, 1],
        desc=dataset.loc[sub_index, "left"],
        colors=["#%02x%02x%02x" % (int(c * 255), int((1 - c) * 255), 0)
                for c in dataset.loc[sub_index, "left"]],
        satisfaction_level=dataset.loc[sub_index, "satisfaction_level"],
        last_evaluation=dataset.loc[sub_index, "last_evaluation"],
        number_project=dataset.loc[sub_index, "number_project"],
        time_spend_company=dataset.loc[sub_index, "time_spend_company"],
        average_montly_hours=dataset.loc[sub_index, "average_montly_hours"]))
# Tooltips reference the ColumnDataSource fields defined above
hover = HoverTool(tooltips=[("Left", "@desc"),
                            ("Satisf. level", "@satisfaction_level"),
                            ("#projects", "@number_project"),
                            ("Last eval.", "@last_evaluation"),
                            ("Time in Company", "@time_spend_company"),
                            ("Montly hrs", "@average_montly_hours")])
tools_isomap = [hover, "box_zoom", "pan", "wheel_zoom", "reset"]
plot_isomap = figure(plot_width=800, plot_height=600, tools=tools_isomap,
                     title="Isomap projection of employee data")
plot_isomap.scatter("x", "y", size=7, fill_color="colors", line_color=None,
                    fill_alpha=0.6, radius=0.1, alpha=0.5, line_width=0,
                    source=source_dataset)
show(plot_isomap)
You can hover over the points to see the main features.
4.2 Global Radar Chart
data_stay = dataset[dataset["left"] == 0]
data_left = dataset[dataset["left"] == 1]
For practical reasons, I separate the data of employees who stayed and employees who left.
def _scale_data(data, ranges):
    (x1, x2) = ranges[0]
    d = data[0]
    return [(d - y1) / (y2 - y1) * (x2 - x1) + x1 for d, (y1, y2) in zip(data, ranges)]

class RadarChart():
    def __init__(self, fig, variables, ranges, n_ordinate_levels=6):
        angles = np.arange(0, 360, 360. / len(variables))
        axes = [fig.add_axes([0.1, 0.1, 0.8, 0.8], polar=True,
                label="axes{}".format(i)) for i in range(len(variables))]
        _, text = axes[0].set_thetagrids(angles, labels=variables)
        for txt, angle in zip(text, angles):
            txt.set_rotation(angle - 90)
        for ax in axes[1:]:
            ax.patch.set_visible(False)
            ax.xaxis.set_visible(False)
            ax.grid("off")
        for i, ax in enumerate(axes):
            grid = np.linspace(*ranges[i], num=n_ordinate_levels)
            grid_label = [""] + ["{:.1f}".format(x) for x in grid[1:]]
            ax.set_rgrids(grid, labels=grid_label, angle=angles[i])
            ax.set_ylim(*ranges[i])
        self.angle = np.deg2rad(np.r_[angles, angles[0]])
        self.ranges = ranges
        self.ax = axes[0]

    def plot(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.plot(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def fill(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.fill(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def legend(self, *args, **kw):
        self.ax.legend(*args, **kw)
attributes = ["satisfaction_level", "last_evaluation", "number_project",
              "average_montly_hours", "time_spend_company"]
data_stay_mean = data_stay[attributes].mean().values.reshape(1, -1)
data_left_mean = data_left[attributes].mean().values.reshape(1, -1)
datas = np.concatenate((data_stay_mean, data_left_mean), axis=0)
ranges = [[1e-2, dataset[attr].max()] for attr in attributes]
colors = ["green", "red"]
left_types = ["Stayed", "Left"]
fig = plt.figure(figsize=(8, 8))
radar = RadarChart(fig, attributes, ranges)
for data, color, left_type in zip(datas, colors, left_types):
    radar.plot(data, color=color, label=left_type, linewidth=2.0)
    radar.fill(data, alpha=0.2, color=color)
radar.legend(loc=1, fontsize="medium")
plt.title("Stats of employees who stayed and left")
plt.show()
This radar chart doesn't show many differences between employees who left and who stayed. At first glance the main difference seems to be the satisfaction level.
As shown above, employees who left are less happy than the others.
However, this radar chart is built on the mean of each feature, so it could hide sub-distributions in the data.
Let's investigate this further.
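As a first check, we can print the per-group means behind the radar chart (a sketch reusing the attributes list defined above):
# One row of feature means for employees who stayed (0) and who left (1)
print(dataset.groupby("left")[attributes].mean())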
4.3 Left and other features
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 6))
sns.factorplot(y="satisfaction_level", x="left", data=dataset, kind="box", ax=axs[0])
axs[1].hist(data_stay["satisfaction_level"], bins=6, label="Stay", alpha=0.7)
axs[1].hist(data_left["satisfaction_level"], bins=6, label="Left", alpha=0.7)
axs[1].set_xlabel("Satisfaction level")
axs[1].set_ylabel("Count")
axs[1].legend()
plt.tight_layout()
plt.gcf().clear()
The satisfaction level is the feature most correlated with 'left'. Here we can see that employees who left have a lower satisfaction level than those who stayed.
We can also notice three sub-distributions of satisfaction level among the employees who left. Do they correspond to 3 groups (a quick binning check follows the list below)?
One with a low satisfaction level
One with a medium satisfaction level
One with a high satisfaction level
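A rough binning check of this hypothesis (a sketch; the bin edges 0.2 and 0.5 are an illustrative choice, not taken from the analysis above):
# Count employees who left in each satisfaction band
bands = pd.cut(data_left["satisfaction_level"], bins=[0, 0.2, 0.5, 1.0])
print(bands.value_counts().sort_index())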
salary_counts = (dataset.groupby(["left"])["salary"]
                 .value_counts(normalize=True)
                 .rename("percentage")
                 .reset_index())
p = sns.barplot(x="salary", y="percentage", hue="left", data=salary_counts)
p.set_ylabel("Percentage")
p = p.set_xticklabels(["Low", "Medium", "High"])
Let's investigate the salary of employees who left/stayed.
Here I show the percentage of employees with a low/medium/high salary in the two categories.
Employees who left have lower salaries than the others.
Is that the reason why employees left?
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(17, 6))
sns.factorplot(y="number_project", x="left", data=dataset, kind="bar", ax=axs[0])
axs[1].hist(data_stay["number_project"], bins=6, label="Stay", alpha=0.7)
axs[1].hist(data_left["number_project"], bins=6, label="Left", alpha=0.7)
axs[1].set_xlabel("Number of projects")
axs[1].set_ylabel("Count")
axs[1].legend()
ax = sns.kdeplot(data=data_stay["satisfaction_level"], color="b", shade=True, ax=axs[2])
ax = sns.kdeplot(data=data_left["satisfaction_level"], color="g", shade=True, ax=axs[2])
ax.legend(["Stay", "Left"])
ax.set_xlabel("Satisfaction level")
ax.set_ylabel("Density")
plt.tight_layout()
plt.gcf().clear()
Let's now see whether they have more work than the others.
Employees who left and employees who stayed have a similar average number of projects.
However, when we look in detail, there are two sub-populations among the employees who left: those with few projects and those with many projects (the counts below make this concrete).
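A cross-tabulation makes these sub-populations concrete (sketch):
# Distribution of project counts for stayers (0) vs leavers (1);
# two separate peaks in the leavers' row would match the two sub-populations
print(dataset.groupby("left")["number_project"].value_counts().unstack(fill_value=0))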
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(17, 6))
sns.factorplot(y="last_evaluation", x="left", data=dataset, kind="bar", ax=axs[0])
axs[1].hist(data_stay["last_evaluation"], bins=6, label="Stay", alpha=0.7)
axs[1].hist(data_left["last_evaluation"], bins=6, label="Left", alpha=0.7)
axs[1].set_xlabel("Last evaluation")
axs[1].set_ylabel("Count")
axs[1].legend()
ax = sns.kdeplot(data=data_stay["last_evaluation"], color="b", shade=True, ax=axs[2])
ax = sns.kdeplot(data=data_left["last_evaluation"], color="g", shade=True, ax=axs[2])
ax.legend(["Stay", "Left"])
ax.set_xlabel("last_evaluation")
ax.set_ylabel("Density")
plt.tight_layout()
plt.gcf().clear()
When we look at the last evaluation, we still have two sub-populations of employees who left: those with a medium score and those with a high score. That's very interesting!
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 6))
sns.factorplot(y="average_montly_hours", x="left", data=dataset, kind="box", ax=axs[0])
axs[1].hist(data_stay["average_montly_hours"], bins=6, label="Stay", alpha=0.7)
axs[1].hist(data_left["average_montly_hours"], bins=6, label="Left", alpha=0.7)
axs[1].set_xlabel("Average Montly Hours")
axs[1].set_ylabel("Count")
axs[1].legend()
plt.tight_layout()
plt.gcf().clear()
Similarly to the evaluation score and the number of projects, there are two sub-populations of employees who left: those who work less and those who work a lot.
Since the evaluation score, the number of projects and the average monthly hours are all correlated with each other, we can make the hypothesis that there are two groups of employees who leave: those who work less, perhaps because they got lower scores, and those who work a lot (the correlations below support this).
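A small check of that correlation claim, restricted to the leavers (sketch):
# Pairwise correlations of the workload-related features among employees who left
workload = ["last_evaluation", "number_project", "average_montly_hours"]
print(data_left[workload].corr())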
g = sns.pairplot(dataset.drop(labels=["promotion_last_5years", "Work_accident", "salary"], axis=1),
                 hue="left", plot_kws=dict(alpha=0.1))
handles = g._legend_data.values()
labels = g._legend_data.keys()
g.fig.legend(handles=handles, labels=labels, loc="lower center", ncol=3)
The pairplot shows very interesting patterns when we plot the average monthly hours against the satisfaction level, or the satisfaction level against the evaluation score.
It looks like we still have 2 or 3 kinds of employees who left.
Let's analyse these groups in detail.
# Deeper in the analysis
g = sns.FacetGrid(dataset, col="left", hue="left", size=5, aspect=1.2)
g.map(plt.scatter, "satisfaction_level", "last_evaluation", alpha=0.15)
g.add_legend()
g = sns.FacetGrid(dataset, col="left", size=5, aspect=1.2)
g.map(sns.kdeplot, "satisfaction_level", "last_evaluation", shade=True, shade_lowest=False)
g.add_legend()
We have three groups of employees who left:
- Successful but unhappy employees (top left)
- Successful and happy employees (top right)
- Unsuccessful and unhappy employees (bottom center)
Now we want to label the data with these three groups.
4.4 Clustering analysis
# Let's compare the 3 identified groups
kmeans = KMeans(n_clusters=3, random_state=2)
kmeans.fit(data_left[["satisfaction_level", "last_evaluation"]])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=2, tol=0.0001, verbose=0)
I performed a k-means clustering to isolate these three groups.
kmeans_colors = ["red" if c == 0 else "orange" if c == 2 else "blue" for c in kmeans.labels_]
fig = plt.figure(figsize=(10, 7))
plt.scatter(x="satisfaction_level", y="last_evaluation", data=data_left,
            alpha=0.25, color=kmeans_colors)
plt.xlabel("Satisfaction level")
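To attach meaning to the cluster colors, we can inspect the centroids (a sketch; the label order depends on the k-means initialization, so the color-to-group mapping may vary):
# Centroid coordinates in (satisfaction_level, last_evaluation) space
centers = pd.DataFrame(kmeans.cluster_centers_,
                       columns=["satisfaction_level", "last_evaluation"])
print(centers)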