KaggleSalary_DataSet
Creating the Dataset
Reading Input Data
import pandas as pd
from google.colab import files
uploaded = files.upload()
Saving kaggle_survey_2021_responses.csv to kaggle_survey_2021_responses (1).csv
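# Note: the upload above was saved as "kaggle_survey_2021_responses (1).csv",
# so this reads the earlier copy that already exists under the original name.
Salaries = pd.read_csv("kaggle_survey_2021_responses.csv", low_memory=False)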
Salaries.head()
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 Q7_Part_4 Q7_Part_5 Q7_Part_6 Q7_Part_7 Q7_Part_8 Q7_Part_9 Q7_Part_10 Q7_Part_11 Q7_Part_12 Q7_OTHER Q8 Q9_Part_1 Q9_Part_2 Q9_Part_3 Q9_Part_4 Q9_Part_5 Q9_Part_6 Q9_Part_7 Q9_Part_8 Q9_Part_9 Q9_Part_10 Q9_Part_11 Q9_Part_12 Q9_OTHER Q10_Part_1 Q10_Part_2 Q10_Part_3 Q10_Part_4 Q10_Part_5 Q10_Part_6 … Q34_B_Part_6 Q34_B_Part_7 Q34_B_Part_8 Q34_B_Part_9 Q34_B_Part_10 Q34_B_Part_11 Q34_B_Part_12 Q34_B_Part_13 Q34_B_Part_14 Q34_B_Part_15 Q34_B_Part_16 Q34_B_OTHER Q36_B_Part_1 Q36_B_Part_2 Q36_B_Part_3 Q36_B_Part_4 Q36_B_Part_5 Q36_B_Part_6 Q36_B_Part_7 Q36_B_OTHER Q37_B_Part_1 Q37_B_Part_2 Q37_B_Part_3 Q37_B_Part_4 Q37_B_Part_5 Q37_B_Part_6 Q37_B_Part_7 Q37_B_OTHER Q38_B_Part_1 Q38_B_Part_2 Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
0 Duration (in seconds) What is your age (# years)? What is your gender? – Selected Choice In which country do you currently reside? What is the highest level of formal education … Select the title most similar to your current … For how many years have you been writing code … What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming language would you recommend … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… … Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business 
intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor…
1 910 50-54 Man India Bachelor’s degree Other 5-10 years Python R NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Python NaN NaN NaN NaN NaN NaN NaN NaN Vim / Emacs NaN NaN NaN NaN NaN Colab Notebooks NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 784 50-54 Man Indonesia Master’s degree Program/Project Manager 20+ years NaN NaN SQL C C++ Java NaN NaN NaN NaN NaN NaN NaN Python NaN NaN NaN NaN NaN NaN Notepad++ NaN NaN NaN Jupyter Notebook NaN NaN Kaggle Notebooks Colab Notebooks NaN NaN NaN NaN … NaN NaN Qlik NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Automated model selection (e.g. auto-sklearn, … NaN NaN NaN NaN NaN Google Cloud AutoML NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None NaN
3 924 22-24 Man Pakistan Master’s degree Software Engineer 1-3 years Python NaN NaN NaN C++ Java NaN NaN NaN NaN NaN NaN NaN Python NaN NaN NaN NaN PyCharm NaN NaN NaN NaN NaN Jupyter Notebook NaN Other Kaggle Notebooks NaN NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Automated model selection (e.g. auto-sklearn, … NaN NaN NaN NaN NaN NaN NaN NaN DataRobot AutoML NaN NaN NaN NaN NaN NaN NaN NaN TensorBoard NaN NaN NaN NaN NaN NaN NaN
4 575 45-49 Man Mexico Doctoral degree Research Scientist 20+ years Python NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Python NaN NaN NaN NaN NaN Spyder NaN NaN NaN NaN Jupyter Notebook NaN NaN NaN Colab Notebooks NaN NaN NaN NaN … NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None NaN NaN NaN NaN NaN NaN NaN None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None NaN
5 rows × 369 columns
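The preview shows that row 0 holds the question text rather than a response, which is why later cells index with `.loc[1:]`. As an alternative, the following is a minimal sketch of splitting that row off into a lookup table. It uses a small toy frame in place of the survey file, and the names `raw`, `questions`, and `responses` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 carries the question text,
# the remaining rows carry responses (values are illustrative only).
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "50-54", "22-24"],
    "Q25": ["What is your current yearly compensation (approximate $USD)?",
            "25,000-29,999", None],
})

# Keep the question wording as a lookup from column code to text.
questions = raw.iloc[0].to_dict()

# Responses only, with the index restarted at 0.
responses = raw.iloc[1:].reset_index(drop=True)

print(questions["Q1"])
print(responses.shape)
```

With the question row removed, a mapping over the whole column would no longer need the `.loc[1:]` workaround, at the cost of diverging from the row numbering used in the rest of this notebook.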
Salaries.shape
(25974, 369)
# counting the number of null values in the target column
Salaries['Q25'].isna().sum()
Salaries.Q25.unique()
array(['What is your current yearly compensation (approximate $USD)?',
       '25,000-29,999', '60,000-69,999', '$0-999', '30,000-39,999', nan,
       '15,000-19,999', '70,000-79,999', '2,000-2,999', '10,000-14,999',
       '5,000-7,499', '20,000-24,999', '1,000-1,999', '100,000-124,999',
       '7,500-9,999', '4,000-4,999', '40,000-49,999', '50,000-59,999',
       '3,000-3,999', '300,000-499,999', '200,000-249,999',
       '125,000-149,999', '250,000-299,999', '80,000-89,999',
       '90,000-99,999', '150,000-199,999', '>$1,000,000',
       '$500,000-999,999'], dtype=object)
Dropping rows with a missing target variable (Q25)
Salaries.dropna(subset=['Q25'], inplace=True)
Salaries.shape
(15392, 369)
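Comparing the two shapes, 25974 - 15392 = 10582 rows had no value in Q25 and were dropped. The sketch below shows the same relationship on a toy frame: the number of rows removed by `dropna(subset=...)` equals the null count in that column. The frame `toy` and the names `n_missing` and `kept` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two missing targets (values are illustrative only).
toy = pd.DataFrame({
    "Q1": ["22-24", "30-34", "45-49", "50-54"],
    "Q25": ["$0-999", None, "60,000-69,999", None],
})

# Count the nulls in the target column before dropping anything.
n_missing = int(toy["Q25"].isna().sum())

# Drop only the rows whose target is missing; other columns may still hold NaN.
kept = toy.dropna(subset=["Q25"])

# The number of rows removed matches the null count.
print(n_missing, len(toy) - len(kept))
```

Note that `dropna` keeps the original index labels, so the surviving rows are not renumbered unless `reset_index(drop=True)` is called afterwards.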
Salaries.Q25.unique()
array(['What is your current yearly compensation (approximate $USD)?',
       '25,000-29,999', '60,000-69,999', '$0-999', '30,000-39,999',
       '15,000-19,999', '70,000-79,999', '2,000-2,999', '10,000-14,999',
       '5,000-7,499', '20,000-24,999', '1,000-1,999', '100,000-124,999',
       '7,500-9,999', '4,000-4,999', '40,000-49,999', '50,000-59,999',
       '3,000-3,999', '300,000-499,999', '200,000-249,999',
       '125,000-149,999', '250,000-299,999', '80,000-89,999',
       '90,000-99,999', '150,000-199,999', '>$1,000,000',
       '$500,000-999,999'], dtype=object)
Combining the salary buckets and label encoding them
# Map each raw survey range to a coarser combined bucket
salary_buckets = {
    '$0-999': '0-9,999',
    '1,000-1,999': '0-9,999',
    '2,000-2,999': '0-9,999',
    '3,000-3,999': '0-9,999',
    '4,000-4,999': '0-9,999',
    '5,000-7,499': '0-9,999',
    '7,500-9,999': '0-9,999',
    '10,000-14,999': '10,000-19,999',
    '15,000-19,999': '10,000-19,999',
    '20,000-24,999': '20,000-29,999',
    '25,000-29,999': '20,000-29,999',
    '30,000-39,999': '30,000-39,999',
    '40,000-49,999': '40,000-49,999',
    '50,000-59,999': '50,000-59,999',
    '60,000-69,999': '60,000-69,999',
    '70,000-79,999': '70,000-79,999',
    '80,000-89,999': '80,000-89,999',
    '90,000-99,999': '90,000-99,999',
    '100,000-124,999': '100,000-124,999',
    '125,000-149,999': '125,000-149,999',
    '150,000-199,999': '150,000-199,999',
    '200,000-249,999': '200,000-299,999',
    '250,000-299,999': '200,000-299,999',
    '300,000-499,999': '>300,000',
    '$500,000-999,999': '>300,000',
    '>$1,000,000': '>300,000',
}
# Map each raw survey range to the integer code of its combined bucket
salary_encode = {
    '$0-999': 0,
    '1,000-1,999': 0,
    '2,000-2,999': 0,
    '3,000-3,999': 0,
    '4,000-4,999': 0,
    '5,000-7,499': 0,
    '7,500-9,999': 0,
    '10,000-14,999': 1,
    '15,000-19,999': 1,
    '20,000-24,999': 2,
    '25,000-29,999': 2,
    '30,000-39,999': 3,
    '40,000-49,999': 4,
    '50,000-59,999': 5,
    '60,000-69,999': 6,
    '70,000-79,999': 7,
    '80,000-89,999': 8,
    '90,000-99,999': 9,
    '100,000-124,999': 10,
    '125,000-149,999': 11,
    '150,000-199,999': 12,
    '200,000-249,999': 13,
    '250,000-299,999': 13,
    '300,000-499,999': 14,
    '$500,000-999,999': 14,
    '>$1,000,000': 14,
}
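The two dictionaries above have to be kept in sync by hand. One possible alternative, sketched here under the assumption that the combined bucket labels are listed in ascending order, derives the integer codes from the bucket labels instead. The names `bucket_order`, `bucket_code`, `salary_buckets_subset`, and `derived_encode` are illustrative and not part of the notebook, and only a few raw ranges are included for brevity.

```python
# Combined bucket labels in ascending order; the position is the code.
bucket_order = [
    "0-9,999", "10,000-19,999", "20,000-29,999", "30,000-39,999",
    "40,000-49,999", "50,000-59,999", "60,000-69,999", "70,000-79,999",
    "80,000-89,999", "90,000-99,999", "100,000-124,999", "125,000-149,999",
    "150,000-199,999", "200,000-299,999", ">300,000",
]
bucket_code = {label: code for code, label in enumerate(bucket_order)}

# A few entries from the raw-range-to-bucket mapping (subset for illustration).
salary_buckets_subset = {
    "$0-999": "0-9,999",
    "7,500-9,999": "0-9,999",
    "10,000-14,999": "10,000-19,999",
    "200,000-249,999": "200,000-299,999",
    ">$1,000,000": ">300,000",
}

# Derive the raw-range-to-code mapping from the bucket mapping.
derived_encode = {
    raw_range: bucket_code[bucket]
    for raw_range, bucket in salary_buckets_subset.items()
}
print(derived_encode)
```

A lookup of a bucket label that is missing from `bucket_order` raises a `KeyError`, which surfaces a mismatch between the two mappings early.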
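# Label-encode the target variable. Row 0 holds the question text, so only rows 1 onward are mapped.
Salaries.loc[1:, 'Q25_Encoded'] = Salaries.loc[1:, 'Q25'].map(salary_encode)
# Row 0 stays NaN, so the column remains float64 even after this cast.
Salaries.loc[1:, 'Q25_Encoded'] = Salaries.loc[1:, 'Q25_Encoded'].astype(int)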
Salaries.Q25_Encoded.unique()
array([nan, 2., 6., 0., 3., 1., 7., 10., 4., 5., 14., 13., 11.,
8., 9., 12.])
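The codes print as floats because the question-text row has no code, and a NumPy integer column cannot hold NaN. If integer codes are wanted, one option is the pandas nullable integer dtype `"Int64"`. This is a minimal sketch on a toy series, not the notebook's approach; `q25`, `encode`, and `encoded` are illustrative names.

```python
import pandas as pd

# Toy column: the first entry mimics the question-text row, which has no code.
q25 = pd.Series(["What is your current yearly compensation?",
                 "25,000-29,999", "$0-999"])
encode = {"25,000-29,999": 2, "$0-999": 0}

# map() yields float64 with NaN for the unmapped row;
# "Int64" (capital I) stores integers alongside a missing-value marker.
encoded = q25.map(encode).astype("Int64")

print(encoded.dtype)
print(encoded.tolist())
```

The trade-off is that downstream code has to accept `pd.NA` in place of `NaN` for the missing entry.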
Salaries[Salaries['Q25_Encoded'].isna()]
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 Q7_Part_4 Q7_Part_5 Q7_Part_6 Q7_Part_7 Q7_Part_8 Q7_Part_9 Q7_Part_10 Q7_Part_11 Q7_Part_12 Q7_OTHER Q8 Q9_Part_1 Q9_Part_2 Q9_Part_3 Q9_Part_4 Q9_Part_5 Q9_Part_6 Q9_Part_7 Q9_Part_8 Q9_Part_9 Q9_Part_10 Q9_Part_11 Q9_Part_12 Q9_OTHER Q10_Part_1 Q10_Part_2 Q10_Part_3 Q10_Part_4 Q10_Part_5 Q10_Part_6 … Q34_B_Part_7 Q34_B_Part_8 Q34_B_Part_9 Q34_B_Part_10 Q34_B_Part_11 Q34_B_Part_12 Q34_B_Part_13 Q34_B_Part_14 Q34_B_Part_15 Q34_B_Part_16 Q34_B_OTHER Q36_B_Part_1 Q36_B_Part_2 Q36_B_Part_3 Q36_B_Part_4 Q36_B_Part_5 Q36_B_Part_6 Q36_B_Part_7 Q36_B_OTHER Q37_B_Part_1 Q37_B_Part_2 Q37_B_Part_3 Q37_B_Part_4 Q37_B_Part_5 Q37_B_Part_6 Q37_B_Part_7 Q37_B_OTHER Q38_B_Part_1 Q38_B_Part_2 Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER Q25_Encoded
0 Duration (in seconds) What is your age (# years)? What is your gender? – Selected Choice In which country do you currently reside? What is the highest level of formal education … Select the title most similar to your current … For how many years have you been writing code … What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming languages do you use on a reg… What programming language would you recommend … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following integrated development … Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… Which of the following hosted notebook product… … Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business 
intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which of the following business intelligence t… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which categories of automated machine learning… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… Which specific automated machine learning tool… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… In the next 2 years, do you hope to become mor… NaN
1 rows × 370 columns
# Combine the salary buckets (row 0 holds the question text and is skipped)
Salaries.loc[1:, 'Q25_buckets'] = Salaries.loc[1:, 'Q25'].map(salary_buckets)
Salaries.Q25_buckets.unique()
array([nan, '20,000-29,999', '60,000-69,999', '0-9,999', '30,000-39,999',
       '10,000-19,999', '70,000-79,999', '100,000-124,999',
       '40,000-49,999', '50,000-59,999', '>300,000', '200,000-299,999',
       '125,000-149,999', '80,000-89,999', '90,000-99,999',
       '150,000-199,999'], dtype=object)
Salaries.shape
(15392, 371)
Salaries['Q25_Encoded'].value_counts(normalize=True)
0.0 0.454811
1.0 0.098954
2.0 0.068676
3.0 0.048145
10.0 0.047105
5.0 0.045286
4.0 0.044701
6.0 0.035800
7.0 0.030147
12.0 0.025469
8.0 0.025404
11.0 0.024625
9.0 0.022741
13.0 0.016373
14.0 0.011760
Name: Q25_Encoded, dtype: float64
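The output above is ordered by frequency, and class 0 alone accounts for about 45% of the labelled rows. To read the distribution in salary order instead, the counts can be sorted by their index. The following is a small sketch on toy codes; `codes` and `share` are illustrative names and the values are made up.

```python
import pandas as pd

# Toy encoded target (values are illustrative only).
codes = pd.Series([0, 0, 0, 1, 2, 2, 10], name="Q25_Encoded")

# Proportion per class, ordered by class code rather than by frequency.
share = codes.value_counts(normalize=True).sort_index()

print(share)
```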
Salaries.to_csv("clean_kaggle_data_2021.csv", index=False)
files.download('clean_kaggle_data_2021.csv')