Dataset Details
The data set was created from a Kaggle survey that examined the state of data science and machine learning through the views of more than 16,000 individuals from over 171 countries. The question bank consisted of approximately 200 questions. Some questions were asked of all individuals, while others were asked only of particular groups. Individuals were grouped into ‘learners’, ‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’ based on their answers about their current employment status, whether they code for their job, whether they are learning to code, and whether they are looking to switch careers.
Objective
The objective of this project is to gain further knowledge about the data science environment in the United States.
Separate the Dataset Into Its Respective Groups
The data set includes survey responses from individuals across 171 countries. This project focuses on individuals in the United States. As mentioned previously, the questions asked of each individual were determined by how they answered questions about their employment, whether their job requires coding, whether they are thinking of switching careers, and whether they are currently learning how to code. The schema file is a CSV file with columns labeled ‘Column’, ‘Question’, and ‘Asked’, which correspond to the column label in the full data set, the question asked, and who was asked, respectively. The data set was split into these groups (‘learners’, ‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’), and each subset includes only the questions (columns) that the group was asked.
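The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the project: only the header names ‘Column’, ‘Question’, and ‘Asked’ come from the text, while the values in the ‘Asked’ column, the `Country` and `Group` fields, and the helper names `columns_for_group` and `subset_for_group` are hypothetical.

```python
import csv
import io

# Hypothetical miniature schema. Only the three header names come from the
# text; the 'Asked' values and most question wordings are made up.
SCHEMA_CSV = """Column,Question,Asked
Age,What is your age?,All
Country,Which country do you live in?,All
CurrentJobTitleSelect,Select the option most similar to your current job title.,CodingWorker
LearningPlatformSelect,Which platform do you learn from?,Learners
"""


def columns_for_group(schema_rows, group, everyone="All"):
    """Return the columns asked of `group` or of every respondent."""
    return [row["Column"] for row in schema_rows
            if row["Asked"] in (everyone, group)]


def subset_for_group(responses, schema_rows, group, country="United States"):
    """Keep respondents of `group` in `country`, with only that group's columns."""
    keep = columns_for_group(schema_rows, group)
    return [
        {col: person.get(col, "") for col in keep}
        for person in responses
        if person.get("Country") == country and person.get("Group") == group
    ]


schema_rows = list(csv.DictReader(io.StringIO(SCHEMA_CSV)))

# Toy responses; the 'Group' field is an assumed, precomputed label.
responses = [
    {"Group": "CodingWorker", "Country": "United States", "Age": "34",
     "CurrentJobTitleSelect": "Data Scientist"},
    {"Group": "Learners", "Country": "United States", "Age": "22",
     "LearningPlatformSelect": "Kaggle"},
    {"Group": "CodingWorker", "Country": "Canada", "Age": "41",
     "CurrentJobTitleSelect": "Statistician"},
]

us_coding_workers = subset_for_group(responses, schema_rows, "CodingWorker")
print(us_coding_workers)
```

Filtering on the schema rather than hard-coding column names keeps each group's subset in step with the schema file if questions are added or reassigned.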
Examine the Age Distribution in Each Group
A question that was asked of every respondent was “What is your age?”. Below are a box plot and histograms of the age distribution for each group.
[Figure: Age Distribution of the Groups. Box plots of age (axis 0 to 100) and one age histogram per group: coding_worker, non_switcher, non_worker, worker, and learner.]
Central Limit Theorem
The central limit theorem states that the distribution of sample means, computed from independent random samples of the same size, approaches a normal distribution as the sample size grows, even if the original population is not normally distributed (provided the population variance is finite). This is important because many statistical procedures assume normality. As a result, techniques that assume normality can be applied to sample means even when the population is not normal. The age attribute in this data set can be used to show the applicability of the central limit theorem. As displayed in the box plot and histograms above, the age distributions of all groups have a positive (right) skew. Since all of the distributions are right-skewed, the coding workers are used as an example. The histograms below show that the sample means of 1000 random samples of size 10, 20, 30, and 40 are approximately normally distributed, with a spread that shrinks as the sample size increases.
## population mean: 35.68992
## sample size: 10 mean: 35.5954 sd: 3.552371
## sample size: 20 mean: 35.7667 sd: 2.554598
## sample size: 30 mean: 35.6469 sd: 2.034189
## sample size: 40 mean: 35.72738 sd: 1.776585
[Figure: histograms of the sample means of 1000 random samples, one panel per sample size (10, 20, 30, and 40).]
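The experiment above can be reproduced in outline with the Python standard library. The sketch below is an illustration only: it uses a synthetic right-skewed population (a shifted gamma distribution, which is an assumption and not the survey data), and `sample_mean_summary` is a hypothetical helper name. It also prints the theoretical standard deviation of the sample mean, the population standard deviation divided by the square root of the sample size, for comparison.

```python
import random
import statistics


def sample_mean_summary(population, sample_size, n_samples=1000, seed=0):
    """Draw n_samples random samples and summarize their means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, sample_size))
             for _ in range(n_samples)]
    return statistics.fmean(means), statistics.stdev(means)


# Synthetic right-skewed "ages" (an assumption; not the Kaggle data).
_rng = random.Random(42)
population = [18 + _rng.gammavariate(2.0, 9.0) for _ in range(5000)]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"population mean: {pop_mean:.2f}")

for n in (10, 20, 30, 40):
    mean_of_means, sd_of_means = sample_mean_summary(population, n)
    # Theory: sd of the sample mean is roughly pop_sd / sqrt(n).
    print(f"sample size: {n} mean: {mean_of_means:.2f} "
          f"sd: {sd_of_means:.2f} (theory: {pop_sd / n ** 0.5:.2f})")
```

The reported output follows the same pattern: quadrupling the sample size from 10 to 40 roughly halves the standard deviation of the sample means (3.55 versus 1.78).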
Sampling of Coding Workers via Simple Random Sampling Without Replacement, Systematic Sampling, and Stratified Sampling
Sampling is a technique for selecting a representative portion of a population on which to perform a study. There are many sampling techniques, including simple random sampling, systematic sampling, and stratified sampling. Simple random sampling is a basic technique in which individual subjects are selected at random from a larger group, so that every individual has the same chance of being picked. Systematic sampling selects individuals at a fixed periodic interval. The interval is calculated by dividing the population size by the desired sample size, and the first individual is chosen randomly within the first interval. Lastly, stratified sampling takes into account the heterogeneity in a population. The population is subdivided into subpopulations (strata), and the same percentage of individuals is selected from each subpopulation to make up the sample. When the sample mean is approximately normally distributed, it can be used as an estimate of the population mean. Given a certain confidence level, a confidence interval is defined. The confidence interval is a range of values constructed so that, over repeated sampling, it contains the population mean with the given confidence level.
For this project the coding worker population will be analyzed. Simple random sampling without replacement (SRSWOR), systematic sampling, and stratified sampling are used as the sampling methods.
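The three sampling schemes described above can be sketched as follows. This is a minimal illustration on a synthetic population, not the project's own code: the record fields (`id`, `title`, `age`) and the function names are hypothetical, and the stratification key here is the job title alone.

```python
import random


def srswor(population, n, rng):
    """Simple random sample without replacement."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Every k-th item after a random start within the first interval."""
    k = len(population) // n      # interval = population size / sample size
    start = rng.randrange(k)      # random start in the first interval
    return [population[start + i * k] for i in range(n)]


def stratified_sample(population, n, key, rng):
    """Draw the same fraction from every stratum defined by `key`."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    fraction = n / len(population)
    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from n.
        take = round(fraction * len(members))
        sample.extend(rng.sample(members, take))
    return sample


# Synthetic population: 200 records, one quarter of them data scientists.
population = [
    {"id": i,
     "title": "Data Scientist" if i % 4 == 0 else "Other",
     "age": 20 + i % 40}
    for i in range(200)
]

rng = random.Random(1)
simple = srswor(population, 20, rng)
systematic = systematic_sample(population, 20, rng)
stratified = stratified_sample(population, 20, lambda p: p["title"], rng)
print(len(simple), len(systematic), len(stratified))
```

Passing the key as a function makes it easy to stratify on a combination of attributes, for example a tuple of job title and gender.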
## All Coding Worker: mean = 35.7 and sd = 0.21
## 80% Conf Level (alpha = 0.20), CI = 35.43 - 35.97
## 90% Conf Level (alpha = 0.10), CI = 35.35 - 36.05
## SRSWOR: mean = 35.38 and sd = 0.36
## 80% Conf Level (alpha = 0.20), CI = 34.92 - 35.84
## 90% Conf Level (alpha = 0.10), CI = 34.79 - 35.97
## systematic sampling: mean = 35.5 and sd = 0.37
## 80% Conf Level (alpha = 0.20), CI = 35.03 - 35.97
## 90% Conf Level (alpha = 0.10), CI = 34.89 - 36.11
## stratified sampling: mean = 35.37 and sd = 0.34
## 80% Conf Level (alpha = 0.20), CI = 34.93 - 35.81
## 90% Conf Level (alpha = 0.10), CI = 34.81 - 35.93
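The intervals above are consistent with the usual normal-approximation formula, mean ± z × SE, where z is the standard normal quantile at 1 − alpha/2. The values labeled "sd" in the output appear to play the role of the standard error of the mean: for example, 35.7 ± 1.2816 × 0.21 gives 35.43 to 35.97. The sketch below checks this with the Python standard library; the function names are hypothetical, and this is not necessarily how the original intervals were computed.

```python
import math
import statistics


def interval_from_summary(mean, se, confidence):
    """Normal-approximation confidence interval from a mean and its standard error."""
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean - z * se, mean + z * se


def mean_confidence_interval(values, confidence):
    """Same interval, computed from raw values (needs at least two values)."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return interval_from_summary(mean, se, confidence)


# Summary figures taken from the "All Coding Worker" output lines.
for confidence in (0.80, 0.90):
    low, high = interval_from_summary(35.7, 0.21, confidence)
    print(f"{confidence:.0%} Conf Level, CI = {low:.2f} - {high:.2f}")
```

Raising the confidence level widens the interval, which matches the pattern in every pair of output lines.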
[Figure: age histograms for All Coding Worker, SRSWOR, Systematic Sampling, and Stratified Sampling by Job Title and Gender (age axis 0 to 100).]
General Information of Coding Worker
Sampling makes it possible to analyze a representative portion of a population without evaluating the whole population. In many circumstances it is impossible to obtain data on the whole population, and sampling is then the only practical option. Reducing the number of subjects also lowers the computational cost of the analysis. The data set used for this project can itself be seen as a sample of the whole coding worker population. However, this sample may be skewed towards certain coding workers, since the data was obtained only through the Kaggle website. Because the data set has a manageable number of tuples, the whole coding worker data set, rather than a smaller sample of it, is used to analyze coding workers.
The background information of the coding workers is presented to give a general idea of what kinds of people took this survey. Of particular interest are the following questions:
1. Select the option that’s most similar to your current job/professional title.
2. Which level of formal education have you attained?
3. What programming language would you recommend a new data scientist learn first?
Select the option that’s most similar to your current job/professional title.
There were many job titles under the umbrella of ‘coding worker’. It is wise to know the distribution of the different types of professions to examine if there is a group that dominates the survey. Specifically, if there is a profession that dominates, then the answers for all the successive questions would be biased towards the mentality of that group.
[Figure: pie chart of CurrentJobTitleSelect. Data Scientist 27.6%, Software Developer/Software Engineer 11.5%, Data Analyst 10.7%, Other 9.97%, Scientist/Researcher 9.5%, Researcher 5.98%, Business Analyst 4.66%, Engineer 4.51%, Machine Learning Engineer 4.44%, Statistician 2.86%, Computer Scientist 2.31%, Predictive Modeler 1.91%, DBA/Database Engineer 1.43%, Programmer 1.28%, Operations Research Practitioner 0.88%, Data Miner 0.44%.]
Level of Formal Education of All Coding Workers
Many people apply for jobs every day. It would be ideal to interview every applicant to increase the chances of finding the best one, but this approach is not feasible. Scanning resumes for keywords that pertain to the job at hand is a good way of sifting out those who may not have the technical expertise for the position. After passing this first step of screening, what distinguishes one applicant from another? Does having more degrees boost the chances of getting a job? Although examining the educational background of these professionals can only show a correlation between degrees and certain jobs, it is still interesting to see what their educational backgrounds are.
[Figure: pie chart of FormalEducation. Master’s degree 44.9%, Bachelor’s degree 24.9%, Doctoral degree 24.3%, Some college/university study without earning a bachelor’s degree 3.83%, Professional degree 1.29%, I did not complete any formal education past high school 0.479%, I prefer not to answer 0.295%.]
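Percentage breakdowns like the ones shown for job title and formal education can be computed by counting the non-empty answers in a column. The following is a minimal sketch with toy answers; `percent_distribution` is a hypothetical helper, and treating empty strings and `None` as "not answered" is an assumption about how missing responses are encoded.

```python
from collections import Counter


def percent_distribution(answers):
    """Percentage of each answer among respondents who answered."""
    answered = [a for a in answers if a]   # drop empty / missing answers
    counts = Counter(answered)
    total = len(answered)
    return {label: 100 * count / total
            for label, count in counts.most_common()}


# Toy column of answers (not the survey data).
education = ["Master's degree", "Bachelor's degree", "Master's degree",
             "", "Doctoral degree", "Master's degree", None]

for label, pct in percent_distribution(education).items():
    print(f"{label}: {pct:.1f}%")
```

Because missing answers are dropped before counting, the percentages always sum to 100 over the people who actually answered, which is worth keeping in mind when the number of answers is small.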
Formal Education and Specified Profession
The above pie chart shows the formal education of all the coding workers. However, it is more informative to show the formal education distribution of each profession. Alternatively, it is also interesting to examine the profession of individuals with a certain formal education.
[Figure: bar chart of CurrentJobTitleSelect, with bars split by formal education level (Master’s degree, Bachelor’s degree, Doctoral degree, Professional degree, Some college/university study without earning a bachelor’s degree, I did not complete any formal education past high school, I prefer not to answer).]
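A breakdown of education within each profession (or of profession within each education level) amounts to a cross-tabulation with row-wise proportions. The sketch below shows one way to build it with the standard library; the record layout and the function name `crosstab_proportions` are assumptions for illustration, and the rows are toy data.

```python
from collections import Counter, defaultdict


def crosstab_proportions(rows, row_key, col_key):
    """For each value of row_key, the share of each value of col_key."""
    counts = defaultdict(Counter)
    for record in rows:
        # Skip records where either answer is missing.
        if record.get(row_key) and record.get(col_key):
            counts[record[row_key]][record[col_key]] += 1
    return {
        row_value: {col_value: n / sum(cols.values())
                    for col_value, n in cols.items()}
        for row_value, cols in counts.items()
    }


# Toy records (not the survey data).
rows = [
    {"CurrentJobTitleSelect": "Data Scientist", "FormalEducation": "Master's degree"},
    {"CurrentJobTitleSelect": "Data Scientist", "FormalEducation": "Doctoral degree"},
    {"CurrentJobTitleSelect": "Data Scientist", "FormalEducation": "Master's degree"},
    {"CurrentJobTitleSelect": "Programmer", "FormalEducation": "Bachelor's degree"},
    {"CurrentJobTitleSelect": "Programmer", "FormalEducation": ""},
]

by_title = crosstab_proportions(rows, "CurrentJobTitleSelect", "FormalEducation")
by_degree = crosstab_proportions(rows, "FormalEducation", "CurrentJobTitleSelect")
print(by_title["Data Scientist"])
print(by_degree["Master's degree"])
```

Swapping the two key arguments gives the alternative view mentioned above, the professions of individuals with a given formal education.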
[Figure: bar chart of FormalEducation, with bars split by job title (CurrentJobTitleSelect).]
A Little Peek at a Data Scientist’s Job Description
Data scientist was deemed the ‘sexiest job of the 21st century’ in the Harvard Business Review in 2012. It is a fairly new and evolving field, and this survey provides some information about the new trends and ‘need to knows’ in the data science field. Some questions of interest include:
1. What language do they recommend new data scientists to learn?
2. Are there any learning platforms they found useful for their career?
3. What sort of tools and algorithms do they use in their job?
4. What are some important skills that a Data Scientist should have?
What Language Do Current Data Scientists Recommend New Data Scientists Learn?
[Figure: pie chart of LanguageRecommendationSelect. Python 68.4%, R 23.6%, SQL 5%, SAS 1.03%, C/C++/C# 0.69%, Other 0.345%, Scala 0.345%, Java 0.172%, Julia 0.172%, Matlab 0.172%.]
Results: Python > R
Interestingly, the majority of data scientists suggest learning Python over R. Several reasons could explain the large gap between Python and R. First, Python may truly be the predominant language in the data science field. Alternatively, there may be a sampling bias. The survey was conducted by Kaggle, which may be visited largely by data scientists who use Python as their main or only language. If so, the results are skewed towards individuals who use Python.
What’s the Best Way of Obtaining “Data Science” Skills?
[Figure: bar chart of the proportion of respondents (0 to 0.8) rating each learning platform as Not Useful, Somewhat useful, or Very useful. Platforms: Arxiv, Blogs, College, Communities, Company, Conferences, Courses, Documentation, Friends, Kaggle, Newsletters, Podcasts, Projects, SO, Textbook, TradeBook, Tutoring, YouTube.]
Is doing Projects the way to go?
What Sort of Tools and Algorithms do Data Scientists Use in their Job?
A very simplified description of a data science role includes the collection, analysis, and interpretation of large amounts of data. Data scientists use many different tools and algorithms to analyze and interpret the data they gather. The charts below show how frequently some common algorithms and tools are used by data scientists.
[Figure: Algorithms. Bar chart of the proportion of respondents using each algorithm Most of the time, Often, Sometimes, or Rarely. Algorithms recovered from the axis: AssociationRules, Bayesian, CNNs, CollaborativeFiltering, DataVisualization, DecisionTrees, EnsembleMethods, EvolutionaryApproaches, GANs, GBM, HMMs, KNN, LiftAnalysis, LogisticRegression, MLN, NaiveBayes, NeuralNetworks, NLP, PCA, PrescriptiveModeling, RandomForests, RecommenderSystems, RNNs, Segmentation.]
[Figure: Tools. Bar chart of the proportion of respondents using each tool Most of the time, Often, Sometimes, or Rarely. Tools on the axis include AWS, Excel, Hadoop, Jupyter, MATLAB, Spark, SQL, Tableau, and TensorFlow.]
What are Some Important Skills that a Data Scientist Should Have?
[Figure: bar chart of the proportion of respondents (0 to 1) rating each skill as Necessary, Nice to have, or Unnecessary. Skills: BigData, Degree, EnterpriseTools, KaggleRanking, MOOC, Python, R, SQL, Stats, Visualizations.]
Observations
All (100 percent) of the data scientists who answered these questions said that ‘BigData’, SQL, and visualization are skills a data scientist must have. However, this seems to contradict what we have seen so far. Percentages can be very deceiving when shown without the underlying counts. The question is: how many people answered these questions? Below is a histogram showing the number of people who actually answered the job skill questions.
[Figure: histogram of the number of respondents (0 to 3) answering Necessary, Nice to have, or Unnecessary for BigData.]
As the histogram above shows, fewer than 5 of the more than 700 data scientists answered these questions. It is not surprising that many people did not answer all the questions, especially given that there were over 200 questions in the question bank. The n for these questions is too low to draw any real conclusions.
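One way to guard against this pitfall is to report the number of answers next to every percentage and to flag questions with too few responses. The sketch below illustrates the idea on toy data; the helper name `share_with_n` and the threshold of 30 answers are arbitrary assumptions, not values taken from the survey.

```python
def share_with_n(answers, target, min_n=30):
    """Share of `target` among answered responses, with the count behind it."""
    answered = [a for a in answers if a]   # drop empty / missing answers
    n = len(answered)
    share = sum(a == target for a in answered) / n if n else None
    return {"n": n, "share": share, "reliable": n >= min_n}


# Toy column: 700 respondents, only 3 of whom answered the question.
big_data_skill = ["Necessary"] * 3 + [""] * 697

print(share_with_n(big_data_skill, "Necessary"))
```

Here the share is 100 percent, but it rests on only 3 answers, so the `reliable` flag is false.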
Conclusion
Below is a word cloud in which the size of a word reflects how frequently it was selected as used “most of the time” or rated “very useful”. Words such as Python, data visualization, R, SQL, and projects are emphasized in this word cloud.
[Figure: word cloud.]