R Language Kaggle Assignment

Dataset Details

The data set was created from a Kaggle survey that examined the state of data science and machine learning through the views of more than 16,000 individuals from 171 countries. The question bank consisted of approximately 200 questions; some were asked of all individuals, while others were asked only of particular groups. Individuals were grouped into ‘learners’, ‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’ based on their answers about their current employment status, whether they code for their job, whether they are learning to code, and whether they are looking to switch careers.

Objective

The objective of this project is to gain further knowledge about the data science environment in the United States.

Separate the Dataset Into Its Respective Groups

The data set includes surveys from individuals across 171 countries. This project focuses on individuals in the United States. As mentioned previously, the questions asked of each individual were determined by how he/she answered questions about employment, whether the job requires coding, whether he/she is thinking of switching careers, and whether he/she is currently learning how to code. The schema file is a csv file containing columns labeled ‘Column’, ‘Question’, and ‘Asked’, which correspond to the column label in the whole data set, the question asked of the individual, and who was asked, respectively. The data set was broken into these groups (‘learners’, ‘non-switcher’, ‘non-worker’, ‘worker’, and ‘coding worker’), and each group's subset includes only the questions (columns) that group was asked.
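The split described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code behind this report. The ‘Asked’ values (`All`, `CodingWorker`, `Learners`), the `Group` field, the `Age` column label, and `ExampleLearnerColumn` are assumptions made for the example; only the three schema column names and the question texts come from the description above.

```python
import csv
import io

# Hypothetical miniature schema with the same three columns as the real file.
# The 'Asked' values and the last column name are assumptions for illustration.
SCHEMA_CSV = """Column,Question,Asked
Age,What is your age?,All
FormalEducation,Which level of formal education have you attained?,All
CurrentJobTitleSelect,Select the option that's most similar to your current job/professional title.,CodingWorker
ExampleLearnerColumn,Placeholder question asked only to learners,Learners
"""

def columns_for_group(schema_rows, group):
    """Column labels asked to `group`, plus those asked to everyone."""
    return [r["Column"] for r in schema_rows if r["Asked"] in ("All", group)]

def subset_for_group(responses, schema_rows, group):
    """Rows belonging to `group`, restricted to the columns that group was asked."""
    cols = columns_for_group(schema_rows, group)
    return [{c: row.get(c) for c in cols}
            for row in responses if row["Group"] == group]

schema_rows = list(csv.DictReader(io.StringIO(SCHEMA_CSV)))

# Hypothetical responses; 'Group' stands in for the label derived from the
# employment, coding, learning, and career-switch answers.
responses = [
    {"Group": "CodingWorker", "Age": "34", "FormalEducation": "Master's degree",
     "CurrentJobTitleSelect": "Data Scientist"},
    {"Group": "Learners", "Age": "22", "FormalEducation": "Bachelor's degree",
     "ExampleLearnerColumn": "yes"},
]

coding_worker = subset_for_group(responses, schema_rows, "CodingWorker")
print(coding_worker)
```

Keeping the column lookup separate from the row filter means the same schema can be reused for every group without repeating the column logic.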

Examine the Age Distribution in Each Group

One question asked of every respondent was “What is your age?”. Below are a box plot and a histogram of the age distribution for each group.

[Figure: Age Distribution of the Groups. Box plots (age axis from 0 to 100) and histograms of age for coding_worker, non_switcher, non_worker, worker, and learner.]

Central Limit Theorem

The central limit theorem states that the distribution of sample means, computed from independent random samples of a sufficiently large size, approaches a normal distribution even if the original population is not normally distributed. This matters because many statistical procedures assume normality; the theorem lets us apply them to sample means even when the population is non-normal. The age attribute in this data set can be used to demonstrate the theorem. As displayed in the box plots and histograms above, the age distribution of every group has a positive (right) skew. Since all of the distributions are right-skewed, the coding workers are used as the example. Below are histograms showing that the sample means of 1000 random samples of size 10, 20, 30, and 40 approximately follow a normal distribution.

 ## population mean:  35.68992
 ## sample size:  10 mean:  35.5954 sd:  3.552371
 ## sample size:  20 mean:  35.7667 sd:  2.554598
 ## sample size:  30 mean:  35.6469 sd:  2.034189
 ## sample size:  40 mean:  35.72738 sd:  1.776585
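In the output above, the standard deviation of the sample means shrinks roughly in proportion to 1/sqrt(n): it is about 3.55 at n = 10 and about 1.78, half as large, at n = 40. The sketch below illustrates the same behaviour in Python on a hypothetical right-skewed population. The gamma-based ages are an assumption chosen for the illustration, not the survey data, and the function name is invented here.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed "age" population (NOT the survey data):
# 18 plus a gamma-distributed number of years.
population = [18 + random.gammavariate(2.0, 9.0) for _ in range(5000)]

def sample_means(pop, size, draws=1000):
    """Means of `draws` simple random samples (without replacement) of `size`."""
    return [statistics.fmean(random.sample(pop, size)) for _ in range(draws)]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"population mean: {pop_mean:.2f}")

results = {}
for n in (10, 20, 30, 40):
    means = sample_means(population, n)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"sample size: {n} mean: {results[n][0]:.2f} "
          f"sd: {results[n][1]:.2f} sigma/sqrt(n): {pop_sd / n ** 0.5:.2f}")
```

The last column prints the theoretical value sigma/sqrt(n), so the simulated spread of the sample means can be compared with it directly.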

[Figure: histograms of the sample means of coding worker age for sample sizes 10, 20, 30, and 40.]

Sampling of Coding Worker via Simple Random Sample Without Replacement, Systematic Sampling, and Stratified Sampling

Sampling is a technique for selecting a representative portion of a population to study. There are many different sampling techniques, including simple random sampling, systematic sampling, and stratified sampling. Simple random sampling is a basic technique in which individual subjects are selected at random from a larger group, so every individual has the same chance of being picked. Systematic sampling selects subjects at a fixed periodic interval. The interval is calculated by dividing the population size by the desired sample size, and the first subject is chosen randomly within the first interval. Lastly, stratified sampling takes into account the heterogeneity in a population: the population is subdivided into subpopulations (strata), and the same percentage of individuals is selected from each to make up the sample.

When the sample mean is approximately normally distributed, it can be used as an estimate of the population mean. Given a confidence level, a confidence interval is defined: a range of values constructed so that it contains the population mean with the given level of confidence.

For this project the coding worker population will be analyzed. Simple random sampling without replacement, systematic sampling, and stratified sampling are used as the sampling methods.
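The three sampling methods can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used to produce the results in this report. The function names, the toy records, and the 10% sampling fraction are hypothetical; the strata (job title and gender) follow the labels used for the stratified sample in this report.

```python
import random

def srswor(population, n, rng):
    """Simple random sample without replacement."""
    return rng.sample(population, n)

def systematic(population, n, rng):
    """Every k-th item after a random start inside the first interval."""
    k = len(population) // n      # interval = population size / sample size
    start = rng.randrange(k)      # random start within the first interval
    return [population[start + i * k] for i in range(n)]

def stratified(population, key, fraction, rng):
    """Draw the same fraction from every stratum defined by `key`."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical toy population (not the survey data).
rng = random.Random(1)
titles = ["Data Scientist", "Data Analyst", "Engineer"]
genders = ["Female", "Male"]
people = [{"age": rng.randint(20, 65),
           "title": rng.choice(titles),
           "gender": rng.choice(genders)} for _ in range(600)]

def stratum(person):
    return (person["title"], person["gender"])

s1 = srswor(people, 60, rng)
s2 = systematic(people, 60, rng)
s3 = stratified(people, stratum, 0.10, rng)
for name, s in (("SRSWOR", s1), ("systematic", s2), ("stratified", s3)):
    mean_age = sum(p["age"] for p in s) / len(s)
    print(f"{name}: n = {len(s)}, mean age = {mean_age:.2f}")
```

Because each stratum size is rounded separately, the stratified sample can differ from the target size by a few records.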


 ## All Coding Worker: mean = 35.7 and sd = 0.21
 ## 80% Conf Level (alpha = 0.20), CI = 35.43 - 35.97
 ## 90% Conf Level (alpha = 0.10), CI = 35.35 - 36.05
 ## SRSWOR: mean = 35.38 and sd = 0.36
 ## 80% Conf Level (alpha = 0.20), CI = 34.92 - 35.84
 ## 90% Conf Level (alpha = 0.10), CI = 34.79 - 35.97
 ## systematic sampling: mean = 35.5 and sd = 0.37
 ## 80% Conf Level (alpha = 0.20), CI = 35.03 - 35.97
 ## 90% Conf Level (alpha = 0.10), CI = 34.89 - 36.11
 ## stratified sampling: mean = 35.37 and sd = 0.34
 ## 80% Conf Level (alpha = 0.20), CI = 34.93 - 35.81
 ## 90% Conf Level (alpha = 0.10), CI = 34.81 - 35.93
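The intervals above are consistent with the normal-approximation formula mean ± z × (standard error), where z is the standard normal quantile at 1 − alpha/2. The value printed as ‘sd’ appears to play the role of the standard error of the mean; that reading is an inference from the numbers, not something the output states. A minimal Python sketch (the function name is hypothetical) that recomputes the ‘All Coding Worker’ intervals from the printed mean of 35.7 and the printed value of 0.21:

```python
from statistics import NormalDist

def confidence_interval(mean, std_error, level):
    """Two-sided normal-approximation interval: mean +/- z * standard error."""
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return mean - z * std_error, mean + z * std_error

# Values copied from the 'All Coding Worker' line of the output above.
for level in (0.80, 0.90):
    low, high = confidence_interval(35.7, 0.21, level)
    print(f"{level:.0%} Conf Level (alpha = {1 - level:.2f}), "
          f"CI = {low:.2f} - {high:.2f}")
```

With these inputs the function returns 35.43 to 35.97 at the 80% level and 35.35 to 36.05 at the 90% level, matching the printed intervals.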

General Information of Coding Worker

Sampling is a great way to analyze a representative portion of a population without needing to evaluate the whole population. In many circumstances it is impossible to obtain data on the whole population, and sampling then becomes very useful. Decreasing the number of subjects for analysis is also beneficial in that it requires less computational power. The data set used for this project can itself be seen as a sample of the whole coding worker population. However, this sample may be skewed towards certain coding workers, since the data was obtained only through the Kaggle website. Given that the data set has a manageable number of tuples, the whole coding worker data set, rather than a smaller sample of it, will be used to analyze coding workers.

The background information of the coding workers is presented to give a general idea of what kinds of people took this survey. Of particular interest are the following questions:
1. Select the option that’s most similar to your current job/professional title.
2. Which level of formal education have you attained?
3. What programming language would you recommend a new data scientist learn first?

[Figure: age distribution (age axis from 0 to 100) of All Coding Worker compared with the SRSWOR, Systematic Sampling, and Stratified Sampling by Job Title and Gender samples.]

Select the option that’s most similar to your current job/professional title.

There were many job titles under the umbrella of ‘coding worker’. It is wise to know the distribution of the different types of professions to examine if there is a group that dominates the survey. Specifically, if there is a profession that dominates, then the answers for all the successive questions would be biased towards the mentality of that group.

CurrentJobTitleSelect (pie chart values)

Data Scientist: 27.6%
Software Developer/Software Engineer: 11.5%
Data Analyst: 10.7%
Other: 9.97%
Scientist/Researcher: 9.5%
Researcher: 5.98%
Business Analyst: 4.66%
Engineer: 4.51%
Machine Learning Engineer: 4.44%
Statistician: 2.86%
Computer Scientist: 2.31%
Predictive Modeler: 1.91%
DBA/Database Engineer: 1.43%
Programmer: 1.28%
Operations Research Practitioner: 0.88%
Data Miner: 0.44%

Level of Formal Education of All Coding Workers

Many people apply for jobs every day. It would be ideal to interview every applicant to increase the chances of finding the best one, but this approach is not feasible. Scanning resumes for keywords that pertain to the job at hand is a good way of sifting out those who may not have the technical expertise for the position. After passing this first step of screening, what distinguishes one applicant from another? Does having more degrees further boost your chances of getting a job? Although examining the educational background of these professionals can only show a correlative relationship between degrees and certain jobs, it is still interesting to see what their educational background is.

FormalEducation (pie chart values)

Master’s degree: 44.9%
Bachelor’s degree: 24.9%
Doctoral degree: 24.3%
Some college/university study without earning a bachelor’s degree: 3.83%
Professional degree: 1.29%
I did not complete any formal education past high school: 0.479%
I prefer not to answer: 0.295%

Formal Education and Specified Profession

The above pie chart shows the formal education of all the coding workers. However, it is more informative to show the formal education distribution of each profession. Alternatively, it is also interesting to examine the profession of individuals with a certain formal education.

[Figures: the FormalEducation distribution within each profession (CurrentJobTitleSelect), and the distribution of professions within each FormalEducation level.]

A Little Peek at Data Scientist’s Job Description

Data science was called the ‘sexiest job of the 21st century’ in the Harvard Business Review in 2012. It is a fairly new and evolving field, and this survey provides some information about the new trends and ‘need to knows’ in data science. Some questions of interest include:
1. What language do they recommend new data scientists learn?
2. Are there any learning platforms they found useful for their career?
3. What sort of tools and algorithms do they use in their job?
4. What are some important skills that a data scientist should have?

What Language do Current Data Scientists Recommend New Data Scientists Learn?

LanguageRecommendationSelect (pie chart values)

Python: 68.4%
R: 23.6%
SQL: 5%
SAS: 1.03%
C/C++/C#: 0.69%
Scala: 0.345%
Other: 0.345%
Java: 0.172%
Julia: 0.172%
Matlab: 0.172%

Results: Python > R

Interestingly, the majority of data scientists suggest learning Python over R. Several reasons could explain this large gap. First, Python may truly be the predominant language in the data science field. Alternatively, there may be a sampling bias: the survey was conducted by Kaggle, which MAY be visited largely by data scientists who use Python as their main or only language. In that case the results are skewed towards individuals who use Python.

What’s the Best Way of Obtaining “Data Science” Skills?

[Figure: share of responses rating each learning platform as Very useful, Somewhat useful, or Not Useful. Platforms: Arxiv, Blogs, College, Communities, Company, Conferences, Courses, Documentation, Friends, Kaggle, Newsletters, Podcasts, Projects, SO, Textbook, TradeBook, Tutoring, YouTube.]

Is doing Projects the way to go?

What Sort of Tools and Algorithms do Data Scientists Use in their Job?

A very simplified description of a data science role includes the collection, analysis, and interpretation of copious amounts of data. Data scientists use many different tools and algorithms to analyze and interpret the data they gather. The charts below show how frequently data scientists use some common algorithms and tools.

Algorithms

[Figure: share of data scientists using each algorithm Most of the time, Often, Sometimes, or Rarely. Algorithms: AssociationRules, Bayesian, CNNs, CollaborativeFiltering, DataVisualization, DecisionTrees, EnsembleMethods, EvolutionaryApproaches, GANs, GBM, HMMs, KNN, LiftAnalysis, LogisticRegression, MLN, NaiveBayes, NeuralNetworks, NLP, PCA, PrescriptiveModeling, RandomForests, RecommenderSystems, RNNs, Segmentation, Select1.]

Tools

[Figure: share of data scientists using each tool Most of the time, Often, Sometimes, or Rarely. Tools: AmazonML, Angoss, AWS, Azure, C, Cloudera, DataRobot, Excel, Flume, GCP, Hadoop, IBMCognos, IBMSPSSModeler, IBMSPSSStatistics, IBMWatson, Impala, Java, Julia, Jupyter, KNIMECommercial, KNIMEFree, Mathematica, MATLAB, MicrosoftRServer, MicrosoftSQL, Minitab, NoSQL, Oracle, Orange, Perl, Python, Qlik, R, RapidMinerCommercial, RapidMinerFree, Salfrod, SAPBusinessObjects, SASBase, SASEnterprise, SASJMP, Select1, Select2, Spark, SQL, Stan, Statistica, Tableau, TensorFlow, TIBCO, Unix.]

What are Some Important Skills that a Data Scientist Should Have?

[Figure: share of responses rating each job skill as Necessary, Nice to have, or Unnecessary. Skills: BigData, Degree, EnterpriseTools, KaggleRanking, MOOC, Python, R, SQL, Stats, Visualizations.]

Observations

All of the data scientists who answered these questions (100 percent) said that ‘BigData’, SQL, and visualization are things a data scientist must know. However, this seems to contradict what we have seen so far. Percentages alone can be very deceiving; the question is how many people actually answered these questions. Below is a histogram showing the number of people who answered the job skill questions.

[Figure: histogram of the number of responses (Necessary, Nice to have, Unnecessary) to the BigData job skill question; the count axis runs from 0 to 3.]

As the histogram above shows, fewer than 5 of the more than 700 data scientists answered these questions. It is not surprising that many people did not answer all the questions, especially given that there were over 200 questions in the question bank. The n for these questions is too low to draw any real conclusions.
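One way to guard against this pitfall is to report every percentage together with the number of people behind it. The sketch below is a minimal Python illustration; the function name and the toy column (3 answers among 700 respondents) are hypothetical and only mimic the situation described above.

```python
from collections import Counter

def share_with_counts(answers):
    """Return ({answer: (count, share)}, n) over the non-missing answers."""
    answered = [a for a in answers if a]   # drop None and empty strings
    counts = Counter(answered)
    n = len(answered)
    table = {a: (c, c / n) for a, c in counts.items()} if n else {}
    return table, n

# Hypothetical column: 700 respondents, only 3 of whom answered.
big_data = ["Necessary"] * 3 + [None] * 697
table, n = share_with_counts(big_data)
for answer, (count, share) in table.items():
    print(f"{answer}: {share:.0%} "
          f"(count = {count}, answered = {n}, asked = {len(big_data)})")
```

Printed this way, a share of 100% is immediately qualified by the fact that it rests on three answers.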

Conclusion

Below is a word cloud in which the size of a word reflects how frequently it was selected as used “most of the time” or rated “very useful”. As you can see, words like Python, data visualization, R, SQL, and projects are emphasized in this word cloud.

[Figure: word cloud.]