Sample_FinalPresentation

VALUE OF A COLLEGE DEGREE
A DATA MINING APPROACH

EM623

DATA SCIENCE AND KNOWLEDGE DISCOVERY

JASON WONG

INTRODUCTION

• It seems like everyone these days is going to school, is in school, or plans to go
back to school

• Why?
• Self-improvement
• Cultural norm
• Economic mobility

• One of the main reasons people pay money to go to school…
• Is so that they can make more money!

BUSINESS UNDERSTANDING

• If higher education is considered as a means of economic mobility and
improvement…
• A college degree should be considered an investment in yourself

• Like all investments…
• There are good investments that pay off
• And there are bad investments that don’t

• How do we determine the value of a college degree, and whether it is a good or a bad
investment?

DATA UNDERSTANDING

• Dataset
• U.S. Department of Education

• College Scorecard Data
• Purpose: Increase transparency in higher education and let the public see how well
different schools serve their students
• https://collegescorecard.ed.gov/data/

• Dataset Specifications
• 6586 observations representing institutions
• 1729 variables or statistics regarding each institution

DATA UNDERSTANDING

• 1729 variables is a lot of variables!
• Most of them do not contain relevant information
• Many of them are constants, contain only null values, etc.

• To understand which types of variables are important…
• Investigate the dataset appendix and documentation
• Understand the context and business case
• Skim the data to gauge its quality

DATA UNDERSTANDING

• What types of variables are important?
• Financials (tuition cost, graduate earnings)
• Profile (location, population)
• Expertise (type of school, degrees awarded)
• Faculty (number, pay, ratio)

• What types of variables are not important?
• Low Quality (50%+ are NULL, NA, or blank)
• Unrelated (ethnicity of student body, number of deaths on campus)
• Too Specific (10th percentile SAT scores of Hispanic-heritage students)

DATA PREPARATION

• Remove variables that obviously carry no information
• Only one value, a constant
• Using Rattle, remove variables
• 1729 variables → 1479 variables

• Remove data that does not meet acceptable quality standards
• More than ~30% of values are NULL, NA, Classified, or blank
• Using Excel, replace filler cells with blank cells
• Using Rattle, identify variables that have many missing observations and remove them from the dataset
• 1479 variables → 312 variables
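The two clean-up passes above (constants, then sparse variables) were done in Rattle and Excel. As a rough illustration only, the same filtering could be sketched in plain Python. The column names, the sample values, and the exact set of filler markers below are assumptions made for the example, not taken from the College Scorecard files.

```python
# Markers treated as missing. The exact set is an assumption drawn from the
# slide text ("NULL, NA, Classified or blank"); the real files may use others.
MISSING_MARKERS = {"", "NULL", "NA", "Classified"}


def is_missing(value):
    """Return True for None or for a string that is only a filler marker."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip() in MISSING_MARKERS


def drop_low_quality(columns, max_missing=0.30):
    """Keep columns that are neither too sparse nor constant.

    columns maps a variable name to its list of values (one per institution).
    max_missing=0.30 mirrors the ~30% cutoff mentioned on the slide.
    """
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if not is_missing(v)]
        missing_share = 1 - len(present) / len(values) if values else 1.0
        if missing_share > max_missing:
            continue  # too many NULL / NA / Classified / blank cells
        if len(set(present)) <= 1:
            continue  # a single repeated value carries no information
        kept[name] = values
    return kept


# Hypothetical toy data: four institutions, four variables.
sample = {
    "TUITION": [100, 200, 300, 400],
    "CONST": [1, 1, 1, 1],
    "SPARSE": [1, "NULL", "NA", 2],
    "MOSTLY_OK": [1, 2, "", 4],
}
print(list(drop_low_quality(sample)))  # ['TUITION', 'MOSTLY_OK']
```

Checking sparsity before constancy means a column that is constant only because most of its cells are filler is reported as sparse, which matches the order of reasoning on the slide.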

DATA PREPARATION

• Remove data that is unrelated to the business case
• Percent of high income students who died within 2 years of attending institution
• Percent of middle income students who transferred from a 4 year university whose status
is unknown

• Read through each variable and manually delete the unrelated ones in Rattle
• 312 variables → 56 variables

DATA PREPARATION

• Remove highly correlated variables (i.e. variables that support the same
conclusions or information)
• Run correlation analysis
• Analyze correlation matrix
• Remove variables with identical or highly similar correlations with the rest of the dataset
• 56 variables → 43 variables
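The pruning above was done by inspecting the correlation matrix by hand. A simpler, commonly used stand-in is to drop one variable from every pair whose pairwise Pearson correlation is very high. The sketch below uses that pairwise rule, which is not necessarily the rule applied in this project; the 0.9 threshold, the variable names, and the values are all assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (constants were removed earlier),
    otherwise the denominator would be zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def prune_correlated(columns, threshold=0.9):
    """Greedily keep a variable only if it is not highly correlated
    (|r| >= threshold) with any variable already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data for five institutions.
toy = {
    "tuition": [10, 20, 30, 40, 50],
    "spend_per_student": [12, 19, 33, 41, 52],  # tracks tuition closely
    "enrollment": [5, 1, 4, 2, 3],
}
print(prune_correlated(toy))  # ['tuition', 'enrollment']
```

Because the rule is greedy, the result depends on column order: the first variable of a correlated pair survives. Reordering the columns so that the more interpretable variable comes first is one way to control which one is kept.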

CORRELATION MATRIX (INITIAL)

DATA PREPARATION

• Correlated Variables
• Interesting to see what variables had identical / similar correlational relationships with
the rest of the dataset

• Similar variables are not distinguishable from each other…
• In the context of relationships to the variables in the dataset

• Tuition Fee and Expenditure per Student
• As tuition fee increases, expenditure per student does as well
• Institutions that charge more… spend more per student! Not surprising

DATA PREPARATION

• Median and Mean Earnings 6 Years after Enrollment
• Expected that these would be closely correlated to one another

• Average Faculty Salary and Undergraduate Enrollment Number
• Interesting that as the number of undergraduates increases, faculty gets paid more
• Factors behind salary increase?

• More administrative overhead
• Coaches, athletics, more amenities at larger institutions

DATA PREPARATION

• Similar Degrees
• Psychology, Math, Social Sciences, and History
• Highly similar correlations; all have a similar effect in the context of financial outcomes of
colleges

• Student Median Debt and Degree Levels
• As student median debt from an institution increases, the degree level most commonly
received from the institution rises

• In this case, further education is not necessarily financially helpful

DATA PREPARATION

• Target Variable
• Is a degree from a given institution financially wise?

• Involves several different variables
• Median Student Earnings 6 Years after Enrollment (yearly)
• Average Tuition Fee (semesterly)
• Debt Possessed (after college)

• Rough formula to derive target variable:
• (1.05 * (Debt + (8 * Tuition) ) ) / (Earnings * 0.4)

DATA PREPARATION

• Method behind the madness…
• (1.05 * (Debt + (8 * Tuition) ) ) / (Earnings * 0.4)

• 1.05 → Interest rate (generously low, non-compounded)
• 8 * Tuition → Four years of college (eight semesters of tuition)
• Earnings * 0.4 → Realistic yearly disposable income
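As a worked illustration, the formula can be written as a small function. The constants 1.05, 8, and 0.4 come from the slide; the function and parameter names and the sample figures are hypothetical. The 10-year cutoff is the one the deck uses for WORTH, and treating exactly 10 years as not worth it is an assumption, since the slides do not say which side the boundary falls on.

```python
def years_to_repay(debt, tuition_per_semester, yearly_earnings,
                   interest_factor=1.05, semesters=8, disposable_share=0.4):
    """Approximate number of years needed to pay back total college cost.

    total owed       = interest_factor * (debt + semesters * tuition)
    yearly repayment = disposable_share * yearly_earnings
    Raises ZeroDivisionError if yearly_earnings is 0.
    """
    total_owed = interest_factor * (debt + semesters * tuition_per_semester)
    return total_owed / (disposable_share * yearly_earnings)


def worth(years, cutoff=10):
    """1 if repayment takes fewer than `cutoff` years, else 0.

    Exactly `cutoff` years maps to 0 here; that boundary is an assumption.
    """
    return 1 if years < cutoff else 0


# Hypothetical institution: $20,000 debt, $5,000 per semester, $40,000 per year.
years = years_to_repay(20_000, 5_000, 40_000)
print(round(years, 2), worth(years))  # roughly 3.94 years, flagged as 1
```

With these made-up figures the total owed is 1.05 * (20,000 + 40,000) = 63,000 and the yearly repayment is 0.4 * 40,000 = 16,000, so repayment takes a little under four years.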

• Result: Approximate Number of Years to Pay Back Total Debt
• Possible Values…

• 0 → > 10 years required, not a wise financial decision
• 1 → < 10 years required, sound financial decision

• Variable titled WORTH

MODELING

• Supervised Learning
• Identified objective, target variable WORTH

• Decision Tree Model
• Identify institution characteristics that make a degree from a given institution WORTH or !WORTH

• Model Parameters
• Min Split = 500
• Min Bucket = 100
• Max Depth = 30
• Complexity = 0.0010

MODELING

EVALUATION

• So we have the model…
• Does it mean anything?
• Is it accurate and reliable?

• Error Matrix
• Relatively decent at predicting institutions with a good return
• Not so accurate at predicting institutions with a poor return
• Overall error: 0.1714, can be improved

EVALUATION

• Receiver Operating Characteristic (ROC) Curve Plot
• Visualize the model’s accuracy
• Plots the model’s true positive rate against its false positive rate
• Greater area under the curve (AUC) = better model performance

• AUC = 0.84
• Good, but not “too good to be true”
• Very high AUC values would suggest that variables which directly explain the target
(such as those used to derive WORTH) are being used within the model

EVALUATION

CONCLUSION

• Model Rule Conclusions
• Public Institutions → always worth
• Institutions with ~1% graduates in Visual and Performing Arts → at risk
• Institutions with ~1% graduates in Culinary Services → at risk
• If located in the Northeast, out of luck

• Private institutions with < 1% graduates in Visual and Performing Arts and Culinary Services
• Not at risk

LOOKING AHEAD

• Somewhat limited practical applicability
• Subjective derivation of WORTH variable
• Many variable fields low quality or unrelated

• Greater range of useful variables…
• Would increase relevance of input variables
• Could introduce new types of variables

• Overall
• An interesting case study that was time-consuming, but worth looking into
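For reference, the evaluation measures quoted on the EVALUATION slides (error matrix, overall error, AUC) were produced by Rattle. The sketch below shows how such measures can be computed from a model's outputs in plain Python. The labels, the scores, and the 0.5 decision threshold are made up for illustration and are unrelated to the 0.1714 and 0.84 figures reported in the deck.

```python
def error_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary 0/1 target."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def overall_error(actual, predicted):
    """Share of observations whose predicted class differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def auc(actual, scores):
    """Area under the ROC curve via its rank interpretation: the probability
    that a random positive scores higher than a random negative (ties count
    as half). Requires at least one positive and one negative."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outputs for five institutions (1 = WORTH, 0 = !WORTH).
actual = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2]               # predicted probability of WORTH
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(error_matrix(actual, predicted))
print(overall_error(actual, predicted))           # 0.4
print(round(auc(actual, scores), 3))              # 0.833
```

The pairwise AUC computation is quadratic in the number of observations, which is fine for a dataset of a few thousand institutions; a sort-based version would be preferable for much larger data.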