Sample_FinalPresentation
VALUE OF A COLLEGE DEGREE
A DATA MINING APPROACH
EM623
DATA SCIENCE AND KNOWLEDGE DISCOVERY
JASON WONG
INTRODUCTION
• It seems like everyone these days is going to school, is in school, or plans to go
back to school
• Why?
• Self-improvement
• Cultural norm
• Economic mobility
• One of the main reasons people pay money to go to school…
• Is so that they can make more money!
BUSINESS UNDERSTANDING
• If higher education is considered as a means of economic mobility and
improvement…
• A college degree should be considered an investment in yourself
• Like all investments…
• There are good investments that pay off
• And there are bad investments that don’t
• How do we determine the value of a college degree, i.e., whether it is a good or a bad
investment?
DATA UNDERSTANDING
• Dataset
• U.S. Department of Education
• College Scorecard Data
• Purpose: Increase transparency in higher education and let the public see how well
different schools serve their students
• https://collegescorecard.ed.gov/data/
• Dataset Specifications
• 6586 observations representing institutions
• 1729 variables or statistics regarding each institution
DATA UNDERSTANDING
• 1729 variables is a lot of variables!
• Most of them do not contain relevant information
• Many of them are constants, contain only null values, etc.
• To understand which types of variables are important…
• Investigate the dataset appendix and documentation
• Understand the context and business case
• Skim the data to observe the quality of the data
DATA UNDERSTANDING
• What types of variables are important?
• Financials (tuition cost, graduate earnings)
• Profile (location, population)
• Expertise (type of school, degrees awarded)
• Faculty (number, pay, ratio)
• What types of variables are not important?
• Low Quality (50%+ are NULL, NA, or blank)
• Unrelated (ethnicity of student body, number of deaths on campus)
• Too Specific (10th percentile SAT scores of Hispanic-heritage students)
DATA PREPARATION
• Remove variables that carry no information
• Only one value (a constant)
• Using Rattle, remove variables
• 1729 variables → 1479 variables
• Remove data that does not meet acceptable quality standards
• More than ~30% of values are NULL, NA, Classified, or blank
• Using Excel, replace filler cells with blank cells
• Using Rattle, identify variables that have many missing observations and remove them from the dataset
• 1479 variables → 312 variables
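The two filters above were run in Rattle and Excel. As a rough illustration only, the same logic can be sketched in plain Python. The column names, the toy rows, and the helper names `is_missing` and `prune_columns` are hypothetical; the filler tokens and the 30% threshold come from the slide.

```python
# Filler tokens named on the slide; real files may use other markers.
FILLER = {"", "NULL", "NA", "Classified"}


def is_missing(value):
    """Treat None and the filler strings as missing."""
    return value is None or str(value).strip() in FILLER


def prune_columns(rows, max_missing=0.30):
    """Return the column names that pass both filters.

    rows: list of dicts sharing the same keys (one dict per institution).
    A column is dropped if more than max_missing of its values are missing,
    or if its non-missing values are all identical (a constant).
    """
    if not rows:
        return []
    kept = []
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if not is_missing(v)]
        missing_share = 1 - len(present) / len(values)
        if missing_share > max_missing:
            continue  # fails the quality threshold
        if len(set(present)) <= 1:
            continue  # constant column carries no information
        kept.append(col)
    return kept


# Toy data, invented for illustration (not taken from College Scorecard).
rows = [
    {"TUITION": 12000, "REGION": "NE", "CONST": 1, "SPARSE": "NULL"},
    {"TUITION": 18000, "REGION": "SW", "CONST": 1, "SPARSE": "NA"},
    {"TUITION": 9000, "REGION": "NE", "CONST": 1, "SPARSE": 3.2},
    {"TUITION": 22000, "REGION": "MW", "CONST": 1, "SPARSE": ""},
]
print(prune_columns(rows))
```

In this toy table, CONST is dropped as a constant and SPARSE is dropped because three of its four values are filler.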
DATA PREPARATION
• Remove data that is unrelated to the business case
• Percent of high income students who died within 2 years of attending institution
• Percent of middle income students who transferred from a 4 year university whose status
is unknown
• Read through each specific variable and delete it manually in Rattle
• 312 variables → 56 variables
DATA PREPARATION
• Remove highly correlated variables (i.e. variables that support the same
conclusions or information)
• Run correlation analysis
• Analyze correlation matrix
• Remove variables with identical or highly similar correlations with the rest of the dataset
• 56 variables → 43 variables
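The slides describe removing variables by inspecting the correlation matrix by hand. A simplified stand-in, not the author's exact procedure, is a greedy pairwise filter: keep a variable only if its absolute Pearson correlation with every variable already kept stays at or below a threshold. The 0.9 threshold, the variable names, and the toy values below are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # undefined for a constant column; treated as uncorrelated
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| <= threshold with all kept columns.

    columns: dict mapping a variable name to its list of values.
    The result depends on the order in which the columns are visited.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy values, invented for illustration.
columns = {
    "tuition": [10, 20, 30, 40, 50],
    "spend_per_student": [12, 19, 33, 41, 52],
    "enrollment": [5, 1, 4, 2, 3],
}
print(drop_correlated(columns))
```

Here spend_per_student tracks tuition almost perfectly and is dropped, while enrollment is only weakly correlated with tuition and is kept.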
CORRELATION MATRIX (INITIAL)
DATA PREPARATION
• Correlated Variables
• Interesting to see which variables had identical / similar correlations with
the rest of the dataset
• Similar variables are not distinguishable from each other…
• In terms of their relationships to the other variables in the dataset
• Tuition Fee and Expenditure per Student
• As tuition fee increases, expenditure per student does as well
• Institutions that charge more… spend more per student! Not surprising
DATA PREPARATION
• Median and Mean Earnings 6 Years after Enrollment
• Expected that these would be closely correlated to one another
• Average Faculty Salary and Undergraduate Enrollment Number
• Interesting that as the number of undergraduates increases, faculty gets paid more
• Factors behind salary increase?
• More administrative overhead
• Coaches, athletics, more amenities at larger institutions
DATA PREPARATION
• Similar Degrees
• Psychology, Math, Social Sciences, and History
• Highly similar correlations; all have a similar effect on the financial outcomes of
colleges
• Student Median Debt and Degree Levels
• As student median debt from an institution increases, the degree level most commonly
received from the institution rises
• In this case, further education is not necessarily financially helpful
DATA PREPARATION
• Target Variable
• Is a degree from a given institution financially wise?
• Involves several different variables
• Median Student Earnings 6 Years after Enrollment (yearly)
• Average Tuition Fee (per semester)
• Debt Possessed (after college)
• Rough formula to derive target variable:
• (1.05 * (Debt + (8 * Tuition) ) ) / (Earnings * 0.4)
DATA PREPARATION
• Method behind the madness…
• (1.05 * (Debt + (8 * Tuition) ) ) / (Earnings * 0.4)
• 1.05 → interest rate (generously low, non-compounded)
• 8 → semesters in four years of college
• 0.4 → share of yearly earnings that is realistically disposable income
• Result: Approximate Number of Years to Pay Back Total Debt
• Possible Values…
• 0 → more than 10 years required, not a wise financial decision
• 1 → fewer than 10 years required, sound financial decision
• Variable titled WORTH
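A minimal sketch of the WORTH derivation, using the constants from the formula above. The function names and the example figures are hypothetical. The slides do not say how a payback period of exactly 10 years is labeled; this sketch assumes it counts as 1.

```python
def years_to_repay(debt, tuition_per_semester, yearly_earnings):
    """Approximate years to pay back total debt, per the slide's rough formula.

    (1.05 * (Debt + 8 * Tuition)) / (Earnings * 0.4)
    """
    total_owed = 1.05 * (debt + 8 * tuition_per_semester)  # 8 semesters, 5% interest
    yearly_payment = 0.4 * yearly_earnings  # disposable share of earnings
    return total_owed / yearly_payment


def worth(debt, tuition_per_semester, yearly_earnings, limit_years=10):
    """Return 1 if the payback period is within limit_years, else 0.

    Assumption: exactly limit_years is labeled 1; the slides leave this open.
    """
    years = years_to_repay(debt, tuition_per_semester, yearly_earnings)
    return 1 if years <= limit_years else 0


# Two invented institutions, for illustration only.
print(years_to_repay(20000, 10000, 40000), worth(20000, 10000, 40000))
print(years_to_repay(30000, 20000, 30000), worth(30000, 20000, 30000))
```

The first example owes about 105,000 against roughly 16,000 of yearly payments, so it repays in under seven years. The second needs more than sixteen years and is labeled 0.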
MODELING
• Supervised Learning
• Identified objective, target variable WORTH
• Decision Tree Model
• Identify institution characteristics that make a degree from a given institution WORTH or
!WORTH
• Model Parameters
• Min Split = 500
• Min Bucket = 100
• Max Depth = 30
• Complexity = 0.0010
MODELING
EVALUATION
• So we have the model…
• Does it mean anything?
• Is it accurate and reliable?
• Error Matrix
• Reasonably good at predicting institutions with a good return
• Less accurate at predicting institutions with a poor return
• Overall error: 0.1714 (about 17%), can be improved
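For reference, a sketch of how an error matrix and its error rates are computed from actual and predicted WORTH labels. The labels below are invented to mirror the pattern described above (better on good-return institutions than on poor-return ones); they are not the model's actual results, and the function names are hypothetical.

```python
def error_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for binary labels 0 and 1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def overall_error(counts):
    """Share of all observations that were misclassified."""
    total = sum(counts.values())
    wrong = counts[(0, 1)] + counts[(1, 0)]
    return wrong / total


def class_error(counts, label):
    """Share of observations with the given actual label that were misclassified."""
    row_total = counts[(label, 0)] + counts[(label, 1)]
    return counts[(label, 1 - label)] / row_total


# Invented labels, for illustration only.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

counts = error_matrix(actual, predicted)
print(overall_error(counts))   # error across all institutions
print(class_error(counts, 1))  # error on actual WORTH = 1
print(class_error(counts, 0))  # error on actual WORTH = 0
```

Reporting the per-class error alongside the overall error makes the imbalance visible: a single overall figure can hide weak performance on the smaller class.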
EVALUATION
• Receiver Operating Characteristic (ROC) Curve Plot
• Visualizes how well the model separates the two classes
• Plots the true positive rate against the false positive rate
• Greater area under the curve (AUC) = better model performance
• AUC = 0.84
• Good, but not “too good to be true”
• A very high AUC would suggest that variables used to derive the target variable leaked
into the model as inputs
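AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The labels and scores are invented and are unrelated to the 0.84 reported above; the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented labels and model scores, for illustration only.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
print(auc(labels, scores))
```

In this toy case five of the six positive-negative pairs are ordered correctly. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.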
EVALUATION
CONCLUSION
• Model Rule Conclusions
• Public Institutions → always worth
• Institutions with ~1% graduates in Visual and Performing Arts → at risk
• Institutions with ~1% graduates in Culinary Services → at risk
• If located in the Northeast, out of luck
• Private institutions with < 1% graduates in Visual and Performing Arts and Culinary
Services
• Not at risk
LOOKING AHEAD
• Somewhat limited practical applicability
• Subjective derivation of WORTH variable
• Many variables are low quality or unrelated
• Greater range of useful variables…
• Would increase relevance of input variables
• Could introduce new types of variables
• Overall
• An interesting case study that was time-consuming, but worth looking into