Regression: Introduction & Linear Regression
Ch. 4. Multivariate Data Analysis. Joseph Hair et al. 2010. Pearson
Ch. 6. Learn R for Applied Statistics. Eric Hui. 2018. Apress
Ch. 2. Regression Analysis. William Mendenhall and Terry Sincich. 2012. 7th edition. Pearson
Ch. 7. Simple Linear Regression. Applied Statistics with R. David Dalpiaz. 2019
Regression in Applied Statistics
Hypothesis: null (H0) and alternative (HA)
Descriptive Statistics
Derives dataset summary:
– central tendency
– dispersion
– skewness
Inferential Statistics
Makes inference about the population
Uses hypothesis testing and parameter estimation
Inference Test
– p < 0.05 (alpha): Reject H0
– p > 0.05 (alpha): Fail to Reject H0
Regression:
a set of statistical processes for estimating the relationships among variables
In week 3 we discussed descriptive statistics, which derives a summary of the data set using measures of central tendency, dispersion, and skewness.
Inferential statistics describes and makes inferences about the population from the sampled data. In inferential statistics, you use hypothesis testing and parameter estimation.
In hypothesis testing, you try to answer a research question.
Based on the research question, you state a null hypothesis, H0, and an alternative hypothesis, HA.
You can then use inference tests to get the p-value. If the p-value is less than or equal to alpha, which is usually 0.05, you reject the null hypothesis and conclude that the alternative hypothesis is supported at the 95% confidence level. If the p-value is more than 0.05, you fail to reject the null hypothesis.
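As a quick illustration of this decision rule, here is a minimal R sketch. The data are simulated, so the group names and numbers are purely hypothetical; the point is only to show an inference test returning a p-value that is compared against alpha = 0.05.
set.seed(1)                                      # for reproducibility
group_a <- rnorm(30, mean = 10, sd = 2)          # simulated sample A
group_b <- rnorm(30, mean = 11, sd = 2)          # simulated sample B
result  <- t.test(group_a, group_b)              # inference test: two-sample t-test
result$p.value                                   # the p-value returned by the test
if (result$p.value < 0.05) "Reject H0" else "Fail to reject H0"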
Regression analysis is a set of statistical processes for estimating the relationships among variables. More specifically, regression analysis is used to understand the relationships between independent and dependent variables and to explore the forms of those relationships.
Model
The variable to be predicted (or modeled), y, is called the dependent (or response) variable
The variables used to predict (or model) y are called independent variables and are denoted by the symbols x1, x2, x3
(Mendenhall, 2012)
β1 (beta one) = slope of the line [the amount of increase (or decrease) in the mean of y for every 1-unit increase in x]
β0 (beta zero) = y-intercept of the line [the point at which the line intercepts the y-axis]
In regression, the variable y to be modeled is called the dependent (or response) variable.
The model y = E(y) + ε, a deterministic component plus a random error, is called a probabilistic model for y. When certain assumptions about the model are satisfied, we can make a probability statement about the magnitude of the deviation between y and E(y), the expected value of y.
We will need to use sample data to estimate the parameters of the probabilistic model, namely the mean E(y) and the random error ε. You could think of this in a number of ways:
Response = Prediction + Error
Response = Signal + Noise
Response = Model + Unexplained
Response = Deterministic + Random
Response = Explainable + Unexplainable
To estimate E(y), we need to find the mathematical model that relates y to a set of independent variables and best fits the data; this is part of the process known as regression analysis.
The variables used to predict (or model) y are called independent variables and are denoted by the symbols x1, x2, x3
So we would like to model the relationship between X and Y using the form Y = f(X) + ε, where the function f describes the functional relationship between the two variables.
We are going to restrict f(X) to a linear function, f(x) = β0 + β1x, with β0 as the intercept and β1 as the slope.
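As a purely hypothetical worked example of the deterministic and random components: if β0 = 2 and β1 = 0.5, the deterministic component is E(y) = 2 + 0.5x, so at x = 10 the mean response is E(y) = 2 + 0.5(10) = 7; an observed value of y = 7.8 at that x would then correspond to a random error of ε = 7.8 − 7 = 0.8.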
Regression Types
Classified by three metrics: number of independent variables, type of dependent variable, and shape of the regression line
– Simple: 1 independent variable
– Multiple: > 1 independent variable
– Linear: continuous dependent variable; linear line shape
– Logistic: binary dependent variable; logistic line shape
– Ridge: highly correlated independent variables
– Nominal (multinomial logistic): dependent variable with > 2 categories
– Poisson: count dependent variable
– Quadratic: curvilinear line shape
– Stepwise: identification of best variables
– Lasso: ridge with variable selection
– Ordinal: ordered response
– Multivariate: > 1 dependent variable
There are various kinds of regression techniques available to make predictions. These techniques are mostly driven by three metrics: the number of independent variables, the type of dependent variable, and the shape of the regression line.
When you have only 1 independent variable and 1 dependent variable, it is called simple linear regression.
When you have more than 1 independent variable and 1 dependent variable, it is called multiple linear regression.
Linear regression – the dependent variable is continuous in nature. The relationship between the dependent variable and independent variables is assumed to be linear in nature.
We should use logistic regression when the dependent variable is binary (0/ 1, True/ False, Yes/ No) in nature.
Ridge regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated). With multicollinearity, even though the least squares (OLS) estimates are unbiased, their variances are large, so the estimates may lie far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
Multinomial logistic regression is a model with a multinomial response (more than 2 categories).
Poisson regression is used with discrete data: non-negative integer values that count something.
Stepwise regression and Best subsets regression: These automated methods can help identify candidate variables early in the model specification process.
Lasso regression (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression but with variable selection.
Ordinal regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups with a natural order, such as hot, medium, and cold.
Multivariate Regression is a method with more than one dependent variable (responses).
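As a rough sketch of how a few of these types map onto standard R fitting functions. This mapping is an illustration, not from the slides, and the data frame and column names (df, x1, x2, and the response columns) are hypothetical, generated here only so the code runs.
# purely illustrative data; the names below are hypothetical
set.seed(7)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y_cont  <- 1 + 2 * df$x1 + rnorm(100)                        # continuous response
df$y_bin   <- rbinom(100, size = 1, prob = plogis(df$x1))       # binary response
df$y_count <- rpois(100, lambda = exp(0.5 + 0.3 * df$x1))       # count response

fit_simple   <- lm(y_cont ~ x1, data = df)                          # simple linear regression
fit_multiple <- lm(y_cont ~ x1 + x2, data = df)                     # multiple linear regression
fit_logistic <- glm(y_bin ~ x1 + x2, family = binomial, data = df)  # logistic regression
fit_poisson  <- glm(y_count ~ x1 + x2, family = poisson, data = df) # Poisson regression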
Key Terms: Error Types
α (alpha): the level of significance, the level of risk we accept in making a wrong decision about a null hypothesis; common values: 0.05, 0.01, 0.001
β (beta): the probability of committing a Type II error
– Reject null when the null is true: Type I error (False Positive)
– Reject null when the null is false: right decision
– Retain null when the null is true: right decision
– Retain null when the null is false: Type II error (False Negative)
When α is set to 0.05, p values < 0.05 indicate significance
Inferential statistics involve making decisions about hypotheses.
The level of risk we accept in making a wrong decision about a null hypothesis is symbolized as alpha (α). Making the wrong decision to reject the null when it is in fact true is known as a Type I error, and in this context the word error does mean a mistake. We are commonly satisfied with levels of risk equal to .05, .01, and .001. Where α has been set to 0.05, p-values under 0.05 indicate significant differences or relationships.
A Type II error is retaining the null when it is in fact false, and the probability of committing a Type II error is symbolized as beta (β).
Power (1 − β) is the probability that a significant relationship will be found if it actually exists.
Simple Linear Regression
Random unobserved variables: ϵi, independent and identically distributed (iid) normal random error variables
(David Dalpiaz, 2019)
Y - Response
X - Predictor
Fixed unknown parameters: β0, β1, and σ2
Fixed known constants: xi
Random variables: Yi and their possible values yi
Note: for each x, the y-values spread about the mean E(y) with a standard deviation σ that is the same for every value of x.
(Shaffer and Zhang, 2019. Introductory Statistics)
Simple: y depends on only one other variable
Let’s consider a simple example of how the speed of a car affects its stopping distance, that is, how far it travels before it comes to a stop. So we are interested in using the predictor variable speed to predict and explain the response variable distance.
Let’s define the model for this data:
Yi = β0 + β1xi + ϵi, where ϵi ~ N(0, σ2)
In this model, xi is a fixed known constant (and so it is written in lower case).
This model has three fixed unknown constants, the parameters to be estimated: β0, β1, and σ2.
The ϵi are independent and identically distributed (iid) normal random variables with mean 0 and variance σ2.
Recall that we use capital Y to indicate a random variable, and lower case y to denote a potential value of the random variable. Since we will have n observations, we have n random variables Yi and their possible values yi.
Note that for each value of x the y-values scatter about the mean E(y) according to a normal distribution centered at E(y) and with a standard deviation σ that is the same for every value of x.
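To make the roles of the fixed constants and the random errors concrete, here is a minimal simulation sketch in R; the parameter values (β0 = 2, β1 = 0.5, σ = 1) are arbitrary choices for illustration, not taken from the slides.
set.seed(42)                                   # for reproducibility
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1         # fixed parameters (here, chosen by hand)
x   <- seq(1, 20, by = 1)                      # fixed known constants x_i
eps <- rnorm(length(x), mean = 0, sd = sigma)  # iid normal random errors
y   <- beta_0 + beta_1 * x + eps               # responses: mean E(y) plus random error
plot(x, y)                                     # points scatter about the true mean line
abline(beta_0, beta_1)                         # the true line E(y) = 2 + 0.5x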
Simple Linear Regression Assumptions
1. Variable Types: Continuous (Interval or Ratio)
2. Linear: The relationship between Y and x is linear (inspect your Y and X relationship in a scatterplot)
3. Outliers: There should be no significant outliers: high leverage, large residuals, large influence (see Ch. 13, Applied Statistics with R, David Dalpiaz)
4. Independence: You should have independence of observations
5. Equal Variance: The variances along the line of best fit remain similar (homoscedasticity vs. heteroscedasticity)
6. Normal: The errors ϵ are normally distributed
Note: the values of x are fixed. We do not make a distributional assumption about the predictor variable.
(David Dalpiaz, 2019)
Let’s discuss the assumptions of the linear model. You need to know them because it is only appropriate to use linear regression if your data "passes" the six assumptions required for linear regression to give you a valid result.
“simple” means that y depends on only one other variable and not two or more
Assumption 1. Your two variables should be measured at the continuous level (i.e., they are either interval or ratio variables).
Assumption 2. Linear. The relationship between Y and x is linear. You can plot the dependent variable against your independent variable and then visually inspect the scatterplot to check for linearity.
Assumption 3. There should not be any significant outliers. Examine your data for outliers with high leverage, large residuals and large influence.
Assumption 4. Independence. You should have independence of observations.
Assumption 5. Equal Variance. At each value of x, the variance of Y is the same, σ2. Our data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar.
Homoscedasticity. The errors have constant variance about the true model.
Heteroscedasticity. The errors have non-constant variance about the true model.
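A common way to check the linearity, equal-variance, and normality assumptions after fitting is via residual diagnostics. A minimal sketch in R, using the fitted model from the cars example that appears later in this deck:
model <- lm(dist ~ speed, data = cars)     # fitted simple linear regression
plot(model$fitted.values, resid(model))    # residuals vs fitted: look for constant spread about 0
abline(h = 0)                              # reference line at zero
qqnorm(resid(model)); qqline(resid(model)) # normal Q-Q plot: points near the line suggest normal errors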
Fitting the Model: The Method of Least Squares
Deviation (residual): the vertical distance between observed and predicted values
Fitted line: ŷ (y-hat)
SSE: the sum of squares of residuals
Least squares estimates: find the line that minimizes the sum of all the squared distances from the points to the line
We need to find β0 and β1 that make the SSE a minimum.
To determine how well the model fits our data, we need to measure the deviation of actual data points from the fitted line.
Deviations are errors of predictions (or residuals) and represented as vertical distances between observed data points and predicted values on the fitted line.
We could find the line that minimizes the sum of all the squared distances from the points to the line; this is called the method of least squares.
The fitted line is represented as ŷ (y-hat); ŷ is a predictor of some future value of y.
The deviation of a y value from its predicted value is called the residual.
The sum of squares of these deviations is the SSE: SSE = Σ(yi − ŷi)2.
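As a sketch of what the method of least squares actually computes, the closed-form estimates β̂1 = Sxy / Sxx and β̂0 = ȳ − β̂1x̄ can be checked by hand in R against lm(); the cars data below is the built-in dataset used later in this deck.
x <- cars$speed; y <- cars$dist
Sxy <- sum((x - mean(x)) * (y - mean(y)))    # sum of cross-deviations
Sxx <- sum((x - mean(x))^2)                  # sum of squared deviations of x
beta_1_hat <- Sxy / Sxx                      # least squares slope
beta_0_hat <- mean(y) - beta_1_hat * mean(x) # least squares intercept
c(beta_0_hat, beta_1_hat)                    # should match coef(lm(dist ~ speed, data = cars))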
Model Summary in R: lm()
model = lm(dist ~ speed, data = cars)   # dist = response, speed = predictor
summary(model)
1. Residuals: 5 summary points (compare against a mean of 0)
2. Coefficients:
– intercept (beta_zero) = mean of dist when speed = 0
– slope (beta_one) = for every 1 mph increase in speed, the stopping distance increases by 3.9 feet
https://xkcd.com/605/
The lm() command is used to fit linear models and is one of the most commonly used modeling tools in R.
We use the dist ~ speed syntax to tell R we would like to model the response variable dist as a linear function of the predictor variable speed. You should think of the syntax as response ~ predictor.
We inspect the fitted model with the summary() function.
The Residuals section breaks the residuals down into 5 summary points. When assessing residuals, you should look for a symmetrical distribution across these points centered on the mean value zero (0). Here the distribution of the residuals does not appear to be strongly symmetrical, which means the model predicts certain points that fall far away from the actual observed points.
Coefficients: the coefficients are two unknown constants that represent the intercept and slope terms in the linear model.
The intercept parameter β0 tells us the mean stopping distance for a car traveling zero miles per hour (not moving). This is a case of extrapolation, where x falls outside the range of the data. β0 will not always have a practical interpretation: only when x = 0 is within the range of the x-values in the sample, and is a practical value, will ˆβ0 have a meaningful interpretation (compare zero degrees versus zero weight or speed).
The second coefficient is the slope: the effect speed has on the distance required for a car to stop. The slope term in our model says that for every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324088 feet.
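The fitted coefficients can then be used for prediction. A minimal sketch with R's predict(); the chosen speed of 20 mph is just an example value.
model <- lm(dist ~ speed, data = cars)
coef(model)                                      # beta_zero and beta_one estimates
predict(model, newdata = data.frame(speed = 20)) # estimated mean stopping distance at 20 mph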
Model Summary in R: lm()
summary(model)
3. Standard Error: the standard deviation of an estimate; low values are ideal
4. t value: coefficient / standard error
5. p value: individual p value for each parameter
6. Residual Standard Error: a measure of the quality of a linear regression fit
7. R-squared: how well the model is fitting the actual data
8. F-Statistic: indicator of a relationship between predictor and response
Felipe Rego, 2015. Quick Guide: Interpreting Simple Regression.
The standard error is an estimate of the standard deviation of the coefficient, i.e., how much the estimate would vary from sample to sample. The standard error of the coefficient is always positive.
The t statistic is the coefficient divided by its standard error.
The coefficient t-value is a measure of how many standard deviations our coefficient estimate is away from 0. A high t-value indicates we could reject the null hypothesis.
The p-value is the individual p value for each parameter, used to decide whether to reject the null hypothesis; it indicates whether the relationship between x and y is statistically significant.
The residual standard error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term ε. The residual standard error is the average amount that the response (dist) will deviate from the true regression line. In our example, the actual distance required to stop can deviate from the true regression line by approximately 15.38 feet, on average.
We calculated the residual standard error with 48 degrees of freedom. Degrees of freedom are the number of data points that went into the estimation of the parameters, after taking those parameters into account (the restriction). In our case, we had 50 data points and two parameters (intercept and slope), giving 50 − 2 = 48 degrees of freedom.
R-squared measures how well the model is fitting the actual data. It always lies between 0 and 1. In our example, the R2 is 0.65. Or roughly 65% of the variance found in the response variable (dist) can be explained by the predictor variable (speed).
In multiple regression settings, the R2 will always increase as more variables are included in the model. That’s why the adjusted R2 is the preferred measure as it adjusts for the number of variables.
The F-statistic is a good indicator of whether there is a relationship between our predictor and the response variables. The further the F-statistic is from 1, the better. How much larger the F-statistic needs to be depends on both the number of data points and the number of predictors. In our example the F-statistic is 89, which is relatively large given the size of our data.
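If you want these quantities programmatically rather than reading them off the printed summary, a short sketch using standard base-R accessors:
model <- lm(dist ~ speed, data = cars)
s <- summary(model)
coef(s)        # estimates, standard errors, t values, p values
s$sigma        # residual standard error
s$r.squared    # R-squared
s$fstatistic   # F-statistic and its degrees of freedom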
Model Summary in Python: OLS
Add Intercept (none added by default)
lm() equivalent:
import statsmodels.formula.api as smf
Workflow
STEP 1. Confirm Linear Relationship
data(cars)
with(cars, plot(y=dist, x=speed))
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('seaborn')
df = pd.read_csv("cars.csv")
df.plot(x = 'speed', y ='dist', kind='scatter')
plt.show()
The plot shows a fairly strong positive relationship
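To quantify that visual impression, you can also compute the correlation; a small sketch in R (for the built-in cars data the printed value should come out around 0.8):
data(cars)
cor(cars$speed, cars$dist)   # Pearson correlation between speed and stopping distance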
Workflow Example
STEP 2. Run Regression
model = lm(dist~speed, data=cars)
summary(model)
import statsmodels.api as sm
y = df.dist
x = df.speed
x = sm.add_constant(x)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())
STEP 3. Interpret Summary Output
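When interpreting the summary output, it can also help to look at interval estimates for the coefficients rather than only the point estimates. A brief sketch in R (statsmodels offers the analogous results.conf_int()):
model <- lm(dist ~ speed, data = cars)
confint(model, level = 0.95)   # 95% confidence intervals for the intercept and slope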
Workflow
STEP 4. Create a plot with abline
library(ggplot2)
ggplot(cars, aes(x=speed, y=dist))+
geom_point()+
geom_smooth(method=lm, se=TRUE)
import seaborn as sns
sns.set(color_codes=True)
g = sns.lmplot(x="speed", y="dist", data=df)
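Since the step title mentions abline, the same plot can also be drawn in base R without ggplot2; a minimal sketch:
model <- lm(dist ~ speed, data = cars)
plot(dist ~ speed, data = cars)   # scatterplot of the data
abline(model)                     # add the fitted regression line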