Lab1_Linear_Regression
Linear Regression¶
In this notebook lab, you will explore the data and apply linear regression.
As you execute the cells, please also try to understand the code and the results, and think about how to answer the questions and follow the instructions marked in purple.
First, we import all the packages we need.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.metrics import r2_score
print('hello world')
hello world
Let's start by experimenting with linear regression, modelling 4 different functions.
In order to compare their performance, let's use the R2 score.
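As a reminder, the R2 score (coefficient of determination) measures how much better the model is than simply predicting the mean of the targets:
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$,
where $\hat{y}_i$ are the model's predictions and $\bar{y}$ is the mean target value. A score of 1 means a perfect fit, while a score near 0 means the model does little better than predicting the mean.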
Which function is the easiest and which is the hardest to model with linear regression (as measured by R2)?
def original(x):
    return x
def square(x):
return x*x
def change_func(N,func,noise_factor,title,i):
    data_x = np.arange(1,N).reshape(-1,1) # Generate an array with values from 1 to N-1
    #.reshape(-1,1) makes 1 col and as many rows as data allows
    rdm = (np.random.rand(N-1)-0.5).reshape(-1,1) # Random noise: rand(N-1) returns N-1 numbers in [0,1), shifted to [-0.5,0.5)
    data_y = (func(data_x) + rdm*noise_factor) # different function wrt x, with noise
    # Create and train linear regression model
    regr = linear_model.LinearRegression()
    regr.fit(data_x,data_y)
    predicted_y = regr.predict(data_x)
    metric = r2_score(data_y,predicted_y)
    plt.subplot(2,2,i)
    plt.title(title+", R2="+str(np.round(metric,4)))
    plt.scatter(data_x,data_y,color="b",s=2)
    plt.scatter(data_x,predicted_y,color="r",s=1)
    plt.xticks([]), plt.yticks([])
N = 100  # number of sample points (value assumed here; N was not defined above)
plt.figure(1)
change_func(N,original,0,"y=x",1)
change_func(N,np.sqrt,0,"y=sqrt(x)",2)
change_func(N,square,0,"y=x**2",3)
change_func(N,np.log,0,"y=log(x)",4)
plt.show()
Reading Datasets¶
In order to perform machine learning, we typically need a significant amount of data. By understanding the data, analysing patterns and training our algorithms, we can achieve meaningful results. Scikit-learn makes it easy for us to access some pre-defined ‘toy’ datasets to practice our understanding.
In this example, we'll use the “Life expectancy” dataset adapted to this lab, which contains 500 records. More information about the original version of this dataset can be found online.
The 9 features in the dataset represent Infant deaths, Alcohol, Percentage expenditure, Hepatitis B, Measles, BMI, Total expenditure, Population and Schooling. The target of interest is Life expectancy. We'll use these features to find a regression to predict the target.
Later, we’ll apply a linear regression to the data, but some features will be better suited to this than others.
Read through the code below to understand how this particular data is structured.
Try to find which feature has the most linear relationship with the target.
Data Exploration¶
First, we load the life expectancy dataset.
# load the life expectancy dataset
df=pd.read_csv('life_expectancy.csv')
target_name="Life expectancy"
target=df[target_name]
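Before plotting anything, it can help to peek at the raw structure of the data. Here is a small sketch using standard pandas calls on the df loaded above:
print(df.shape)   # (number of records, number of columns)
print(df.head())  # first few rows, including the target column
print(df.dtypes)  # data type of each column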
Sometimes you need to see the data plotted out to understand more. Seaborn is a library that wraps Matplotlib (the standard Python library for data plotting) and is extremely convenient to use. For example, the plot below shows a box plot of the target values of all records.
sns.set_style(“whitegrid”)
#Box plot (shows the 5-number summary:
# min, first quartile, median, third quartile, max)
sns.boxplot(x=target)
plt.show()
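If you want the numbers behind the box plot, pandas can print a summary of the target directly (a small sketch, reusing the target series defined above):
print(target.describe())  # count, mean, std, min, quartiles and max of the target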
Using Seaborn, plot a scatter plot between schooling years and target value.
Notice a number of outliers in this plot as well as in the box plot above.
sns.regplot(x='Schooling',y=target_name,data=df)
plt.show()
Try to plot target against a couple of other variables.
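For example, a couple of sketches using other feature columns (assuming the CSV column names match the feature names listed earlier, e.g. 'BMI' and 'Alcohol'):
sns.regplot(x='BMI',y=target_name,data=df)
plt.show()
sns.regplot(x='Alcohol',y=target_name,data=df)
plt.show()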
Let’s now view the correlation heatmap for the different features for some more inspiration.
Notice that some variables are negatively correlated with the target.
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # plain bool: np.bool is removed in recent NumPy versions
mask[np.triu_indices_from(mask)] = True #True values in upper triangle
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
sns.pairplot(df)
plt.show()
By looking at the above correlation plots, try to predict which variable will produce the least MSE when used in the linear regression model.
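To check your guess numerically, one option is to rank the features by the absolute value of their correlation with the target, reusing the corr matrix computed above (a minimal sketch):
corr_with_target = corr[target_name].drop(target_name)  # correlation of each feature with the target
print(corr_with_target.abs().sort_values(ascending=False))  # strongest (most linear) relationships first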
Now that we have a dataset loaded, and a feature of interest selected, we can try to fit a linear regression. Scikit-learn has methods for accomplishing this very simply. There are two steps involved:
Training the model: the linear regression model must be trained by supplying it with a sample feature set, and the corresponding targets.
Prediction: once the model is trained, you can provide it with a number of sample points and it will return its predicted targets.
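As a minimal sketch of these two steps (the toy_x/toy_y arrays are made-up numbers, only to illustrate the API):
toy_x = np.array([[1], [2], [3], [4]])     # sample feature set: one feature, four samples
toy_y = np.array([3, 5, 7, 9])             # corresponding targets (here y = 2x + 1)
toy_model = linear_model.LinearRegression()
toy_model.fit(toy_x, toy_y)                # step 1: training
print(toy_model.predict(np.array([[5]])))  # step 2: prediction, approximately [11.]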
After training the model, we can see the coefficients of the model; i.e., find the regression coefficient $a$ and the intercept $b$ in the (univariate) linear regression formula $y = ax + b$.
We can also calculate the mean squared error of the model points to give us something to compare models and feature choice.
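Concretely, for $n$ points the mean squared error is $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, the average squared gap between each target $y_i$ and its prediction $\hat{y}_i$; lower values indicate a better fit.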
Fill in the code necessary to load the dataset, train and visualise the linear regression.
#use already loaded dataset
feature_name = "Schooling" ## <-- Whichever feature you chose
# your code here (based on the code given in the initial part of this notebook)
#Output the coefficient
print('Coefficients: ', regr.coef_[0], regr.intercept_)
#Calculate the Mean Squared Error
mse = np.mean((predicted_y - data_y) ** 2)
print ("Mean squared error: %.2f" % mse)
plt.scatter(data_x,data_y)
plt.plot(data_x,predicted_y,color="red")
plt.show()
Coefficients: 2.3148995252546354 41.57747441152147
Mean squared error: 35.11
If done correctly, you should end up with a linear regression (i.e. a straight line) which approximately follows the shape of the scattered data. You can try using other features in the dataset to see how the linearity of the data affects the MSE.
Test vs. Training set¶
Typically, we do not want to test our model on the data with which we trained it, as this will not provide an accurate assessment of the model. However, we only have a limited amount of data to work with in this case. One way around this issue is to split the data into a 'training set' and a much smaller 'test set'. In this example, we'll take 30% of the dataset to use as our test set, and leave the rest for training.
The code below splits the feature and target data used above into test and training sets, rather than splitting the dataset as a whole.
How is the mean squared error (MSE) affected when we test with the test set? Why?
Try adjusting the test data size and see how the model accuracy is affected.
from sklearn.model_selection import train_test_split
# split data into training (70%) and test (30%) sets
train_x, test_x, train_y, test_y = train_test_split(data_x, data_y, test_size=0.30)
#Create linear regression model
regr = linear_model.LinearRegression()
#Train the model
regr.fit(train_x,train_y)
#Predict the targets
predicted_y = regr.predict(test_x)
#Output the coefficient
print('Coefficients: ', regr.coef_[0], regr.intercept_)
#Calculate the Mean Squared Error
mse = np.mean((predicted_y - test_y) ** 2)
print ("Mean squared error: %.2f" % mse)
plt.scatter(train_x,train_y)
plt.scatter(test_x,test_y, color="red")
plt.plot(test_x,predicted_y,color="red")
plt.show()
Coefficients: 2.3114876105419855 41.72983976884685
Mean squared error: 38.87
Other kinds of measures can also be used for linear regression:
from sklearn.metrics import explained_variance_score,mean_squared_error,r2_score
def performance_metrics(y_true,y_pred):
    mse = mean_squared_error(y_true,y_pred)  # note: this is the MSE itself, not its square root
    r2 = r2_score(y_true,y_pred)
    explained_var_score = explained_variance_score(y_true,y_pred)
    return mse,r2,explained_var_score
mse,r2,explained_var_score = performance_metrics(test_y,predicted_y)
print ("Mean squared error: %.2f" % mse)
print ("R2-score: %.2f" % r2)
# explained_variance_score is shown in lecture slide 96 as R-squared:
# 1-(variance(pred-target)/variance(pred-mean))
print ("Explained variance score: %.2f" % explained_var_score)
Mean squared error: 38.87
R2-score: 0.53
Explained variance score: 0.53
In the last part of this notebook, you will explore local regression applied to a variety of target functions. For this purpose, let's modify the code used in the first part of this lab, replacing linear regression with local regression. Run the code below and try answering the following question: for which target functions does local regression perform better than linear regression, and why?
from sklearn.neighbors import KNeighborsRegressor

def original(x):
    return x

def square(x):
    return x*x

def change_func(N,func,noise_factor,title,i):
    data_x = np.arange(1,N).reshape(-1,1) # Generate an array with values from 1 to N-1
    #.reshape(-1,1) makes 1 col and as many rows as data allows
    rdm = (np.random.rand(N-1)-0.5).reshape(-1,1) # Random noise: rand(N-1) returns N-1 numbers in [0,1), shifted to [-0.5,0.5)
    data_y = (func(data_x) + rdm*noise_factor) # different function wrt x, with noise
    # Create and train a local (k-nearest-neighbours) regression model
    regr = KNeighborsRegressor(n_neighbors=5)
    regr.fit(data_x,data_y)
    predicted_y = regr.predict(data_x)
    metric = r2_score(data_y,predicted_y)
    plt.subplot(2,2,i)
    plt.title(title+", R2="+str(np.round(metric,4)))
    plt.scatter(data_x,data_y,color="b",s=2)
    plt.scatter(data_x,predicted_y,color="r",s=1)
    plt.xticks([]), plt.yticks([])

plt.figure(1)
change_func(N,original,0,"y=x",1)
change_func(N,np.sqrt,0,"y=sqrt(x)",2)
change_func(N,square,0,"y=x**2",3)
change_func(N,np.log,0,"y=log(x)",4)
plt.show()
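If you want to compare the two approaches by number rather than by eye, a possible sketch is below; knn_vs_linear is a helper name introduced here (not part of the lab code), reusing original, square and the imports defined above, with no noise added:
def knn_vs_linear(func, N=100):
    x = np.arange(1, N).reshape(-1, 1)
    y = func(x)  # noise-free target values
    lin = linear_model.LinearRegression().fit(x, y)
    knn = KNeighborsRegressor(n_neighbors=5).fit(x, y)
    return r2_score(y, lin.predict(x)), r2_score(y, knn.predict(x))

for name, f in [("y=x", original), ("y=sqrt(x)", np.sqrt), ("y=x**2", square), ("y=log(x)", np.log)]:
    lin_r2, knn_r2 = knn_vs_linear(f)
    print(name, "linear R2=%.4f" % lin_r2, "k-NN R2=%.4f" % knn_r2)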