
Visualizations¶
1. Matplotlib
2. Seaborn
3. Bokeh
4. Plotly
Predictive Analytics¶
1. Linear Model (OLS)
2. Logistic Regression
3. Cluster Analysis
4. Decision Tree
5. Neural Nets
In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import preprocessing as pp
import statsmodels.formula.api as smf
%matplotlib inline
import matplotlib as mp
from matplotlib import pyplot as plt
import plotly as py
import plotly.graph_objects as go
import seaborn as sns

Matplotlib¶
Matplotlib is a Python 2D plotting library which produces various visualizations. This library has been, and still is, used extensively for scientific publications. In this class we will be using Matplotlib within Jupyter, but you can also use it in Python scripts, web app servers, etc.
You can find more graphs here: https://matplotlib.org/3.1.1/gallery/index.html#mplot3d-examples-index
In [3]:
#create a random dataset to visualize: a df with 1000 samples of floats from a standard normal distribution
#where A has a mean of 1, B has a mean of 0, and C has a mean of -1
df = pd.DataFrame({'A': np.random.randn(1000) + 1, 'B': np.random.randn(1000), 'C': np.random.randn(1000) - 1})
In [4]:
print (df.head())

A B C
0 0.984669 1.114679 -0.292141
1 0.282889 0.784411 -1.907893
2 0.389399 -0.754609 0.507971
3 0.541013 -0.322884 -0.151981
4 2.096325 0.274032 -2.179303
In [5]:
df['A'].mean()
df['B'].mean()
df['C'].mean()

df.shape
Out[5]:
(1000, 3)

Plotting a Line¶
In [8]:
#plotting a line on a graph

#create x axis coordinates
x_coord = [20, 30, 40]
#create y axis coordinates
y_coord = [4, 10, 5]
#plot both coordinates
plt.plot(x_coord,y_coord)
#create a label for x and y axes
plt.xlabel('X axis coordinates')
plt.ylabel('Y axis coordinates')
#give the graph a title
plt.title('Plotting a Line')
plt.show()

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
A) Try to graph two lines on the same plot. For the x and y axis coordinates, use the first 3 rows of columns A and B from df, so that the first line has column A as the x axis coordinates and column B as the y axis coordinates, while the second line has the opposite: column A as the y axis coordinates and column B as the x axis coordinates.
(hint: you can select the first 3 rows from df using loc)
In [10]:
### YOUR CODE HERE – TO BE GRADED
#it should look something like this: the lines may not be graphed like this since the random numbers change

Bar Graph¶
Bar graphs display categorical (group) data with rectangular bars whose heights or lengths are proportional to the values they represent. They are good for group comparison tasks.
In Python, the bar() function generates a bar graph when given the categories and their respective counts, along with some optional parameters such as color, etc.
In [26]:
#Survey: 145 people were asked "Which is the nicest fruit?" Let's graph the results of this survey using a bar graph.
fruits = ['Apple', 'Orange', 'Banana', 'Kiwifruit', 'Blueberry', 'Grapes']
answers = [35,30,10,25,40,5]
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
#plt.bar(fruits, answers, width = 0.9, color = colors)
plt.barh(fruits, answers, color = colors)
#for horizontal bars the counts go on the x axis and the categories on the y axis
plt.xlabel("People")
plt.ylabel("Fruit")
plt.title("The nicest fruits according to 145 respondents of our survey")
plt.show()


In [27]:
#Using the same dataset as above, plot two bar plots, one with vertical and the other with horizontal bars.
#The parameters for subplot are: number of rows, number of columns, and which plot this is. This is a 2 row, 1 column subplot
plt.subplot(2, 1, 1)
plt.bar(fruits, answers, width = 0.9, color = colors)
plt.subplot(2, 1, 2)
plt.barh(fruits, answers, color = colors)
plt.xlabel("People")
plt.ylabel("Fruit")
plt.title("The nicest fruits according to 145 respondents of our survey")
plt.tight_layout()
plt.show()

Histogram¶
Unlike bar graphs, a histogram plots the frequency of a given metric's occurrences in a continuous data set. The x axis is the metric itself while the y axis is the frequency of values falling in a particular range. Hence, a histogram is a graphical display of data using bars of different heights, where each bar groups numbers into ranges. Taller bars show that more data falls in that range.
In Python, the hist() function generates histograms and returns the bin counts or probabilities. Unlike bar graphs, histogram bins can't be reordered.
In [14]:
#overlaying 3 histograms
plt.hist(df['A'], bins=10, facecolor='red', alpha = 0.5, label = 'A')
plt.hist(df['B'], bins=10, facecolor='green', alpha = 0.5, label='B')
plt.hist(df['C'], bins=10, facecolor='purple', alpha = 0.5, label='C')
#add a legend to the graph using legend()
plt.legend()
#create axes with labels
plt.gca().set(title='Frequency Histogram of our three random columns', ylabel='Frequency');


In [31]:
#or you can call the plot() function on the df dataframe
#use alpha to set the strength of the color
df.plot(kind='hist', alpha = .7)
plt.show()

Pie chart¶
In [15]:
#plotting a pie chart using pie()
plt.pie(answers, labels = fruits, colors=colors)
plt.legend()
plt.show()

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
A) Choose any dataset that’s not being used already in this notebook and visualize it (any column(s) or the entire dataset) using any graph in matplotlib.
B) Make sure your graph has a title, x and y-axis labels, and has a color you selected and was not the default color.
In [ ]:
### YOUR CODE HERE – TO BE GRADED

Seaborn¶
Seaborn is a Python data visualization library based on matplotlib. It is static just like matplotlib, but it provides more in-depth, attractive, and informative statistical graphics than common matplotlib graphics.
See some of the examples: https://seaborn.pydata.org/examples/index.html
In [34]:
#setting figure size
plt.figure(figsize=(10,7), dpi= 80)
#define a simple function to plot some waves (trigonometry)
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)

plt.subplot(2, 2, 1)
sns.set_style("white")
sinplot()

plt.subplot(2, 2, 2)
sns.set_style("whitegrid")
sinplot()

plt.subplot(2, 2, 3)
sns.set_style("dark")
sinplot()

plt.subplot(2, 2, 4)
sns.set_style("darkgrid")
sinplot()

#sns.set_style("ticks") #adds ticks
#sns.set() # default is darkgrid
#sns.despine() #removes the top and right lines

#more options here: https://seaborn.pydata.org/tutorial/aesthetics.html

Histogram & Density plots¶
A density plot is a representation of the distribution of a numeric variable, where x is still the numeric value of the metric and y is the probability density. It uses a kernel density estimate to show the probability density function of the variable: roughly, how likely the metric is to take values near a specific number in the distribution. It is a smoothed version of the histogram and does not have the bars, or the related change in shape as we change the number of bins.
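As a minimal sketch of a standalone density curve (assuming the random df with columns A, B, C created earlier is still in memory), seaborn's kdeplot() draws just the smoothed estimate without the histogram bars; in newer seaborn releases it, together with histplot(), replaces the distplot() call used below.
In [ ]:
#a minimal kde-only sketch, assuming the random df with columns A, B, C from earlier is still in scope
sns.kdeplot(df['A'], color='red', label='A')     #smoothed density curve instead of binned counts
sns.kdeplot(df['B'], color='green', label='B')
plt.title('Kernel density estimates of two of the random columns')
plt.legend()
plt.show()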
In [42]:
#reading in the diamonds data
df = pd.read_csv('diamonds.csv')
df.head()
#creating 3 different Series, one per cut

x1 = df.loc[df.cut=='Ideal', 'depth']
x2 = df.loc[df.cut=='Fair', 'depth']
x3 = df.loc[df.cut=='Good', 'depth']
In [44]:
#This function combines the matplotlib hist function (with automatic calculation of a good default bin size)
#with the seaborn kdeplot() function, which plots a kernel density estimate.
#setting figure size
plt.figure(figsize=(10,7), dpi= 80)
#set up the three hists, with color and label; for x1 also set the linewidth and depth of color on the hist
sns.distplot(x1, color = 'dodgerblue', label="Ideal", hist_kws={"linewidth": 2, "alpha": 0.8})
sns.distplot(x2, color = 'green', label="Fair")
sns.distplot(x3, color = 'deeppink', label="Good")
#limit the x axis length
plt.xlim(50,75)
#plot the legend using labels
plt.legend()
plt.show()

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
A) Load tips dataset as pandas dataframe from: https://github.com/mwaskom/seaborn-data/blob/master/tips.csv
B) Print out the data type to make sure it is a dataframe and display the top 14 rows
C) Find the unique categories from time column
D) Create a Series dinner_tips that contains only the tip column from the tips dataframe for all the rows where time is Dinner. Create a Series lunch_tips that contains only the tip column for all the rows where time is Lunch.
E) Graph both dinner_tips and lunch_tips as histograms on one graph where dinner histogram is purple and the lunch is green. Make sure the graph has a legend and title.
In [ ]:
### YOUR CODE HERE – TO BE GRADED

Heatmap¶
Heatmaps are 2D graphical representations of data values in a data matrix represented by color.
In [48]:
#load the flights dataset
flights = sns.load_dataset("flights")
#look at the top 5 rows
display(flights.head())
#pivot the table, making the months the index, the years the columns, and the passenger count the values
#(newer pandas versions require keyword arguments: flights.pivot(index="month", columns="year", values="passengers"))
flights = flights.pivot("month", "year", "passengers")
display(flights.head())
type(flights)

   year     month  passengers
0  1949   January         112
1  1949  February         118
2  1949     March         132
3  1949     April         129
4  1949       May         121

year      1949  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960
month
January    112   115   145   171   196   204   242   284   315   340   360   417
February   118   126   150   180   196   188   233   277   301   318   342   391
March      132   141   178   193   236   235   267   317   356   362   406   419
April      129   135   163   181   235   227   269   313   348   348   396   461
May        121   125   172   183   229   234   270   318   355   363   420   472

Out[48]:
pandas.core.frame.DataFrame
In [51]:
#setting the size of the figure
plt.figure(figsize=(9,6), dpi= 80)
#plotting the heatmap: annot writes the data value in each cell, without it you will not have the numbers
sns.heatmap(flights, annot=True, fmt="d")
plt.show()

Scatterplot¶
Pairplots create two basic figures: histograms and scatter plots. The histogram on the diagonal allows us to see the distribution of a single variable, while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between each pair of variables.
In [52]:
#load the data
df = sns.load_dataset("iris")
#create the pairplot using species for the color coding
sns.pairplot(df, hue="species")
plt.show()

Which features seem to have the highest correlation?¶

Bokeh¶
Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.
The major concept of Bokeh is that graphs are built up one layer at a time. We start out by creating a figure, and then we add elements, called glyphs, to the figure. (For those who have used ggplot, the idea of glyphs is essentially the same as that of geoms which are added to a graph one ‘layer’ at a time.) Glyphs can take on many shapes depending on the desired use: circles, lines, patches, bars, arcs, and so on.
The design and use of Bokeh is based on Leland Wilkinson’s Grammar of Graphics (GoG).
Find more here: http://docs.bokeh.org/en/1.3.2/docs/gallery.html#gallery

Philosophy: Grammar of Graphics¶
The Grammar of Graphics is an idea of Leland Wilkinson. Its basic idea is that the way most people think about visualizing data is ad hoc and unsystematic, whereas there exists in fact a “formal language” for describing visual displays.
The reason why this idea is important and powerful in the context of our course is that it makes visualization more systematic, thereby making it easier to create those visualizations through code.
The high-level concept is simple:
1. Start with a (tidy) data set.
2. Transform it into a new (tidy) data set.
3. Map variables to geometric objects (e.g., bars, points, lines) or other aesthetic “flourishes” (e.g., color).
4. Rescale or transform the visual coordinate system.
5. Render and enjoy!


This image is “liberated” from: http://r4ds.had.co.nz/data-visualisation.html
In [53]:
from bokeh.io import output_notebook, show
output_notebook()

Loading BokehJS …

Line Graph¶
In [43]:
#Let’s illustrate the idea of glyphs by making a basic chart.
# First, we make a plot using the figure method and then we append our glyphs to the plot
# by calling the appropriate method and passing in data.

from bokeh.plotting import figure, output_file, show

# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# output to static HTML file
output_file("lines.html")

# create a new plot with a title and axis labels
p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')

# add a line renderer with legend and line thickness
p.line(x, y, legend_label="Temp.", line_width=2)

# show the results
show(p)

Bar Charts¶
In [44]:
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']

# Set the x_range to the list of categories above
p = figure(x_range=fruits, plot_height=250, title="Fruit Counts")

# Categorical values can also be used as coordinates
p.vbar(x=fruits, top=[5, 3, 4, 2, 4, 6], width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = 'gray'
p.y_range.start = 0

show(p)

Plotly¶
Unlike matplotlib and others, this is a visualization library that offers interactivity.
You need to install it first:
#pip
pip install plotly
#or
pip install plotly --upgrade

#conda
conda install plotly
Then you need to import plotly graph objects
import plotly.graph_objects as go
Find more examples here: https://plot.ly/python/creating-and-updating-figures/
In [65]:
#using Figure object
import plotly.graph_objects as go
fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.show()

In [45]:
#using FigureWidget object
import plotly.graph_objects as go
fig = go.FigureWidget(data=go.Bar(y=[2, 3, 1]))
fig

FigureWidget({
    'data': [{'type': 'bar', 'uid': '3b102a4b-db7c-4874-a393-4b519cf71eaa', 'y': [2, 3, 1]}],

Scatterplot¶
In [67]:
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(iris, x="sepal_width", y="sepal_length", color="species",
                 size='petal_length', hover_data=['petal_width'])
fig.show()

Bar Charts¶
In [68]:
import plotly.graph_objects as go
#creating a list of animals
animals=['giraffes', 'orangutans', 'monkeys']

#creating the bar chart
fig = go.Figure(data=[
    go.Bar(name='SF Zoo', x=animals, y=[20, 14, 23]), #creating the first set of bars
    go.Bar(name='LA Zoo', x=animals, y=[12, 18, 29])  #creating the second set of bars
])
#change the bar chart mode to grouped
fig.update_layout(barmode='group')
fig.show()

Apache Superset¶
Apache Superset is a data exploration and visualization web application. It offers:
• A GUI interface to explore and visualize datasets,
• An ability to create interactive dashboards,
• A GUI set of filters and drill downs,
• SQL connection to interact with the underlying data,
• A set of security measures (e.g., OpenID, LDAP, OAuth, REMOTE_USER), etc.
Link to the Documentation
Link to the Installation Guide

Predictive Analytics¶
1. Unsupervised Learning Methods: there is no y to predict, hence no training and testing; rather, these are methods to discover patterns.
• Clustering (e.g., k-means clustering (top down); hierarchical clustering (bottom up))
• Principal component analysis (PCA) (see the sketch after this list)
2. Supervised Learning Methods: these methods require that the model is trained on one subset of the data and tested on a separate subset of the same data, since we are trying to predict y from given x(es).
• Regression (e.g., simple/multiple linear, logistic)
• Classification (e.g., decision trees, neural nets)
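PCA is listed above but not demonstrated later in this notebook, so here is a minimal, hedged sketch on the iris dataset loaded via seaborn; the variable names are illustrative only.
In [ ]:
#a minimal PCA sketch on the iris dataset; illustration only, not part of the graded flow
from sklearn.decomposition import PCA

iris_df = sns.load_dataset('iris')
features = iris_df.drop(columns='species')          #keep only the numeric columns
pca = PCA(n_components=2)                           #reduce 4 features to 2 components
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)                #share of variance captured by each component
plt.scatter(components[:, 0], components[:, 1])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()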

Linear Regression (OLS model)¶
Linear Regression is a way of predicting a response (dependent variable) Y on the basis of one or more predictor variables (independent variables) X accounting for error e. Linear regressioin requires continous / numerical data. Hence, non continous data need to be dummy encoded. This model also doesn’t deal well with missing data, so we would need to remove missing data (e.g., remove rows with missing data; replace missing values using mean/mode, etc.).
Simple linear regression means we have only 1 X while multiple linear regression means we have more than 1 X-es. It is assumed that there is approximately a linear relationship between X and Y. Mathematically, we can represent this relationship as:
$Y = \beta_0 + \beta_1 * x + \varepsilon$
where Y is the DV, x is the IV, $\beta_0$** is the y-intercept, and $\beta_1$ is the slope of the line
Let’s try to predict the price of the housing based on x variable (we will decide which one it is as we explore the dataset).
Really great manual calculation exercise: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/
DV – continuous; IV – continuous
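Since the paragraph above mentions dummy encoding and missing-data handling, but the housing data below is already numeric and complete, here is a minimal sketch on a small hypothetical dataframe (the column names are made up for illustration).
In [ ]:
#a minimal preprocessing sketch on a hypothetical dataframe: fill a missing value and dummy encode a category
toy = pd.DataFrame({'sqft': [1400, 1600, None, 2000],
                    'neighborhood': ['A', 'B', 'A', 'C'],
                    'price': [240000, 310000, 280000, 420000]})
toy['sqft'] = toy['sqft'].fillna(toy['sqft'].mean())    #replace the missing value with the column mean
toy = pd.get_dummies(toy, columns=['neighborhood'])     #convert the category into binary 0/1 columns
print(toy)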
In [47]:
#read in the data
housing_df = pd.read_csv('USA_Housing.csv')
#look at the nrow and ncol
housing_df.shape
#get summary stats for the df
housing_df.describe()
#see the top 5 rows
housing_df.head()
#get the list of columns
housing_df.columns
Out[47]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
In [48]:
housing_df.head()
Out[48]:
   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price                                            Address
0      79545.458574             5.682861                   7.009188                          4.09     23086.800503  1.059034e+06  208 Michael Ferry Apt. 674\nLaurabury, NE 3701…
1      79248.642455             6.002900                   6.730821                          3.09     40173.072174  1.505891e+06  188 Johnson Views Suite 079\nLake Kathleen, CA…
2      61287.067179             5.865890                   8.512727                          5.13     36882.159400  1.058988e+06  9127 Elizabeth Stravenue\nDanieltown, WI 06482…
3      63345.240046             7.188236                   5.586729                          3.26     34310.242831  1.260617e+06                          USS Barnett\nFPO AP 44820
4      59982.197226             5.040555                   7.839388                          4.23     26354.109472  6.309435e+05                         USNS Raymond\nFPO AE 09386
In [49]:
#now let's explore price, our DV
a = sns.distplot(housing_df['Price'])
plt.show(a)


In [71]:
#let’s see the correlations of all numeric columns: we can see that price and average area income are pretty highly correlated
sns.pairplot(housing_df)
plt.show()


In [72]:
#we can see exact numbers with corr method
#correlation coefficient is an index that ranges from -1 to 1.
#When the value is 0 it means there is no correlation. Closer to 1 or -1 means higher positive or negative correlation
housing_df.corr()
Out[72]:
                              Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population     Price
Avg. Area Income                      1.000000            -0.002007                  -0.011032                      0.019788        -0.016234  0.639734
Avg. Area House Age                  -0.002007             1.000000                  -0.009428                      0.006149        -0.018743  0.452543
Avg. Area Number of Rooms            -0.011032            -0.009428                   1.000000                      0.462695         0.002040  0.335664
Avg. Area Number of Bedrooms          0.019788             0.006149                   0.462695                      1.000000        -0.022168  0.171071
Area Population                      -0.016234            -0.018743                   0.002040                     -0.022168         1.000000  0.408556
Price                                 0.639734             0.452543                   0.335664                      0.171071         0.408556  1.000000

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
A) Plot the correlations results as a heatmap
In [ ]:
### YOUR CODE HERE – TO BE GRADED

Which X variable should we use?¶
In [78]:
#let's now create a training and test set

#separate the x and y variables and leave out address since it is not a numeric variable
x = housing_df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = housing_df['Price']

#the goal is to create a model that generalises accurately to new data.
#we tune our model on the training set and apply it to the test set so our results are not contaminated
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(3500, 5)
(3500,)
(1500, 5)
(1500,)
In [79]:
#showcase how the content looks: not in order (see the index) because the rows were randomly selected
print(x_train.head(5))

Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms \
2654 86690.873301 6.259901 6.676265
2468 59866.947700 5.870330 5.899076
290 74372.138452 6.562380 8.184511
1463 61370.323490 6.529605 6.606744
4508 52652.652336 5.688943 7.217268

Avg. Area Number of Bedrooms Area Population
2654 3.23 42589.624391
2468 4.16 32064.597156
290 6.35 34321.960155
1463 4.30 20600.511000
4508 4.06 34776.585907
In [80]:
from sklearn.linear_model import LinearRegression
#create an empty linear model (lm)
lm = LinearRegression()
#pass in (fit) the training data to the model
lm.fit(x_train,y_train)
#grab some predictions from the test set
predictions = lm.predict(x_test)
#print 10 predictions for price
print(predictions[:10])
#print corresponding 10 actual prices
print(y_test[:10])

[1258934.89505291 822694.63411044 1742214.3953012 972937.0046516
994545.99157748 644486.33951947 1078071.03841126 854756.03225045
1445901.34459024 1203355.11911772]
1718 1.251689e+06
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
2866 7.300436e+05
2371 1.166925e+06
2952 7.054441e+05
45 1.499989e+06
4653 1.288199e+06
Name: Price, dtype: float64
In [81]:
plt.scatter(y_test, predictions)
plt.title('Prediction results vs. actual results')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
Out[81]:
Text(0, 0.5, 'Predicted Price')


In [82]:
print ("Mean squared error: %.1f" % mean_squared_error(y_test, predictions))

#get the r-squared: it's interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
#variance is a measure of how far observed values differ from their average
#the r2 score varies between 0 and 100%
#R^2 = (total variance explained by model) / (total variance). So if it is 100%, the model explains all of the variance in the DV.
print ('Coefficient of Determination (R-squared) or Variance score: %.2f' % r2_score(y_test, predictions))

Mean squared error: 10169125565.9
Coefficient of Determination (R-squared) or Variance score: 0.92
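As a minimal sketch of what r2_score computes (reusing y_test and predictions from the cell above), R-squared can equivalently be written as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean:
In [ ]:
#manual R-squared, reusing y_test and predictions from above
ss_res = np.sum((y_test - predictions) ** 2)            #sum of squared residuals
ss_tot = np.sum((y_test - y_test.mean()) ** 2)          #total sum of squares around the mean
print('Manual R-squared: %.2f' % (1 - ss_res / ss_tot))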
In [83]:
%matplotlib inline
import statsmodels.api as sm
#create a linear model using Ordinary Least Squares from the statsmodels library instead of the previous scikit-learn
#create the model and fit the data
model = sm.OLS(y_test, x_test).fit()
display (model.summary())

predictions = model.predict(x_test) # make the predictions by the model
plt.scatter(y_test, predictions)

                            OLS Regression Results
==============================================================================
Dep. Variable:                  Price   R-squared (uncentered):         0.965
Model:                            OLS   Adj. R-squared (uncentered):    0.965
Method:                 Least Squares   F-statistic:                    8252.
Date:                Mon, 23 Mar 2020   Prob (F-statistic):              0.00
Time:                        17:20:42   Log-Likelihood:               -20719.
No. Observations:                1500   AIC:                        4.145e+04
Df Residuals:                    1495   BIC:                        4.147e+04
Df Model:                           5
Covariance Type:            nonrobust
================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Avg. Area Income                 10.5040      0.488     21.537      0.000       9.547      11.461
Avg. Area House Age            5.439e+04   5217.462     10.425      0.000    4.42e+04    6.46e+04
Avg. Area Number of Rooms     -1.103e+04   5859.743     -1.882      0.060   -2.25e+04     468.931
Avg. Area Number of Bedrooms   5339.2248   5686.801      0.939      0.348   -5815.732    1.65e+04
Area Population                   7.3345      0.589     12.448      0.000       6.179       8.490
==============================================================================
Omnibus:                        0.359   Durbin-Watson:                  2.035
Prob(Omnibus):                  0.836   Jarque-Bera (JB):               0.326
Skew:                          -0.035   Prob(JB):                       0.849
Kurtosis:                       3.014   Cond. No.                    9.32e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.32e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Out[83]:

Clustering (k-means)¶
Grouping observations based on similarity. We are not predicting anything.
Features – continuous
In [86]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Let's create a sample dataset with 3 clusters
#f = features; l = labels
#n_samples is the total number of points, equally divided among the clusters: len(f) = 400
#centers is the number of centers to be generated
f, l = make_blobs(n_samples=400, n_features=2, centers=3)
#print(len(f))
#see the first 10 rows of the sample
print (f[:10])
#print the cluster membership for those 10 rows
print (l[:10])

[[ 2.14489301 9.44533546]
[ 6.71524763 -8.65286316]
[ 3.14673445 7.0529238 ]
[ 6.44669895 -8.51041765]
[ 3.0764521 7.41705062]
[ 2.14476438 7.66138016]
[ 3.89402702 8.97895703]
[ 4.08776116 7.49410285]
[-5.22283424 3.61836206]
[-5.55625091 5.45125355]]
[1 2 1 2 1 1 1 1 0 0]
In [98]:
#Plot it just to visualize it first
plt.rcParams['figure.figsize'] = (16, 9)
#f[:,0] is column 0 (the first feature) of each row

# 'bo' is a combination of the color='blue', marker='o' properties
#e.g. 'r+' = red pluses, 'go' = green circles, 'd' = diamonds
plt.plot(f[:,0],f[:,1],'bo')
plt.show()


In [96]:
#Run a cluster model
from sklearn import cluster
CLUSTERS = 3
#this creates an empty model
k_means = cluster.KMeans(n_clusters=CLUSTERS)
#fit supplies the data to be fit into the empty model
k_means.fit(f)

#this is the list of 400 cluster numbers
#print(k_means.labels_)
#these are the coordinates of cluster centers
#print(k_means.cluster_centers_)
#this was the initially assigned labels/groupings
#print(l)
Out[96]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [97]:
#plot cluster results
%matplotlib inline
labels = k_means.labels_
centroids = k_means.cluster_centers_

colors = ['green', 'blue', 'orange']
for i in range(CLUSTERS):
    #find the observations assigned to cluster i (first iteration: cluster 0)
    ds = f[np.where(labels==i)]
    # plot the data observations
    plt.plot(ds[:,0],ds[:,1],'o', color = colors[i])
    # plot the centroids
    plt.plot(centroids[i,0],centroids[i,1],'kx')
plt.show()

#score returns the negative within-cluster sum of squared distances (the smaller its magnitude, the tighter the clusters)
print(k_means.score(f))

-768.6928555843988
In [99]:
#use an elbow chart to help figure out how many clusters to use
#there is the elbow method, the silhouette method, the gap statistic, etc.
#the elbow looks at the within-cluster sum of squares, which is the variability
#of the observations in each cluster: the smaller, the less variable
def plot_elbow(data, cluster_cnt = 6):
    CLUSTERS = range(1, cluster_cnt)
    kmeans = [cluster.KMeans(n_clusters=i) for i in CLUSTERS]

    score = [kmeans[i].fit(data).score(data) for i in range(len(kmeans))]
    plt.plot(CLUSTERS, score)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Sum of Squared Distances')
    plt.title('Elbow Curve')
    plt.xticks(CLUSTERS)
    plt.show()

plot_elbow(f)


In [100]:
from yellowbrick.cluster import KElbowVisualizer

visualizer = KElbowVisualizer(k_means, k=(1,6))
visualizer.fit(f)
visualizer.show()


Out[100]:

In [101]:
#redo the same with hierarchical clustering
%matplotlib inline
#hierarchical clustering
x, y = make_blobs(n_samples=10, n_features=2, centers=3)
print (x)
print (y)

from scipy.cluster.hierarchy import dendrogram, linkage
#select the metric (e.g., euclidean) and method (e.g., single, complete, ward) you want
#each linkage row reads: [idx1, idx2, dist, sample_count]
#e.g., the first row shows which two observations were merged, at what distance, and the size of the resulting cluster
#more info: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
z = linkage(x, 'ward')
print(z)
dendrogram(z, leaf_rotation = 90, leaf_font_size=12)

[[10.28899231 -0.59039017]
[-2.67976732 -5.36646509]
[-5.5087723 9.83517143]
[ 7.1886078 -1.64264512]
[10.47495327 -0.85971987]
[-4.54423869 9.44935892]
[-4.99123226 6.99721265]
[-2.99421284 -6.751922 ]
[-3.18677563 -5.56700614]
[-2.73515089 -6.94826804]]
[2 0 1 2 2 1 1 0 0 0]
[[ 7. 9. 0.32506132 2. ]
[ 0. 4. 0.32729186 2. ]
[ 1. 8. 0.54522852 2. ]
[ 2. 5. 1.03883414 2. ]
[10. 12. 1.95876889 4. ]
[ 6. 13. 3.05451515 3. ]
[ 3. 11. 3.8365873 3. ]
[14. 16. 24.5322385 7. ]
[15. 17. 30.11137604 10. ]]
Out[101]:
{‘icoord’: [[15.0, 15.0, 25.0, 25.0],
[5.0, 5.0, 20.0, 20.0],
[35.0, 35.0, 45.0, 45.0],
[55.0, 55.0, 65.0, 65.0],
[40.0, 40.0, 60.0, 60.0],
[85.0, 85.0, 95.0, 95.0],
[75.0, 75.0, 90.0, 90.0],
[50.0, 50.0, 82.5, 82.5],
[12.5, 12.5, 66.25, 66.25]],
‘dcoord’: [[0.0, 1.0388341443644111, 1.0388341443644111, 0.0],
[0.0, 3.0545151472973866, 3.0545151472973866, 1.0388341443644111],
[0.0, 0.3250613224623256, 0.3250613224623256, 0.0],
[0.0, 0.5452285220360039, 0.5452285220360039, 0.0],
[0.3250613224623256,
1.9587688900357023,
1.9587688900357023,
0.5452285220360039],
[0.0, 0.32729185872557115, 0.32729185872557115, 0.0],
[0.0, 3.836587296324641, 3.836587296324641, 0.32729185872557115],
[1.9587688900357023,
24.532238502133634,
24.532238502133634,
3.836587296324641],
[3.0545151472973866,
30.11137604170229,
30.11137604170229,
24.532238502133634]],
‘ivl’: [‘6’, ‘2’, ‘5’, ‘7’, ‘9’, ‘1’, ‘8’, ‘3’, ‘0’, ‘4’],
‘leaves’: [6, 2, 5, 7, 9, 1, 8, 3, 0, 4],
‘color_list’: [‘g’, ‘g’, ‘r’, ‘r’, ‘r’, ‘c’, ‘c’, ‘b’, ‘b’]}

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
Create an example of either clustering (hierarchical or k-means) or linear regression with any data and with graphical depiction of the cluster or the regression.
In [ ]:
### YOUR CODE HERE – TO BE GRADED

Classification Algorithms¶
A) Logistic Regression
B) Decision Tree
C) Neural Network
All these methods share common steps:
• categorical data needs to be encoded as dummy data
• data needs to be split into training and test

A) Logistic Regression¶
Unlike the previous models, this model lets you set your own classification threshold and thus adjust the cost of false positives and false negatives.
This type of model requires us to dummy encode the data: take the categorical data and convert it into binary columns.
DV - binary/categorical; IV - categorical and continuous
A minimal sketch follows below.
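No logistic regression code appears elsewhere in this notebook, so here is a minimal, hedged sketch using scikit-learn's built-in breast cancer dataset (chosen only because it is self-contained); the 0.7 threshold is an arbitrary illustration of moving away from the default 0.5.
In [ ]:
#a minimal logistic regression sketch with a custom decision threshold; illustration only
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

cancer = load_breast_cancer()
Xl_train, Xl_test, yl_train, yl_test = train_test_split(cancer.data, cancer.target,
                                                        test_size=0.3, random_state=1)
logit = LogisticRegression(max_iter=5000)               #raise max_iter so the solver converges
logit.fit(Xl_train, yl_train)

#predict_proba returns class probabilities, so we can pick our own threshold instead of the default 0.5
probs = logit.predict_proba(Xl_test)[:, 1]
yl_pred = (probs >= 0.7).astype(int)                    #stricter threshold: fewer false positives, more false negatives
print('Accuracy at the 0.7 threshold:', metrics.accuracy_score(yl_test, yl_pred))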

B) Decision Tree¶
In this model, data is recursively split up based on the independent variables until the resulting set of rules is evaluated to be good enough.
It's very easy to implement and interpret when there is a manageable number of features. It also performs well with both categorical and continuous data, but due to the continuous tree splits it is a computationally heavy algorithm.
DV - categorical; IV - categorical and continuous
In [112]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
A) what is wrong with the loading of the pima dataframe below?
B) how can you fix it?
In [ ]:
### YOUR CODE HERE – TO BE GRADED

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
In [107]:
#correct data loading; the code is deleted to let you solve the issue above
Out[107]:
   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1
In [108]:
#split the dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols] # Features
X[:10]
y = pima.label # Target variable
y[:10]
Out[108]:
0 1
1 0
2 1
3 0
4 1
5 0
6 1
7 0
8 1
9 1
Name: label, dtype: int64
In [109]:
pima.head()
Out[109]:
   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1
In [110]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
In [113]:
# Create an empty Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
In [114]:
#highest numbers point to most important features
print(clf.feature_importances_)
print(X_test.columns)

[0.03944864 0.04987193 0.20032745 0.11854637 0.3018719 0.1413803
0.1485534 ]
Index(['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree'], dtype='object')

$$AC = \frac{TN+TP}{TN+FP+FN+TP}$$
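A minimal sketch connecting the formula above to the decision tree results: compute the same accuracy by hand from the confusion matrix counts, reusing y_test and y_pred from the cells above.
In [ ]:
#accuracy by hand from the confusion matrix, reusing y_test and y_pred from above
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Manual accuracy:', (tn + tp) / (tn + fp + fn + tp))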
In [115]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.670995670995671
In [116]:
#helper that prints a legend and returns confusion-matrix-based percentages of the test set
def cm_percent(cm, length, legend = True):
    import numpy as np
    if legend:
        print (' TN', 'FP\n', 'FN', 'TP')
    return np.ndarray(shape = (2,2), buffer = np.array([100 * (cm[0][0] + cm[1][1])/length,
        100 * cm[0][1]/length, 100 * cm[1][0]/length, 100 * (cm[1][0] + cm[0][1])/length]))

#this function helps us find the most important features
def important_features(model, columns):
    return pd.DataFrame(model.feature_importances_, columns=['Importance'], index = columns).sort_values(['Importance'], ascending = False)

y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print (cm)
print (cm_percent(cm, len(X_test)))
print (y_test.value_counts(), len(y_test))
print (important_features(clf, X_train.columns))
#TN = True negative = predicted correctly as negative
#FP = False positive = wrongly predicted as positive
#FN = False negative = wrongly predicted as negative
#TP = True positive = correctly predicted as postiive

[[114 32]
[ 44 41]]
TN FP
FN TP
[[67.0995671 13.85281385]
[19.04761905 32.9004329 ]]
0 146
1 85
Name: label, dtype: int64 231
Importance
glucose 0.301872
bmi 0.200327
pedigree 0.148553
bp 0.141380
age 0.118546
insulin 0.049872
pregnant 0.039449

C) Neural Nets¶
Neural nets try to simulate how the brain works: they create lots of individual neurons that take pieces of information as input (e.g., lots of images), process them (is it a dog?), and return an answer back (an output label). These outputs are then further processed by subsequent layers.
We start with an input layer that has as many nodes as we have features, then add one or more hidden layers that combine and transform those inputs, and end with an output layer that produces the prediction.
Tools & Resources:
1. Orange:
• download: https://orange.biolab.si/
• examples and tutorials: https://docs.biolab.si//3/visual-programming/index.html
• In class demo: example_for_class.ows
2. Deep Learning by Jon Krohn
• https://www.dropbox.com/s/r2fr16n1l4svo4x/ODSC-West_2019-10-30.pdf?dl=0
• https://www.deeplearningillustrated.com/
3. Jason Yosinski’s Deep Visualization Toolbox

• How to install: https://github.com/yosinski/deep-visualization-toolbox
• Tutorial: https://www.youtube.com/watch?v=AgkfIQ4IGaM
4. TensorFlow playground
• Deep playground is an interactive visualization of neural networks, written in TypeScript using d3.js. It lets you play with a real neural network running in your browser and click buttons and tweak parameters to see how it works.
• https://playground.tensorflow.org/
• Tensorflow is based on Theano and has been developed by Google, whereas PyTorch is based on Torch and has been developed by Facebook.
Shallow Net example: https://colab.research.google.com/drive/1h-pJ0oO2FXIS7wi8ODqZ-YnDJErDrPjs
Deep Net example: https://colab.research.google.com/drive/11ouSCU6wbUY5B4Q4vFStLyH8XUU8hqdN

TensorFlow Playground¶
A neural network is a function that learns the expected output for a given input from training datasets.

How does this work? For example, to build a neural network that recognizes images of a cat, you train the network with a lot of sample cat images. The resulting network works as a function that takes a cat image as input and outputs the “cat” label.
So let’s look at a simple classification problem.

We define a threshold to determine in which group each data point belongs.

Neural networks do the same, except we don't define the weights; the network learns the weights on its own. It has several neurons trying different functions, continually tweaking the weights to get to the best solution with the highest probability.
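To mirror the playground idea in code, here is a minimal, hedged sketch of a scikit-learn neural net with one hidden layer on the built-in iris dataset; this is an illustration only, not a TensorFlow example.
In [ ]:
#a minimal neural net sketch: one hidden layer of 10 neurons whose weights are learned from the data
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

iris_data = load_iris()
Xn_train, Xn_test, yn_train, yn_test = train_test_split(iris_data.data, iris_data.target,
                                                        test_size=0.3, random_state=1)
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=1)
net.fit(Xn_train, yn_train)
print('Test accuracy:', net.score(Xn_test, yn_test))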

YOUR INSTRUCTIONS ARE BELOW – TO BE GRADED¶
A) Show me a screenshot of your set up in TensorFlow
B) Show me a screenshot of your analysis in Orange
In [ ]:
### YOUR CODE HERE – TO BE GRADED

End of Notebook¶
In [ ]:
#to output the entire results without having to type print
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [ ]:
#add scrolling to the slides
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
    'width': 1024,
    'height': 768,
    'scroll': True,
})
In [ ]: