Hierarchical Clustering¶
Weather Station Clustering¶
Hierarchical clustering (hierarchical cluster analysis, HCA) is a cluster analysis method that seeks to build a hierarchy of clusters.
There are two general strategies for this:
Agglomerative (bottom up): Each data point begins in its own cluster, and clusters are merged as one moves up the hierarchy.
Divisive (top down): All data points begin in one cluster, and splits are performed recursively as one moves down the hierarchy.
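As a small, self-contained illustration of the agglomerative (bottom-up) strategy, the sketch below clusters a handful of made-up 2-D points with scikit-learn's AgglomerativeClustering; the points and the choice of two clusters are arbitrary and for illustration only, since the rest of this notebook uses SciPy instead.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# Toy 2-D points forming two well-separated groups (made-up data)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
# Agglomerative clustering: each point starts in its own cluster and the
# closest clusters are merged repeatedly until two clusters remain
model = AgglomerativeClustering(n_clusters=2, linkage='average')
print(model.fit_predict(points))  # e.g. [0 0 0 1 1 1] (label ids may be swapped)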
Updated version of ‘Weather Station Clustering’ by & Polong Lin
**Table of Contents**
Data Cleaning
Data Visualization
Partial Dataset Example
Data Preprocessing
Clustering
Cluster Visualization
Downloading necessary libraries¶
#!pip install numpy
#!pip install pandas
#!pip install scikit-learn
#!pip install scipy
On Windows you may need to download and “pip install” basemap module from http://www.lfd.uci.edu/~gohlke/pythonlibs/#basemap
Importing necessary libraries¶
import os
import csv
import random
import numpy as np
import pandas as pd
# Plotting
# Point PROJ at the conda-installed proj data so Basemap can locate it (Windows-specific path; adjust for your environment)
os.environ['PROJ_LIB'] = os.environ['CONDA_PREFIX'] + r'\pkgs\proj-7.1.1-h7d85306_3\Library\share\proj'
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# Machine Learning
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import euclidean
%matplotlib inline
We will be using data from Environment Canada about weather stations for the year of 2014.
Download data¶
The data that we will use in this notebook is currently hosted on box.ibm.com. We will download this file using the wget Python module. The code below will download the file and rename it to weather-stations20140101-20141231.csv and place it in the current working directory.
!pip install wget
import wget
filename = wget.download('https://ibm.box.com/shared/static/mv6g5p1wpmpvzoz6e5zgo47t44q8dvm0.csv', out='weather-stations20140101-20141231.csv')
print(filename)
100% [………………………………………………………………….] 129821 / 129821weather-stations20140101-20141231 (2).csv
Read in data¶
df = pd.read_csv('weather-stations20140101-20141231.csv')
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx Tn … DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 … 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 … 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 … 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
3 DISCOVERY ISLAND 48.425 -123.226 BC NaN NaN NaN 12.5 0.0 NaN … NaN NaN NaN NaN NaN NaN NaN NaN NaN 1012475
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 … 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573
5 rows × 25 columns
Data Structure¶
Here is the structure of the data imported:
Stn_Name:::: Station Name
Lat :::: Latitude (North + , degrees)
Long :::: Longitude (West – , degrees)
Prov :::: Province
Tm :::: Mean Temperature (°C)
DwTm :::: Days without Valid Mean Temperature
D :::: Mean Temperature difference from Normal (1981-2010) (°C)
Tx :::: Highest Monthly Maximum Temperature (°C)
DwTx :::: Days without Valid Maximum Temperature
Tn :::: Lowest Monthly Minimum Temperature (°C)
DwTn :::: Days without Valid Minimum Temperature
S :::: Snowfall (cm)
DwS :::: Days without Valid Snowfall
S%N :::: Percent of Normal (1981-2010) Snowfall
P :::: Total Precipitation (mm)
DwP :::: Days without Valid Precipitation
P%N :::: Percent of Normal (1981-2010) Precipitation
S_G :::: Snow on the ground at the end of the month (cm)
Pd :::: Number of days with Precipitation 1.0 mm or more
BS :::: Bright Sunshine (hours)
DwBS :::: Days without Valid Bright Sunshine
BS% :::: Percent of Normal (1981-2010) Bright Sunshine
HDD :::: Degree Days below 18 °C
CDD :::: Degree Days above 18 °C
Stn_No :::: Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).
NA :::: Not Available
Data Cleaning¶
We will only be doing light cleaning of this dataset.
Let's get some information regarding the nulls in the dataset with df.info():
df.info()
RangeIndex: 1341 entries, 0 to 1340
Data columns (total 25 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 Stn_Name 1341 non-null object
1 Lat 1341 non-null float64
2 Long 1341 non-null float64
3 Prov 1341 non-null object
4 Tm 1256 non-null float64
5 DwTm 1256 non-null float64
6 D 357 non-null float64
7 Tx 1260 non-null float64
8 DwTx 1260 non-null float64
9 Tn 1260 non-null float64
10 DwTn 1260 non-null float64
11 S 586 non-null float64
12 DwS 586 non-null float64
13 S%N 198 non-null float64
14 P 1227 non-null float64
15 DwP 1227 non-null float64
16 P%N 209 non-null float64
17 S_G 798 non-null float64
18 Pd 1227 non-null float64
19 BS 0 non-null float64
20 DwBS 0 non-null float64
21 BS% 0 non-null float64
22 HDD 1256 non-null float64
23 CDD 1256 non-null float64
24 Stn_No 1341 non-null object
dtypes: float64(22), object(3)
memory usage: 262.0+ KB
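As a quick complementary check (not part of the original cells), the per-column null counts implied by the summary above can be listed directly:
# Number of missing values in each column
df.isnull().sum()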
Drop some null rows/columns¶
We will drop the rows that are null in the Tm, Tn, and Tx columns (the xm and ym drops are left commented out because those columns are only added later, during visualization). We will also drop the BS, DwBS, and BS% columns because they contain no values.
# Drop rows with null 'Tm', 'Tn', 'Tx' (and optionally 'xm', 'ym') values
df = df[np.isfinite(df['Tm'])]
df = df[np.isfinite(df['Tn'])]
df = df[np.isfinite(df['Tx'])]
# df = df[np.isfinite(df['xm'])]
# df = df[np.isfinite(df['ym'])]
# Drop the BS, DwBS, and BS% columns
df.drop(['BS','DwBS','BS%'], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
Int64Index: 1255 entries, 0 to 1340
Data columns (total 22 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 Stn_Name 1255 non-null object
1 Lat 1255 non-null float64
2 Long 1255 non-null float64
3 Prov 1255 non-null object
4 Tm 1255 non-null float64
5 DwTm 1255 non-null float64
6 D 357 non-null float64
7 Tx 1255 non-null float64
8 DwTx 1255 non-null float64
9 Tn 1255 non-null float64
10 DwTn 1255 non-null float64
11 S 511 non-null float64
12 DwS 511 non-null float64
13 S%N 182 non-null float64
14 P 1143 non-null float64
15 DwP 1143 non-null float64
16 P%N 193 non-null float64
17 S_G 733 non-null float64
18 Pd 1143 non-null float64
19 HDD 1255 non-null float64
20 CDD 1255 non-null float64
21 Stn_No 1255 non-null object
dtypes: float64(19), object(3)
memory usage: 225.5+ KB
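For reference, here is a minimal sketch of an equivalent cleanup using pandas' dropna/drop chaining (shown commented out, since the cells above have already modified df):
# Alternative, equivalent cleanup (not run here):
# df = (df.dropna(subset=['Tm', 'Tn', 'Tx'])
#         .drop(columns=['BS', 'DwBS', 'BS%'])
#         .reset_index(drop=True))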
Data Visualization¶
plt.figure(figsize=(14,10))
Long = [-140,-50] # Longitude Range
Lat = [40,65] # Latitude Range
# Keep only stations with long/lat in the above range
df = df[(df['Long'] > Long[0]) &
        (df['Long'] < Long[1]) &
        (df['Lat'] > Lat[0]) &
        (df['Lat'] < Lat[1])]
# Create basemap map
my_map = Basemap(
    projection='merc',
    resolution='l',
    area_thresh=1000.0,
    llcrnrlon=Long[0],  # Lower-left corner longitude
    urcrnrlon=Long[1],  # Upper-right corner longitude
    llcrnrlat=Lat[0],   # Lower-left corner latitude
    urcrnrlat=Lat[1]    # Upper-right corner latitude
)
# Basemap map drawing parameters
my_map.drawcoastlines()
my_map.drawcountries()
my_map.drawmapboundary()
my_map.fillcontinents(color='green',alpha=0.3)
my_map.shadedrelief()
# Get x,y position of points on map using my_map
my_longs = df.Long.values
my_lats = df.Lat.values
X,Y = my_map(my_longs, my_lats)
# Add x,y to dataframe
df['xm'] = X
df['ym'] = Y
# Draw weather stations on map:
for (x, y) in zip(X, Y):
    my_map.plot(x, y,
                markerfacecolor=([1, 0, 0]),
                marker='o',
                markersize=5,
                alpha=0.75)
Partial Dataset Example¶
Let's try hierarchical clustering on a random sample of 30 points from the dataset and plot the results.
n_samples = 30 # samples to grab
sDF = df.sample(n=n_samples)
sDF = sDF.reset_index(drop=True)
Preprocessing¶
nTemp = normalize(sDF.Tm.values.reshape(1, -1), axis=1)
nTemp = nTemp[0]  # take the single normalized row as a 1-D array
array([-0.19738362, -0.10648327, -0.10778185, -0.31944981, -0.01428434,
-0.1986822 , -0.01038861, -0.21296654, -0.04934591, -0.14284341,
0.07012313, -0.2701039 , -0.21686227, -0.17011352, -0.10518469,
-0.18959216, -0.21296654, -0.14154483, 0.05194306, -0.05064448,
-0.20257793, -0.26620817, -0.19998078, -0.27270106, 0.02727011,
-0.29477686, -0.2649096 , -0.03506156, -0.22465373, -0.14414199])
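Note that normalize with axis=1 rescales the whole temperature vector to unit L2 norm rather than standardizing individual values; a quick sanity check on the array above:
# The normalized temperature vector should have (approximately) unit L2 norm
print(np.linalg.norm(nTemp))  # expected to be ~1.0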
Calculate element-wise temperature differences¶
# Allocate an empty matrix to fill with element-wise differences
D = np.zeros([nTemp.size,nTemp.size])
# Find all element wise temp differences
for i in range(nTemp.size):
for j in range(nTemp.size):
D[i,j] = abs(nTemp[i]-nTemp[j])
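The same difference matrix can be built without the double Python loop by broadcasting; a sketch that should match D exactly:
# Vectorized equivalent of the loop above
nTemp_flat = np.asarray(nTemp).ravel()
D_vec = np.abs(nTemp_flat[:, None] - nTemp_flat[None, :])
print(np.allclose(D, D_vec))  # expected: True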
Hierarchical Clustering¶
Y = sch.linkage(D, method='centroid')
C:\Users\roman\AppData\Local\Temp/ipykernel_14324/251272051.py:1: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
Y = sch.linkage(D, method='centroid')
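The warning appears because sch.linkage is given a full square distance matrix, which it treats as a matrix of raw observations rather than distances. One way to make the intent explicit is to pass the condensed form via scipy.spatial.distance.squareform; a sketch, stored under a different name so the original Y above is untouched (note that centroid linkage is formally defined for raw Euclidean observations, so results on a precomputed distance matrix should be read with that caveat):
# Pass the condensed (upper-triangular) distance vector to silence the warning
from scipy.spatial.distance import squareform
Y_alt = sch.linkage(squareform(D, checks=False), method='centroid')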
Plot 1st Dendrogram¶
fig = plt.figure(figsize=(12,12))
ax1 = fig.add_axes([0.1,0.1,0.4,0.6])
Z1 = sch.dendrogram(Y, orientation='right')
labels = list(zip(map(lambda x: round(x, 2),
                      sDF['Tm'][Z1['leaves']]),
                  sDF['Stn_Name'][Z1['leaves']]))
ax1.set_xticks([])
ax1.set_yticklabels(labels)
plt.plot()  # suppress text output
Data Preprocessing¶
Now we will continue with hierarchical clustering on the entire dataset. First, we will normalize the data:
X = normalize(df.xm.values.reshape(1, -1), axis=1)[0]
Y = normalize(df.ym.values.reshape(1, -1), axis=1)[0]
Tm = normalize(df.Tm.values.reshape(1, -1), axis=1)[0]
Tn = normalize(df.Tn.values.reshape(1, -1), axis=1)[0]
Tx = normalize(df.Tx.values.reshape(1, -1), axis=1)[0]
# Note: the per-feature vectors above are superseded by the row-wise normalization below
# data = zip(Tm, Tn, Tx, X, Y)
# Normalize each station's (Tm, Tn, Tx, xm, ym) row to unit norm
data = normalize(df[['Tm','Tn','Tx','xm','ym']].values, axis=1)
print('Shape:',data.shape)
print(data)
Shape: (1188, 5)
[[ 3.58976212e-06 4.37775868e-07 5.90997422e-06 7.91413958e-01
6.11280580e-01]
[ 3.12720065e-06 -1.34022885e-06 6.70114424e-06 7.88201573e-01
6.15417160e-01]
[ 3.02754010e-06 -1.11306621e-06 7.12362377e-06 7.89536091e-01
6.13704131e-01]
[-2.66011766e-06 -3.52138526e-06 -1.47178641e-06 9.38458983e-01
3.45390701e-01]
[-2.42555457e-06 -3.45747910e-06 -6.17027039e-07 9.65771991e-01
2.59392487e-01]
[-3.29796352e-06 -5.57201057e-06 -1.61921675e-06 9.68224599e-01
2.50082240e-01]]
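As the printout shows, the row-wise normalization leaves the temperature features several orders of magnitude smaller than the projected xm/ym coordinates, so the map coordinates dominate the pairwise distances. A common alternative (not used in this notebook) is to scale each feature column separately, for example with scikit-learn's StandardScaler; a sketch:
# Alternative preprocessing (not used below): per-column standardization
from sklearn.preprocessing import StandardScaler
data_scaled = StandardScaler().fit_transform(df[['Tm','Tn','Tx','xm','ym']].values)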
Calculate tuple-wise distances¶
D = np.zeros([data.shape[0],data.shape[0]])
for i in range(data.shape[0]):
for j in range (data.shape[0]):
D[i,j] = euclidean(data[i],data[j])
array([[0. , 0.00523743, 0.00306594, ..., 0.30384152, 0.39271612,
0.40215202],
[0.00523743, 0. , 0.0021715 , ..., 0.30901712, 0.39785025,
0.4072811 ],
[0.00306594, 0.0021715 , 0. , ..., 0.30687151, 0.39572191,
0.40515486],
[0.30384152, 0.30901712, 0.30687151, ..., 0. , 0.09023133,
0.09984836],
[0.39271612, 0.39785025, 0.39572191, ..., 0.09023133, 0. ,
0.00962788],
[0.40215202, 0.4072811 , 0.40515486, ..., 0.09984836, 0.00962788,
0. ]])
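The nested loop above runs in O(n²) Python iterations; SciPy can compute the same pairwise Euclidean distances in vectorized form. A sketch that should reproduce D:
# Vectorized pairwise Euclidean distances
from scipy.spatial.distance import pdist, squareform
D_square = squareform(pdist(np.asarray(data), metric='euclidean'))
print(np.allclose(D, D_square))  # expected: True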
Clustering¶
Y = sch.linkage(D, method='centroid')
C:\Users\roman\AppData\Local\Temp/ipykernel_14324/251272051.py:1: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
Y = sch.linkage(D, method='centroid')
Cluster Visualization¶
Visualize Dendrograms¶
We will visualize the dendrogram as before, this time for the full dataset rather than the 30-station sample.
fig = plt.figure(figsize=(12,12))
ax1 = fig.add_axes([0.1,0.1,0.4,0.6])
# Get dendrograms
Z1 = sch.dendrogram(Y, orientation='right')
# Get labels
labels = list(zip(
    map(lambda x: round(x, 2), df.Tx.iloc[Z1['leaves']]),
    map(lambda x: round(x, 2), df.Tm.iloc[Z1['leaves']]),
    map(lambda x: round(x, 2), df.Tn.iloc[Z1['leaves']]),
    df['Stn_Name'].iloc[Z1['leaves']],
    map(lambda x: round(x, 2), df.Lat.iloc[Z1['leaves']]),
    map(lambda x: round(x, 2), df.Long.iloc[Z1['leaves']])
))
ax1.set_xticks([])
ax1.set_yticklabels(labels)
plt.plot()
Get labels¶
Get the clustering results and append them to the dataframe:
labels = sch.fcluster(Y, 0.8*D.max(), 'distance')
df["hier_Clusters"]=labels-1
df[["Stn_Name","Tx","Tm","hier_Clusters"]].head()
Stn_Name Tx Tm hier_Clusters
0 CHEMAINUS 13.5 8.2 22
1 COWICHAN LAKE FORESTRY 15.0 7.0 22
2 LAKE COWICHAN 16.0 6.8 22
3 DUNCAN KELVIN CREEK 14.5 7.7 22
4 ESQUIMALT HARBOUR 13.1 8.8 23
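With the 'distance' criterion, fcluster cuts the dendrogram at the given height (here the heuristic 0.8*D.max()), and the number of clusters follows from that cut. If a fixed number of clusters is preferred, the 'maxclust' criterion can be used instead; a sketch with a made-up target of 10 clusters:
# Alternative (not used below): request a fixed number of clusters
k = 10  # hypothetical target number of clusters
labels_k = sch.fcluster(Y, k, criterion='maxclust')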
Visualize on Map¶
Now we will visualize the clustering results on an actual map.
plt.figure(figsize=(14,10))
Long = [-140,-50] # Longitude Range
Lat = [40,65] # Latitude Range
# The dataframe was already restricted to this long/lat range above,
# so this filter is effectively a no-op kept for completeness
df = df[(df['Long'] > Long[0]) &
        (df['Long'] < Long[1]) &
        (df['Lat'] > Lat[0]) &
        (df['Lat'] < Lat[1])]
# Create basemap map
my_map = Basemap(
    projection='merc',
    resolution='l',
    area_thresh=1000.0,
    llcrnrlon=Long[0],  # Lower-left corner longitude
    urcrnrlon=Long[1],  # Upper-right corner longitude
    llcrnrlat=Lat[0],   # Lower-left corner latitude
    urcrnrlat=Lat[1]    # Upper-right corner latitude
)
# Basemap map drawing parameters
my_map.drawcoastlines()
my_map.drawcountries()
my_map.drawmapboundary()
my_map.fillcontinents(color='green',alpha=0.3)
my_map.shadedrelief()
# Create color map
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, len(set(df.hier_Clusters.values))))
# Plot x,y points, and color by cluster
for index, row in df.iterrows():
    my_map.plot(row.xm, row.ym,
                markerfacecolor=colors[int(row.hier_Clusters)],
                marker='o',
                markersize=5,
                alpha=0.75)
# Label clusters
for i in range(len(set(labels))):
cluster = df[df.hier_Clusters == i][["Stn_Name","Tm","xm","ym","hier_Clusters"]]
# Get centroid of cluster
xc = np.mean(cluster.xm)
yc = np.mean(cluster.ym)
# Get mean temp of cluster
Tavg = np.mean(cluster.Tm)
# label cluster on map
plt.text(xc,yc,str(i),fontsize=30,color='red')
# Print average temperatures
print ("Cluster "+str(i)+', Avg Temp: '+ str(np.mean(cluster.Tm)))
Cluster 0, Avg Temp: -17.86
Cluster 1, Avg Temp: -14.65
Cluster 2, Avg Temp: -13.15
Cluster 3, Avg Temp: -12.866666666666667
Cluster 4, Avg Temp: -13.799999999999999
Cluster 5, Avg Temp: 2.7624999999999993
Cluster 6, Avg Temp: -2.055555555555555
Cluster 7, Avg Temp: -4.45
Cluster 8, Avg Temp: -7.336363636363635
Cluster 9, Avg Temp: -7.825
Cluster 10, Avg Temp: -6.111111111111111
Cluster 11, Avg Temp: -17.680769230769222
Cluster 12, Avg Temp: -10.972222222222223
Cluster 13, Avg Temp: -14.364687499999999
Cluster 14, Avg Temp: -20.552727272727278
Cluster 15, Avg Temp: -22.6125
Cluster 16, Avg Temp: -23.210526315789473
Cluster 17, Avg Temp: -4.43793103448276
Cluster 18, Avg Temp: -8.569230769230769
Cluster 19, Avg Temp: -6.422727272727273
Cluster 20, Avg Temp: -2.252380952380952
Cluster 21, Avg Temp: 1.0960784313725487
Cluster 22, Avg Temp: -4.0606060606060606
Cluster 23, Avg Temp: -4.856701030927836
Cluster 24, Avg Temp: -12.170491803278688
Cluster 25, Avg Temp: -18.693478260869565
Cluster 26, Avg Temp: -10.239436619718312
Cluster 27, Avg Temp: -7.618421052631579
Cluster 28, Avg Temp: -10.95
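The per-cluster averages printed above can also be obtained in one line with a pandas groupby; this sketch should reproduce the same numbers:
# Mean monthly temperature per cluster (should match the printed values)
print(df.groupby('hier_Clusters')['Tm'].mean())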