Hierarchical Clustering¶
Weather Station Clustering¶

Hierarchical clustering, also called hierarchical cluster analysis (HCA), is a method of cluster analysis that seeks to build a hierarchy of clusters.

There are 2 general strategies for this:

Agglomerative (bottom up): Each datapoint begins in its own cluster, and these clusters merge with other clusters as one moves up the hierarchy.
Divisive (top down): All datapoints begin in a single cluster, and splits are performed recursively as one moves down the hierarchy.
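
The agglomerative strategy can be sketched on toy data with SciPy. This is a minimal illustration rather than the notebook's own pipeline: the `points` array, the 'average' linkage method, and the choice of two clusters are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy observations: two well-separated pairs of 2-D points
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Bottom-up merging: every point starts as its own cluster and the two
# closest clusters are merged at each step. Z has one row per merge.
Z = linkage(points, method='average')

# Cut the hierarchy so that at most two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Each row of `Z` records which two clusters were merged, the distance between them, and the size of the new cluster, so the full hierarchy is available even after cutting it at one level.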

Updated version of ‘Weather Station Clustering’ by & Polong Lin

**Table of Contents**

Data Cleaning
Data Visualization
Partial Dataset Example
Data Preprocessing
Clustering
Cluster Visualization

Downloading necessary libraries¶

#!pip install numpy
#!pip install pandas
#!pip install scikit-learn
#!pip install scipy

On Windows, you may need to download the basemap module from http://www.lfd.uci.edu/~gohlke/pythonlibs/#basemap and install it with "pip install".

Importing necessary libraries¶

import os
import csv
import random
import numpy as np
import pandas as pd

# Plotting
# Point PROJ_LIB at the conda PROJ data directory before importing Basemap.
# The package directory name below is specific to the original environment.
os.environ['PROJ_LIB'] = os.environ['CONDA_PREFIX'] + r'\pkgs\proj-7.1.1-h7d85306_3\Library\share\proj'
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Machine Learning
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import euclidean

%matplotlib inline

We will be using data from Environment Canada about weather stations for the year of 2014.

Download data¶
The data that we will use in this notebook is currently hosted on box.ibm.com. We will download this file using the wget Python module. The code below will download the file and rename it to weather-stations20140101-20141231.csv and place it in the current working directory.

!pip install wget
import wget

filename = wget.download('https://ibm.box.com/shared/static/mv6g5p1wpmpvzoz6e5zgo47t44q8dvm0.csv', out='weather-stations20140101-20141231.csv')
print(filename)

100% [………………………………………………………………….] 129821 / 129821
weather-stations20140101-20141231 (2).csv

Read in data¶

df = pd.read_csv('weather-stations20140101-20141231.csv')
df.head()

   Stn_Name                Lat     Long      Prov  Tm   DwTm  D    Tx    DwTx  Tn    …  DwP  P%N    S_G  Pd    BS   DwBS  BS%  HDD    CDD  Stn_No
0  CHEMAINUS               48.935  -123.742  BC    8.2  0.0   NaN  13.5  0.0   1.0   …  0.0  NaN    0.0  12.0  NaN  NaN   NaN  273.3  0.0  1011500
1  COWICHAN LAKE FORESTRY  48.824  -124.133  BC    7.0  0.0   3.0  15.0  0.0   -3.0  …  0.0  104.0  0.0  12.0  NaN  NaN   NaN  307.0  0.0  1012040
2  LAKE COWICHAN           48.829  -124.052  BC    6.8  13.0  2.8  16.0  9.0   -2.5  …  9.0  NaN    NaN  11.0  NaN  NaN   NaN  168.1  0.0  1012055
3  DISCOVERY ISLAND        48.425  -123.226  BC    NaN  NaN   NaN  12.5  0.0   NaN   …  NaN  NaN    NaN  NaN   NaN  NaN   NaN  NaN    NaN  1012475
4  DUNCAN KELVIN CREEK     48.735  -123.728  BC    7.7  2.0   3.4  14.5  2.0   -1.0  …  2.0  NaN    NaN  11.0  NaN  NaN   NaN  267.7  0.0  1012573

5 rows × 25 columns

Data Structure¶
Here is the structure of the data imported:

Stn_Name : Station Name
Lat      : Latitude (North +, degrees)
Long     : Longitude (West -, degrees)
Prov     : Province
Tm       : Mean Temperature (°C)
DwTm     : Days without Valid Mean Temperature
D        : Mean Temperature difference from Normal (1981-2010) (°C)
Tx       : Highest Monthly Maximum Temperature (°C)
DwTx     : Days without Valid Maximum Temperature
Tn       : Lowest Monthly Minimum Temperature (°C)
DwTn     : Days without Valid Minimum Temperature
S        : Snowfall (cm)
DwS      : Days without Valid Snowfall
S%N      : Percent of Normal (1981-2010) Snowfall
P        : Total Precipitation (mm)
DwP      : Days without Valid Precipitation
P%N      : Percent of Normal (1981-2010) Precipitation
S_G      : Snow on the ground at the end of the month (cm)
Pd       : Number of days with Precipitation 1.0 mm or more
BS       : Bright Sunshine (hours)
DwBS     : Days without Valid Bright Sunshine
BS%      : Percent of Normal (1981-2010) Bright Sunshine
HDD      : Degree Days below 18 °C
CDD      : Degree Days above 18 °C
Stn_No   : Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically)
NA       : Not Available

Data Cleaning¶
We will only be doing light cleaning of this dataset.

Let's get some information regarding the nulls in the dataset:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1341 entries, 0 to 1340
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Stn_Name 1341 non-null object
1 Lat 1341 non-null float64
2 Long 1341 non-null float64
3 Prov 1341 non-null object
4 Tm 1256 non-null float64
5 DwTm 1256 non-null float64
6 D 357 non-null float64
7 Tx 1260 non-null float64
8 DwTx 1260 non-null float64
9 Tn 1260 non-null float64
10 DwTn 1260 non-null float64
11 S 586 non-null float64
12 DwS 586 non-null float64
13 S%N 198 non-null float64
14 P 1227 non-null float64
15 DwP 1227 non-null float64
16 P%N 209 non-null float64
17 S_G 798 non-null float64
18 Pd 1227 non-null float64
19 BS 0 non-null float64
20 DwBS 0 non-null float64
21 BS% 0 non-null float64
22 HDD 1256 non-null float64
23 CDD 1256 non-null float64
24 Stn_No 1341 non-null object
dtypes: float64(22), object(3)
memory usage: 262.0+ KB
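
The same null information can also be pulled out programmatically. Below is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and `toy`, `null_counts`, and `empty_columns` are names chosen for this example) that counts nulls per column and lists the columns that hold no values at all:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking a few columns of the weather data
toy = pd.DataFrame({
    "Tm": [8.2, np.nan, 6.8],
    "BS": [np.nan, np.nan, np.nan],
    "Prov": ["BC", "BC", "BC"],
})

# Number of null entries in each column
null_counts = toy.isna().sum()

# Columns in which every entry is null
empty_columns = null_counts[null_counts == len(toy)].index.tolist()

print(null_counts)
print(empty_columns)
```

Applied to the full dataframe, a check like this would single out BS, DwBS, and BS%, which the listing above reports as having 0 non-null values.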

Drop some null rows/columns¶
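We will drop the rows that have null values in the Tm, Tn, and Tx columns. The same filter for xm and ym is left commented out in the code because those columns are only added later, in the Data Visualization step. We will also drop the BS, DwBS, and BS% columns because they contain no values.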

# Drop rows with null 'Tm', 'Tn', 'Tx' (the 'xm' and 'ym' filters stay commented out)
df = df[np.isfinite(df['Tm'])]
df = df[np.isfinite(df['Tn'])]
df = df[np.isfinite(df['Tx'])]
# df = df[np.isfinite(df['xm'])]
# df = df[np.isfinite(df['ym'])]

# Drop BS, DwBS, and BS% columns
df.drop(['BS','DwBS','BS%'], axis=1, inplace=True)

df.reset_index(drop=True, inplace=True)


Int64Index: 1255 entries, 0 to 1340
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Stn_Name 1255 non-null object
1 Lat 1255 non-null float64
2 Long 1255 non-null float64
3 Prov 1255 non-null object
4 Tm 1255 non-null float64
5 DwTm 1255 non-null float64
6 D 357 non-null float64
7 Tx 1255 non-null float64
8 DwTx 1255 non-null float64
9 Tn 1255 non-null float64
10 DwTn 1255 non-null float64
11 S 511 non-null float64
12 DwS 511 non-null float64
13 S%N 182 non-null float64
14 P 1143 non-null float64
15 DwP 1143 non-null float64
16 P%N 193 non-null float64
17 S_G 733 non-null float64
18 Pd 1143 non-null float64
19 HDD 1255 non-null float64
20 CDD 1255 non-null float64
21 Stn_No 1255 non-null object
dtypes: float64(19), object(3)
memory usage: 225.5+ KB
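
The row and column drops above can also be written with pandas' own null-handling methods. This is a sketch on a hypothetical toy frame, not a replacement for the notebook's code; `toy` and `cleaned` are names chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few of the weather columns
toy = pd.DataFrame({
    "Tm": [8.2, np.nan, 6.8, 7.7],
    "Tn": [1.0, -3.0, np.nan, -1.0],
    "Tx": [13.5, 15.0, 16.0, 14.5],
    "BS": [np.nan, np.nan, np.nan, np.nan],
})

cleaned = (
    toy.dropna(subset=["Tm", "Tn", "Tx"])  # keep rows where all three are present
       .drop(columns=["BS"])               # remove the all-null column
       .reset_index(drop=True)             # renumber rows from 0
)
print(cleaned)
```

Chaining the calls returns a new frame instead of modifying the original in place, which makes it easier to rerun the cell without side effects.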

Data Visualization¶

plt.figure(figsize=(14,10))

Long = [-140,-50] # Longitude Range
Lat = [40,65] # Latitude Range

# Query dataframe for long/lat in the above range
df = df[(df['Long'] > Long[0])
        & (df['Long'] < Long[1])
        & (df['Lat'] > Lat[0])
        & (df['Lat'] < Lat[1])]

# Create basemap map
my_map = Basemap(
    projection='merc',
    resolution='l',
    area_thresh=1000.0,
    llcrnrlon=Long[0],  # Lower-left corner longitude
    urcrnrlon=Long[1],  # Upper-right corner longitude
    llcrnrlat=Lat[0],   # Lower-left corner latitude
    urcrnrlat=Lat[1]    # Upper-right corner latitude
)

# Basemap map drawing parameters
my_map.drawcoastlines()
my_map.drawcountries()
my_map.drawmapboundary()
my_map.fillcontinents(color='green', alpha=0.3)
my_map.shadedrelief()

# Get x,y position of points on map using my_map
my_longs = df.Long.values
my_lats = df.Lat.values
X, Y = my_map(my_longs, my_lats)

# Add x,y to dataframe
df['xm'] = X
df['ym'] = Y

# Draw weather stations on map:
for (x, y) in zip(X, Y):
    my_map.plot(x, y, markerfacecolor=([1,0,0]), marker='o', markersize=5, alpha=0.75)

Partial Dataset Example¶
Let's try hierarchical clustering on a random sample of 30 points from the dataset and plot the results.

n_samples = 30 # samples to grab
sDF = df.sample(n=n_samples)
sDF = sDF.reset_index(drop=True)

Preprocessing¶

# Reshape to a single row (a plain 2D array instead of np.matrix)
nTemp = normalize(sDF.Tm.values.reshape(1, -1), axis=1)
nTemp = nTemp[0] # convert to 1D array
nTemp

array([-0.19738362, -0.10648327, -0.10778185, -0.31944981, -0.01428434,
       -0.1986822 , -0.01038861, -0.21296654, -0.04934591, -0.14284341,
        0.07012313, -0.2701039 , -0.21686227, -0.17011352, -0.10518469,
       -0.18959216, -0.21296654, -0.14154483,  0.05194306, -0.05064448,
       -0.20257793, -0.26620817, -0.19998078, -0.27270106,  0.02727011,
       -0.29477686, -0.2649096 , -0.03506156, -0.22465373, -0.14414199])

Calculate element-wise temperature differences¶

# Start from an all-zero matrix
D = np.zeros([nTemp.size, nTemp.size])

# Find all element-wise temperature differences
for i in range(nTemp.size):
    for j in range(nTemp.size):
        D[i,j] = abs(nTemp[i]-nTemp[j])

Hierarchical Clustering¶

Y = sch.linkage(D, method='centroid')

C:\Users\roman\AppData\Local\Temp/ipykernel_14324/251272051.py:1: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  Y = sch.linkage(D, method='centroid')

Plot 1st Dendrogram¶

fig = plt.figure(figsize=(12,12))
ax1 = fig.add_axes([0.1,0.1,0.4,0.6])
Z1 = sch.dendrogram(Y, orientation='right')
# Build the labels as a list so they can be passed to set_yticklabels
labels = list(zip(map(lambda x: round(x,2), sDF['Tm'][Z1['leaves']]), sDF['Stn_Name'][Z1['leaves']]))
ax1.set_xticks([])
ax1.set_yticklabels(labels)
plt.plot() # suppress prints

Data Preprocessing¶
Now, we will continue with hierarchical clustering on the entire dataset. First, we will normalize the data:

X = normalize(df.xm.values.reshape(1, -1), axis=1)[0]
Y = normalize(df.ym.values.reshape(1, -1), axis=1)[0]
Tm = normalize(df.Tm.values.reshape(1, -1), axis=1)[0]
Tn = normalize(df.Tn.values.reshape(1, -1), axis=1)[0]
Tx = normalize(df.Tx.values.reshape(1, -1), axis=1)[0]
data=zip(Tm,Tn,Tx,X,Y)

# Grab values from DF (this overwrites the zipped data above)
# axis=1 scales each row (station) to unit L2 norm
data = normalize(df[['Tm','Tn','Tx','xm','ym']].values, axis=1)
print('Shape:',data.shape)
print(data)

Shape: (1188, 5)
[[ 3.58976212e-06  4.37775868e-07  5.90997422e-06  7.91413958e-01  6.11280580e-01]
 [ 3.12720065e-06 -1.34022885e-06  6.70114424e-06  7.88201573e-01  6.15417160e-01]
 [ 3.02754010e-06 -1.11306621e-06  7.12362377e-06  7.89536091e-01  6.13704131e-01]
 ...
 [-2.66011766e-06 -3.52138526e-06 -1.47178641e-06  9.38458983e-01  3.45390701e-01]
 [-2.42555457e-06 -3.45747910e-06 -6.17027039e-07  9.65771991e-01  2.59392487e-01]
 [-3.29796352e-06 -5.57201057e-06 -1.61921675e-06  9.68224599e-01  2.50082240e-01]]

Calculate tuple-wise distances¶

D = np.zeros([data.shape[0],data.shape[0]])
for i in range(data.shape[0]):
    for j in range(data.shape[0]):
        D[i,j] = euclidean(data[i],data[j])
D

array([[0.        , 0.00523743, 0.00306594, ..., 0.30384152, 0.39271612, 0.40215202],
       [0.00523743, 0.        , 0.0021715 , ..., 0.30901712, 0.39785025, 0.4072811 ],
       [0.00306594, 0.0021715 , 0.        , ..., 0.30687151, 0.39572191, 0.40515486],
       ...,
       [0.30384152, 0.30901712, 0.30687151, ..., 0.        , 0.09023133, 0.09984836],
       [0.39271612, 0.39785025, 0.39572191, ..., 0.09023133, 0.        , 0.00962788],
       [0.40215202, 0.4072811 , 0.40515486, ..., 0.09984836, 0.00962788, 0.        ]])

Clustering¶

Y = sch.linkage(D, method='centroid')

C:\Users\roman\AppData\Local\Temp/ipykernel_14324/251272051.py:1: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  Y = sch.linkage(D, method='centroid')

Cluster Visualization¶
Visualize Dendrograms¶
We will visualize the dendrograms, as we did before with just the sample dataset.

fig = plt.figure(figsize=(12,12))
ax1 = fig.add_axes([0.1,0.1,0.4,0.6])

# Get dendrograms
Z1 = sch.dendrogram(Y, orientation='right')

# Get labels
labels = list(zip(
    map(lambda x: round(x,2), df.Tx.iloc[Z1['leaves']]),
    map(lambda x: round(x,2), df.Tm.iloc[Z1['leaves']]),
    map(lambda x: round(x,2), df.Tn.iloc[Z1['leaves']]),
    df['Stn_Name'].iloc[Z1['leaves']],
    map(lambda x: round(x,2), df.Lat.iloc[Z1['leaves']]),
    map(lambda x: round(x,2), df.Long.iloc[Z1['leaves']])
))

ax1.set_xticks([])
ax1.set_yticklabels(labels)
plt.plot()

Get labels¶
Get the clustering results and append them to the dataframe.

labels = sch.fcluster(Y, 0.8*D.max(), 'distance')
df["hier_Clusters"]=labels-1
df[["Stn_Name","Tx","Tm","hier_Clusters"]].head()

   Stn_Name                Tx    Tm   hier_Clusters
0  CHEMAINUS               13.5  8.2  22
1  COWICHAN LAKE FORESTRY  15.0  7.0  22
2  LAKE COWICHAN           16.0  6.8  22
3  DUNCAN KELVIN CREEK     14.5  7.7  22
4  ESQUIMALT HARBOUR       13.1  8.8  23

Visualize on Map¶
Now we will visualize the results on an actual map.

plt.figure(figsize=(14,10))

Long = [-140,-50] # Longitude Range
Lat = [40,65] # Latitude Range

# Query dataframe for long/lat in the above range
df = df[(df['Long'] > Long[0])
        & (df['Long'] < Long[1])
        & (df['Lat'] > Lat[0])
        & (df['Lat'] < Lat[1])]

# Create basemap map
my_map = Basemap(
    projection='merc',
    resolution='l',
    area_thresh=1000.0,
    llcrnrlon=Long[0],  # Lower-left corner longitude
    urcrnrlon=Long[1],  # Upper-right corner longitude
    llcrnrlat=Lat[0],   # Lower-left corner latitude
    urcrnrlat=Lat[1]    # Upper-right corner latitude
)

# Basemap map drawing parameters
my_map.drawcoastlines()
my_map.drawcountries()
my_map.drawmapboundary()
my_map.fillcontinents(color='green', alpha=0.3)
my_map.shadedrelief()

# Create color map
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, len(set(df.hier_Clusters.values))))

# Plot x,y points, and color by cluster
for index,row in df.iterrows():
    my_map.plot(row.xm, row.ym, markerfacecolor=colors[row.hier_Clusters], marker='o', markersize=5, alpha=0.75)

# Label clusters
for i in range(len(set(labels))):
    cluster = df[df.hier_Clusters == i][["Stn_Name","Tm","xm","ym","hier_Clusters"]]

    # Get centroid of cluster
    xc = np.mean(cluster.xm)
    yc = np.mean(cluster.ym)

    # Get mean temp of cluster
    Tavg = np.mean(cluster.Tm)

    # Label cluster on map
    plt.text(xc,yc,str(i),fontsize=30,color='red')

    # Print average temperatures
    print("Cluster "+str(i)+', Avg Temp: '+ str(Tavg))

Cluster 0, Avg Temp: -17.86
Cluster 1, Avg Temp: -14.65
Cluster 2, Avg Temp: -13.15
Cluster 3, Avg Temp: -12.866666666666667
Cluster 4, Avg Temp: -13.799999999999999
Cluster 5, Avg Temp: 2.7624999999999993
Cluster 6, Avg Temp: -2.055555555555555
Cluster 7, Avg Temp: -4.45
Cluster 8, Avg Temp: -7.336363636363635
Cluster 9, Avg Temp: -7.825
Cluster 10, Avg Temp: -6.111111111111111
Cluster 11, Avg Temp: -17.680769230769222
Cluster 12, Avg Temp: -10.972222222222223
Cluster 13, Avg Temp: -14.364687499999999
Cluster 14, Avg Temp: -20.552727272727278
Cluster 15, Avg Temp: -22.6125
Cluster 16, Avg Temp: -23.210526315789473
Cluster 17, Avg Temp: -4.43793103448276
Cluster 18, Avg Temp: -8.569230769230769
Cluster 19, Avg Temp: -6.422727272727273
Cluster 20, Avg Temp: -2.252380952380952
Cluster 21, Avg Temp: 1.0960784313725487
Cluster 22, Avg Temp: -4.0606060606060606
Cluster 23, Avg Temp: -4.856701030927836
Cluster 24, Avg Temp: -12.170491803278688
Cluster 25, Avg Temp: -18.693478260869565
Cluster 26, Avg Temp: -10.239436619718312
Cluster 27, Avg Temp: -7.618421052631579
Cluster 28, Avg Temp: -10.95
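
The ClusterWarning printed by sch.linkage in the clustering steps appears because a square distance matrix was passed, while SciPy's linkage treats a 2-D input as raw observation vectors and expects precomputed distances in condensed 1-D form. A minimal sketch of the condensed form on toy data follows; the `points` array and the other names are hypothetical, and using this form in the notebook would change the cluster assignments reported above.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy observations: two tight groups of 2-D points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Square, symmetric distance matrix, like the one the nested loops build
D_square = squareform(pdist(points, metric='euclidean'))

# Condense it to the 1-D form that linkage interprets as pairwise distances
D_condensed = squareform(D_square)

Z = sch.linkage(D_condensed, method='centroid')
labels = sch.fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

SciPy documents the 'centroid' method as correctly defined only for Euclidean distances, so a condensed input should hold Euclidean distances, as it does here. Calling `pdist` directly on the observations also avoids the O(n²) Python loops used to fill D.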