Week 9 Spark ML
COMP5349 Week 9 Learning Example¶
This is a sample notebook showing how to use the Spark machine learning library. In particular, it shows how to prepare data as input to a machine learning model and how to convert the output back to local data structures for further processing, such as visualization.
The program uses the classic MNIST dataset of handwritten digits [http://yann.lecun.com/exdb/mnist/]. To facilitate processing, we have converted the original data set into corresponding CSV files.
The notebook demonstrates the usage of two algorithms: PCA and KMeans. It uses 1/10 of the test data set, which is small enough to run on a single machine.
Lab Exercises¶
Try to run the original notebook and observe the output.
Organize the PCA-KMeans sequence into a pipeline. Then, run the whole pipeline to produce a model and use that model to compute cluster membership (see the sketch after this list).
Use PCA to project the original features to three or more dimensions and use the new PCA results to run the kmeans algorithm.
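For exercise 2, a minimal sketch of how the two stages might be chained with pyspark.ml.Pipeline. The stage parameters simply mirror the individual PCA and KMeans cells later in this notebook, and test_vectors is the feature DataFrame built below:
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans

pca = PCA(k=2, inputCol="features", outputCol="pca")  # raise k to 3 or more for exercise 3
kmeans = KMeans(featuresCol="pca", k=10)
pipeline = Pipeline(stages=[pca, kmeans])

# fit() runs PCA then K-Means in sequence; transform() adds a
# "prediction" column holding each point's cluster membership.
pipeline_model = pipeline.fit(test_vectors)
clustered = pipeline_model.transform(test_vectors)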
%pip install pyspark
Requirement already satisfied: pyspark in c:\users\ty536\appdata\local\programs\python\python39\lib\site-packages (3.2.1)
Requirement already satisfied: py4j==0.10.9.3 in c:\users\ty536\appdata\local\programs\python\python39\lib\site-packages (from pyspark) (0.10.9.3)
Note: you may need to restart the kernel to use updated packages.
WARNING: You are using pip version 22.0.4; however, version 22.1 is available.
You should consider upgrading via the 'c:\Users\ty536\AppData\Local\Programs\Python\Python39\python.exe -m pip install --upgrade pip' command.
# Import all necessary libraries and setup the environment for matplotlib
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
import numpy as np
import matplotlib.pyplot as plt
Load the data in csv file¶
The following cell loads the image data from a headerless CSV file (the last line of the next cell). Each row of the CSV file contains the raw pixel values of an image. Each value is an integer between 0 and 255; 0 means background (white) and 255 means foreground (black). The image has 28×28 pixels and is flattened to a vector of 784 values. Hence, each row contains 784 columns. Spark is instructed to infer the schema from the file.
The data file is loaded initially as a data frame consisting of rows. Each row has 784 columns with default names such as _c0, _c1, …
input_path = './'
spark = SparkSession \
    .builder \
    .appName("Python Learning Example") \
    .getOrCreate()
test_datafile = input_path + "Test-1000-data.csv"
test_labelfile = input_path + "Test-1000-label.csv"
test_df = spark.read.csv(test_datafile, header=False, inferSchema="true")
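A quick sanity check (not in the original notebook) confirms the expected shape of the loaded data frame:
# Expect 784 pixel columns and 1000 rows (1/10 of the MNIST test set).
print(len(test_df.columns), test_df.count())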
Converting to Vector row¶
Most machine learning algorithms expect individual input data to be a vector representing the features of a data point. Instead of having 784 columns of integer type per row, we need a single column of vector type per row. The column is usually called "features". Spark provides a mechanism called VectorAssembler to combine a number of columns into a single vector column [https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ml-features.html#vectorassembler].
The following cell combines all columns in the dataframe into a single column called features of type Vector. The values of the first two rows are displayed. You may notice that Spark has chosen the SparseVector format to represent the image pixel values. All images in MNIST have large portions of white background, meaning a pixel value of 0. For instance, the first row has 0s in its first 131 features; features 131, 132, 157, etc. have non-zero values.
assembler = VectorAssembler(inputCols=test_df.columns,
                            outputCol="features")
test_vectors = assembler.transform(test_df).select("features")
test_vectors.show(2)
+——————–+
| features|
+——————–+
|(784,[131,132,157…|
|(784,[125,126,127…|
+——————–+
only showing top 2 rows
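To see the sparse layout directly, one can pull the first vector to the driver and inspect its fields (a quick check, not part of the original notebook):
first = test_vectors.first().features
print(first.size)         # total number of features: 784
print(first.indices[:5])  # positions of the first few non-zero pixels
print(first.values[:5])   # the corresponding pixel values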
Display the actual digit¶
The following cells take the first row of the data frame and return its value to the driver program. It is converted into a numpy array, reshaped to a 28×28 matrix, and displayed as a greyscale image.
first_digit = np.array(test_vectors.head(1))
plt.figure(figsize=(1,1))
plt.imshow(first_digit[0].reshape(28,28), cmap="gray_r")
plt.axis('off')
(-0.5, 27.5, 27.5, -0.5)
Project data points into a two-dimensional space¶
The following cell uses a PCA transformation to project the 784-dimensional feature vectors onto the first two principal components. Line 1 creates a PCA instance. Line 2 trains the model with the given data set: test_vectors. Line 3 uses the trained model to perform the actual transformation on the data set: test_vectors. We are only interested in the transformed vectors, so we select the column that corresponds to the output: pca.
pca = PCA(k=2, inputCol="features", outputCol="pca")
model = pca.fit(test_vectors)
pca_result = model.transform(test_vectors).select('pca')
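The fitted PCAModel also reports how much of the variance each component captures, which is a quick way to judge whether two components are enough (a diagnostic not in the original notebook):
# Fraction of the total variance captured by each of the k components.
print(model.explainedVariance)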
Inspect the PCA projected vectors¶
Next, we collect the PCA results as a local variable and display the top five vectors.
local_pca=np.array(pca_result.collect())
local_pca[:5]
array([[[ 380.6113932 , -1001.53056022]],
[[ 491.50243931, 137.02761024]],
[[ -37.79790919, 237.76285625]],
[[ 744.24227215, 133.86798112]],
[[ 1371.50690043, 442.66543942]]])
# reshape to a 2-D array of shape (n_samples, 2)
local_pca = local_pca.reshape((local_pca.shape[0], 2))
local_pca[:5]
array([[ 380.6113932 , -1001.53056022],
[ 491.50243931, 137.02761024],
[ -37.79790919, 237.76285625],
[ 744.24227215, 133.86798112],
[ 1371.50690043, 442.66543942]])
Read label file¶
The next cell reads the label file and converts it into a list. We use Spark here again only for demonstration purposes; since the label data is small and used only locally, it could equally be read with a normal Python package.
label_list = spark.read.csv(test_labelfile,header=False,inferSchema=”true”).collect()
labels = np.array([r['_c0'] for r in label_list])
labels[:5]
array([4, 8, 1, 6, 3])
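For reference, a plain-Python alternative (a sketch assuming the label file holds one integer per line):
# np.loadtxt reads the single-column CSV straight into an int array.
labels = np.loadtxt(test_labelfile, dtype=int)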
Visualize the 2-D projected vectors¶
Visualization is an easy way to check whether the PCA projection is a good representation of the original data. We plot the data in a scatterplot; data points are painted in different colours based on their labels. The data set consists of 10 classes, so we expect to see some natural clusterings if the 2-D projection preserves important original features. The visualization shows some natural clustering for labels 1 and 0, but data points of all other labels are scattered. This suggests that the first two principal components may not be a good representation; we may need to take more components if we want to use PCA as a dimensionality-reduction tool.
plt.figure(figsize=(7,7))
plt.scatter(local_pca[:, 0], local_pca[:, 1], s=20, c=labels, cmap='rainbow')
txts = []
for i in range(10):
    # Position each class label at the median of its points.
    xtext, ytext = np.median(local_pca[labels == i, :], axis=0)
    txt = plt.text(xtext, ytext, str(i), fontsize=24)
    txts.append(txt)
KMeans Example¶
The next cell demonstrates the usage of another algorithm, K-Means [https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ml-clustering.html]. To facilitate visualization, we run k-means on the projected 2-D data set. In line 1, the KMeans instance is created; it uses the features stored in the "pca" column as specified. We apply the same sequence of fit followed by transform. After running transform, a new column called prediction is added to each data point. The column stores the cluster membership of that data point, indicated as an integer cluster id.
kmeans = KMeans(featuresCol='pca', k=10)
model = kmeans.fit(pca_result)
predictions = model.transform(pca_result)
centers = model.clusterCenters()
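As an optional check (not in the original notebook), Spark's ClusteringEvaluator can score the clustering with the silhouette measure; values closer to 1 indicate tighter, better-separated clusters:
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette with squared Euclidean distance (the default metric).
evaluator = ClusteringEvaluator(featuresCol='pca', predictionCol='prediction')
print(evaluator.evaluate(predictions))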
Visualize the K-Means Results¶
The next few cells collect the results locally and format them for visualization.
local_kmeans_data = predictions.select('pca', 'prediction').collect()
local_kmeans_data[:10]
[Row(pca=DenseVector([380.6114, -1001.5306]), prediction=4),
Row(pca=DenseVector([491.5024, 137.0276]), prediction=5),
Row(pca=DenseVector([-37.7979, 237.7629]), prediction=3),
Row(pca=DenseVector([744.2423, 133.868]), prediction=8),
Row(pca=DenseVector([1371.5069, 442.6654]), prediction=7),
Row(pca=DenseVector([282.4655, -429.1815]), prediction=9),
Row(pca=DenseVector([476.4517, -720.5809]), prediction=4),
Row(pca=DenseVector([1654.7591, 272.4142]), prediction=6),
Row(pca=DenseVector([320.4858, -612.633]), prediction=9),
Row(pca=DenseVector([891.763, -17.0652]), prediction=1)]
kmeans_points = [v.pca.values.tolist() for v in local_kmeans_data]
kmeans_points[:10]
[[380.6113932027025, -1001.5305602156552],
[491.50243931208615, 137.02761023786408],
[-37.797909188915646, 237.7628562493339],
[744.2422721523451, 133.86798112238196],
[1371.5069004252834, 442.66543942283977],
[282.4655113717247, -429.1814834062374],
[476.45173785923635, -720.5808704615044],
[1654.759088021057, 272.41423394307293],
[320.48581521135054, -612.6330186088622],
[891.7629576283271, -17.065209848663677]]
kmeans_labels=[v.prediction for v in local_kmeans_data]
kmeans_labels[:10]
[4, 5, 3, 8, 7, 9, 4, 6, 9, 1]
center_points = np.array([v.tolist() for v in centers])
center_points
array([[1786.05403822, -599.20460347],
[ 977.04457649, -215.71517215],
[ 812.80482579, -729.55919676],
[ 39.11788196, 516.22382925],
[ 390.19269174, -981.07219707],
[ 371.15570521, -12.39382026],
[1646.04041628, -29.58587012],
[1285.25749431, 501.79902635],
[ 752.46994207, 422.04360274],
[ 194.59848433, -495.85516158]])
kmeans_points= np.array(kmeans_points)
plt.figure(figsize=(7,7))
plt.scatter(kmeans_points[:, 0], kmeans_points[:, 1], s=20, c=kmeans_labels, cmap='jet')
plt.scatter(center_points[:, 0], center_points[:, 1], s=200, c='black', alpha=0.5);