Microsoft Learning Experiences

CMP3036M/CMP9063M Data Science
2016 – 2017 Semester B Week 06 Workshop

In this workshop, you will use Azure Machine Learning Studio to build k-means clustering models.

Task 1: K-Means for Adult Census Income

Prepare the Data

1. Open a browser and browse to https://studio.azureml.net. Then sign in using the Microsoft account
associated with your Azure ML account.

2. Create a new blank experiment and name it Adult Income Clustering.

3. In the Adult Income Clustering experiment, drag the Adult Census Income Binary Classification
sample dataset to the canvas. Note that using this dataset does not mean you are going to do a
classification task!

4. Visualize the output of the dataset, and review the data it contains. The dataset contains the following
variables:

• age: A numeric feature representing the age of the census respondent.
• workclass: A string feature representing the type of employment of the census respondent.
• fnlwgt: A numeric feature representing the weighting of this record from the census sample when applied to the total population.
• education: A string feature representing the highest level of education attained by the census respondent.
• education-num: A numeric feature representing the highest level of education attained by the census respondent.
• marital-status: A string feature indicating the marital status of the census respondent.
• occupation: A string feature representing the occupation of the census respondent.
• relationship: A categorical feature indicating the family relationship role of the census respondent.
• race: A string feature indicating the ethnicity of the census respondent.
• sex: A categorical feature indicating the gender of the census respondent.
• capital-gain: A numeric feature indicating the capital gains realized by the census respondent.
• capital-loss: A numeric feature indicating the capital losses incurred by the census respondent.
• hours-per-week: A numeric feature indicating the number of hours worked per week by the census respondent.
• native-country: A string feature indicating the nationality of the census respondent.
• income: A label indicating whether the census respondent earns $50,000 or less, or more than $50,000.

5. Add a Select Columns in Dataset module to the experiment, and connect the output of the dataset to
its input.

6. Select the Select Columns in Dataset module, and in the Properties pane launch the column selector.
Then use the column selector to exclude the following column:

• education

You can use the With Rules page of the column selector to accomplish this.

7. Add a Normalize Data module to the experiment and connect the output of the Select Columns
in Dataset module to its input.

8. Set the properties of the Normalize Data module as follows:
• Transformation method: MinMax
• Use 0 for constant columns: Unselected
• Columns to transform: All numeric columns

9. Add an Edit Metadata module to the experiment, and connect the Transformed dataset (left) output of
the Normalize Data module to its input.

10. Set the properties of the Edit Metadata module as follows:
• Column: All string columns
• Data type: Unchanged
• Categorical: Make categorical
• Fields: Unchanged
• New column names: Leave blank

11. Verify that your experiment looks like the following, and then save and run the experiment:

12. When the experiment has finished running, visualize the output of Edit Metadata and verify that:
• The columns you specified have been removed.
• All numeric columns now contain a scaled value between 0 and 1.
• All string columns now have a Feature Type of Categorical Feature.
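
For readers who want to cross-check the result outside Studio, the preparation steps above can be approximated in pandas. This is only a sketch under stated assumptions: the four-row `adult` data frame is made up for illustration, the scaling uses the standard min-max formula (x - min) / (max - min) rather than anything known about how the Normalize Data module is implemented, and a constant column would produce NaN here.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the Adult Census Income dataset.
adult = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    "education-num": [13, 13, 9, 7],
    "hours-per-week": [40, 13, 40, 40],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Steps 5-6: exclude the education column.
prepared = adult.drop(columns=["education"])

# Steps 7-8: min-max scale every numeric column to the range [0, 1].
numeric_cols = prepared.select_dtypes(include="number").columns
col_min = prepared[numeric_cols].min()
col_range = prepared[numeric_cols].max() - col_min
prepared[numeric_cols] = (prepared[numeric_cols] - col_min) / col_range

# Steps 9-10: mark every remaining (string) column as categorical.
string_cols = [c for c in prepared.columns if c not in numeric_cols]
for col in string_cols:
    prepared[col] = prepared[col].astype("category")

# Step 12: inspect the result.
print(prepared.dtypes)
print(prepared[numeric_cols].agg(["min", "max"]))
```

The two printed summaries correspond to the checks in step 12: the excluded column is gone, numeric columns span 0 to 1, and string columns are reported as `category`.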

Determine the Number of Clusters

Now that the data is prepared, you are ready to use K-Means clustering to separate the data out into clusters.

1. Add a K-Means Clustering module to the Adult Income Clustering experiment, and set its properties
as follows:

• Create trainer mode: Single Parameter
• Range for number of Centroids: 3
• Initialization for sweep: K-Means++
• Random number seed: 100
• Metric: Euclidean
• Iterations: 100
• Assign Label Mode: Fill missing values

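2. Add a Train Clustering Model module to the experiment, and connect the output of the K-Means
Clustering module to its Untrained model (left) input. Then connect the output of the Edit Metadata
module to its Dataset (right) input.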

3. Verify that the experiment resembles this, and then save and run the experiment.
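
As a point of comparison, a roughly equivalent configuration can be sketched with scikit-learn's `KMeans`. This is not the Azure module's implementation: the synthetic points in `X` are hypothetical, `n_init` is a scikit-learn setting with no counterpart in the properties listed above, and scikit-learn always uses Euclidean distance. Categorical columns would also need to be encoded numerically (for example, one-hot) before being passed to `KMeans`, which this sketch sidesteps by using numeric data only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated groups of 2-D points in roughly [0, 1].
rng = np.random.RandomState(0)
centres = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.8]])
X = np.vstack([c + 0.03 * rng.randn(50, 2) for c in centres])

# Settings chosen to mirror the module properties from step 1.
model = KMeans(
    n_clusters=3,       # number of centroids
    init="k-means++",   # K-Means++ initialization
    n_init=10,          # scikit-learn only: number of restarts (an assumption)
    max_iter=100,       # iterations
    random_state=100,   # random number seed
)

# Fit the model and obtain one cluster index per row.
assignments = model.fit_predict(X)

print(model.cluster_centers_)
print(np.bincount(assignments))
```

The `assignments` array plays the same role as the Assignments column used by the plotting code later in this workshop.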

Visualize the Data

1. Add a Convert to CSV module to the experiment and connect the Results dataset (right) output of the
Train Clustering Model module to its input. Then run the experiment.

2. Right-click the output of the Convert to CSV module and in the Open in a new workbook submenu,
click Python 2.

3. In the new browser tab that opens, at the top of the page, rename the new workbook Adult Income
Clustering Notebook.

4. Review the code that has been generated automatically. The code in the first cell loads a data frame
from your experiment. The code in the second cell uses the frame.head(5) command to display the first
5 rows of data.

5. On the Insert menu, click Insert Cell Below to add a new cell to the notebook.

6. Assign frame to frame1.

7. Look at the data types in frame1.

8. Generate the following plots (see the example code below).

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn.decomposition as de
import pandas as pd

Azure = True  # flag checked at the end of this cell

## Create data frames for each cluster
temp0 = frame1.loc[frame1['Assignments'] == 0, :]
temp1 = frame1.loc[frame1['Assignments'] == 1, :]
temp2 = frame1.loc[frame1['Assignments'] == 2, :]

## Scatter plots of age vs. the numeric variables, one subplot per variable
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss']
fig = plt.figure(figsize = (12, 24))
fig.clf()
i = 1
for col in num_cols:
    ax = fig.add_subplot(5, 1, i)
    title = 'Scatter plot of ' + col + ' vs. age'
    temp0.plot(kind = 'scatter', x = col, y = 'age', color = 'DarkBlue', label = 'Group 0', alpha = 0.3, ax = ax)
    temp1.plot(kind = 'scatter', x = col, y = 'age', color = 'Red', label = 'Group 1', alpha = 0.3, ax = ax)
    temp2.plot(kind = 'scatter', x = col, y = 'age', color = 'Yellow', label = 'Group 2', alpha = 0.3, ax = ax)
    ax.set_title(title)
    ax.set_xlabel('')
    i += 1

if(Azure == True): fig.show()
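
The plotting code above assumes that steps 6 and 7 have already been run in an earlier cell. A minimal sketch of those two steps follows; the small `frame` built here is a made-up stand-in for the data frame that the generated notebook code loads from the experiment, and only the Assignments column name is taken from the plotting code.

```python
import pandas as pd

# Hypothetical stand-in for the `frame` variable created by the generated cell.
frame = pd.DataFrame({
    "age": [0.30, 0.45, 0.29],
    "workclass": ["State-gov", "Private", "Private"],
    "Assignments": [0, 2, 1],
})

# Step 6: assign frame to frame1. Both names now refer to the same object;
# use frame.copy() instead if an independent copy is wanted.
frame1 = frame

# Step 7: look at the data type of each column.
print(frame1.dtypes)
```

Checking the data types first is useful because the scatter plots require numeric columns, and the cluster filter compares Assignments with the integers 0, 1 and 2.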

Task 2: Forest Fire Clustering

1. Download the dataset ForestFires.ARFF from Blackboard.

2. Upload the dataset to Azure ML Studio.

3. Repeat the data analysis, clustering and plots as in Task 1, with the following changes:

a. Remove the lowest 1% of observations (i.e., rows) of variable FFMC.

b. Remove the highest 1% of observations (i.e., rows) of variables ISI and Rain.

c. Use z-score to normalize the data.

d. Set the number of clusters to be in the range from 2 to 5.

e. In the Jupyter notebook, generate the following plots.

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn.decomposition as de
import pandas as pd

Azure = True
## Compute and plot the clusters by first two principal components
## (this example assumes three clusters, with Assignments 0, 1 and 2)
num_cols = ['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'area']
pca = de.PCA(n_components = 2)
pca.fit(frame1[num_cols].values)
pca_frame = pd.DataFrame(pca.transform(frame1[num_cols].values), columns = ['PC1', 'PC2'])
## Copy by position (.values) so the rows line up even if frame1 has a non-default index
pca_frame['Assignments'] = frame1['Assignments'].values
temp0 = pca_frame.loc[pca_frame['Assignments'] == 0, :]
temp1 = pca_frame.loc[pca_frame['Assignments'] == 1, :]
temp2 = pca_frame.loc[pca_frame['Assignments'] == 2, :]
fig = plt.figure(figsize = (12, 6))
fig.clf()
ax = fig.gca()
temp0.plot(kind = 'scatter', x = 'PC1', y = 'PC2', color = 'DarkBlue', label = 'Group 0', alpha = 0.3, ax = ax)
temp1.plot(kind = 'scatter', x = 'PC1', y = 'PC2', color = 'Green', label = 'Group 1', alpha = 0.3, ax = ax)
temp2.plot(kind = 'scatter', x = 'PC1', y = 'PC2', color = 'Red', label = 'Group 2', alpha = 0.3, ax = ax)
ax.set_title('Clusters by principal component projections')
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
if(Azure == True): fig.show()
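
Steps a to c of this task can also be checked in pandas. The sketch below is one possible reading of those steps, not the required Studio solution: the values in `fires` are randomly generated placeholders, the column spelling `Rain` follows the task text and should be checked against the uploaded data, the percentile thresholds are all computed on the full data before any rows are dropped, and the z-score uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the task description.
rng = np.random.RandomState(1)
fires = pd.DataFrame({
    "FFMC": rng.uniform(20, 96, 200),
    "ISI": rng.uniform(0, 50, 200),
    "Rain": rng.uniform(0, 6, 200),
})

# a. keep rows at or above the 1st percentile of FFMC
# b. keep rows at or below the 99th percentile of ISI and of Rain
keep = (
    (fires["FFMC"] >= fires["FFMC"].quantile(0.01))
    & (fires["ISI"] <= fires["ISI"].quantile(0.99))
    & (fires["Rain"] <= fires["Rain"].quantile(0.99))
)
trimmed = fires[keep]

# c. z-score normalization: subtract the column mean, divide by the column std
zscored = (trimmed - trimmed.mean()) / trimmed.std(ddof=0)

print(len(fires), len(trimmed))
print(zscored.describe().loc[["mean", "std"]])
```

After normalization each column has a mean of approximately 0, so no single variable dominates the Euclidean distances used by k-means.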

Task 3: Implementation of Simple K-Means Using Python

1. Use the dataset ForestFires.ARFF from Task 2.

2. Replace the clustering modules in Task 2 with your own Python script.

Please check the following references:

• Peter Harrington. Machine Learning in Action. Manning Publications, 2012, Chapter 10.

• Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006, Section 9.1.
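
As a starting point for your own script, the basic k-means procedure (often called Lloyd's algorithm) can be sketched in NumPy as below. This is an illustrative outline rather than a reference solution: the function name `simple_kmeans`, its parameters and the synthetic test data are all hypothetical, the centroids are initialised by picking random rows (not K-Means++), and integrating the function into an Azure ML experiment is left to you.

```python
import numpy as np


def assign_labels(X, centroids):
    """Return the index of the nearest centroid (squared Euclidean) for each row."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def simple_kmeans(X, k, n_iter=100, seed=100):
    """Alternate assignment and centroid-update steps until the centroids stop moving."""
    rng = np.random.RandomState(seed)
    # Initialise the centroids as k distinct rows of X chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_labels(X, centroids)
        # Update step: each centroid becomes the mean of its assigned points;
        # a centroid with no assigned points is left where it is.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign_labels(X, centroids), centroids


# Hypothetical data: two groups of 2-D points centred near (0, 0) and (5, 5).
rng = np.random.RandomState(0)
X = np.vstack([0.1 * rng.randn(30, 2), 5.0 + 0.1 * rng.randn(30, 2)])
labels, centroids = simple_kmeans(X, k=2)
print(centroids)
print(np.bincount(labels, minlength=2))
```

On well-separated data such as this the two groups are usually recovered, but random initialisation can in principle converge to a poor local optimum, which is one reason the Studio module offers K-Means++ initialization.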