
Before you start working on the exercise¶
Use Python version 3.7 up to 3.9; make sure not to use Python 3.10.
It is highly recommended to create a virtual environment for this course. You can find resources on how to create a virtual environment on the ISIS page of the course.


Make sure that no assertions fail or exceptions occur, otherwise points will be subtracted.
Use all the variables given to a function unless explicitly stated otherwise. If you are not using a variable you are doing something wrong.
Read the whole task description before starting with your solution.
After you submit the notebook more tests will be run on your code. The fact that no assertions fail on your computer locally does not guarantee that you completed the exercise correctly.
Please submit only the notebook file with its original name. If you do not submit an ipynb file you will fail the exercise.
Edit only between YOUR CODE HERE and END YOUR CODE.
Verify that no syntax errors are present in the file.
Before uploading your submission, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All).

import sys

if (3, 7) <= sys.version_info[:2] <= (3, 9):
    print("Correct Python version")
else:
    print(f"You are using a wrong version of Python: {'.'.join(map(str, sys.version_info[:3]))}")

$$\Large\textbf{Python Programming for Machine Learning}$$
$$\Large\textbf{Exam}$$
$$\text{Department of Intelligent Data Analysis and Machine Learning}$$
$${29\text{th of November}\> 2021}$$

Read before starting with the exam!¶
The exam has a similar format to the exercise sheets you completed throughout the course.

Each exercise consists of:

Explanation
Implementation

The overwrite part means that after your function has been tested, the expected values will be placed in the corresponding variables, so that if you get stuck you can continue with the next exercise. If you get stuck on a task, it is highly recommended to continue with another task. Even if your solution is not correct and does not pass all the tests, it will receive partial credit for the correct parts.

If a solution cell does not compile (results in a SyntaxError) it will receive ZERO (0) credits, even if the implementation is principally correct.¶
For each exercise there will be a maximum number of loops allowed. If your function contains more loops than allowed, you will be notified during the function definition, and the function will automatically fail in the tests. Note that “unrolling a loop” (repeating a line many times) is also considered a loop.

For technical reasons the following functions are banned throughout the notebook.

sum (but np.sum is allowed)
np.vectorize
np.fromiter
np.fromfunction
np.apply_along_axis

If you use any of these functions in your solution, you will receive 0 points.
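For illustration, here is a banned reduction next to its allowed NumPy counterpart (a minimal example, not part of any exam task):

import numpy as np

arr = np.arange(10)
# total = sum(arr)   # banned built-in: would cost all points
total = np.sum(arr)  # allowed vectorized equivalent
print(total)         # 45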

Important:

Execute every cell in the notebook. You may also try to restart your kernel and execute all cells, in case something went wrong.

If you were not able to implement one function, you may proceed with the next exercise by using data generated from the expected-output functions.

Personal student information¶
In the following cell, fill in your real personal information. Make sure that the code compiles. This information may be used later for your class certificate.

NAME = "Max"  # your first name
MID_NAME = ""  # your middle name or empty string ""
SURNAME = "Mustermann"  # your last name

MATRICULATION_NUMBER = -1  # e.g. 412342 as integer

HOME_UNIVERSITY = ""  # e.g. TU Berlin, HU Berlin, Uni Potsdam, etc.
MODULE_NAME = ""  # e.g. CA, ML-1, ML-2, Standalone
COURSE_OF_STUDY = ""  # e.g. Mathematics, Computer Science, Physics, etc.
DEGREE = ""  # e.g. Erasmus, Bachelor, Diplom, Master, PhD or Guest (all others)

from IPython.display import Markdown as md

md(
    f"## Hello {NAME} {MID_NAME} {SURNAME} \n"
    f"### Your matriculation number is {MATRICULATION_NUMBER} \n"
    f"### You study at {HOME_UNIVERSITY} {COURSE_OF_STUDY} {DEGREE} \n"
    f"### Module name: {MODULE_NAME}\n"
    "## [zoom exam room](https://tu-berlin.zoom.us/j/68316661651?pwd=Yng4TmJDcW1sU3dpMTZwWlAzQktMUT09)\n"
    "## password: 997046"
)

print("Checking if external packages are installed correctly.")
try:
    import numpy
    import scipy
    import sklearn
    import pandas
except ImportError:
    print("Please install the needed packages using \"pip install -U numpy scipy pandas scikit-learn\"")

numpy_version = tuple(map(int, numpy.__version__.split(".")))
scipy_version = tuple(map(int, scipy.__version__.split(".")))
sklearn_version = tuple(map(int, sklearn.__version__.split(".")))
pandas_version = tuple(map(int, pandas.__version__.split(".")))

if numpy_version >= (1, 18, 0):
    print("NumPy version ok!")
else:
    print("Your NumPy version is too old!!!")

if scipy_version >= (1, 6, 0):
    print("SciPy version ok!")
else:
    print("Your SciPy version is too old!!!")

if sklearn_version >= (1, 0):
    print("scikit-learn version ok!")
else:
    print("Your scikit-learn version is too old!!!")

if pandas_version >= (1, 3, 0):
    print("pandas version ok!")
else:
    print("Your pandas version is too old!!!")

from IPython.core.display import HTML as Center

In this notebook we will explore a semi-supervised learning task. The task is to cluster a set of datapoints given the labels of a small subset of the dataset. The original dataset looks like this:

However, most of the labels have been removed from the data that we will use in this notebook. The goal is to cluster the data back into 3 clusters.

The original noisy data consists of more than 2 dimensions. We will use Principal Component Analysis (PCA) to remove the extra dimensions.

import numpy as np
from minified import max_allowed_loops, no_imports
from unittest import TestCase
from sklearn.utils.validation import check_is_fitted
from typing import Optional, Tuple

%matplotlib inline

t = TestCase()

Exercise 1: Data loading, initial data exploration and visualization¶
In this exercise we will load the data from the file and apply Principal Component Analysis (PCA) to determine how many components of the dataset are actually useful for our purposes. Then we will create two plots. The first one will show the explained variance of each component after PCA has been applied. After that, we will remove the components we deem superfluous and plot the transformed data as a scatter plot.

Exercise 1.1: Read from CSV using Pandas (5 points).¶
Implement a function that loads a CSV file as a Pandas DataFrame. You can use any Pandas functions you want; however, you cannot use any loops in the function.

import pandas as pd

@no_imports
@max_allowed_loops(0)
def read_data(filename: str) -> Optional[pd.DataFrame]:
    """
    Read data from a CSV file and return a pandas DataFrame. If the file does not
    exist, the function returns `None`.

    Args:
        filename: The name of the CSV file to read

    Returns:
        A Pandas DataFrame containing the data
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # END YOUR CODE

tiny_df = read_data("tiny.csv")
expected_tiny_df = pd.DataFrame(
    {
        "varA": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "varB": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
        "varC": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
        "label": [1, 2, 0, -1, -1, -1],
    }
)
pd.testing.assert_frame_equal(expected_tiny_df, tiny_df)

data_df = read_data("data.csv")

# check column names
np.testing.assert_array_equal(
    data_df.columns, ["varA", "varB", "varC", "varD", "varE", "varF", "label"]
)

# check data types of columns
np.testing.assert_array_equal(data_df.dtypes, [np.float64] * 6 + [np.int64])

# check first row
np.testing.assert_array_almost_equal(
    data_df.head(1).values[0], [-0.07, -0.547, -0.028, 0.791, 0.119, 0.004, -1.0],
)

print(data_df.head(5))

should_be_none = read_data("not_a_file.csv")
t.assertIsNone(should_be_none)

read_data.assert_no_imports()
read_data.assert_not_too_many_loops()
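For orientation, one loop-free implementation could look like the following sketch (an illustrative draft, not necessarily the graded reference solution; read_data_sketch is a hypothetical name):

import os
from typing import Optional
import pandas as pd

def read_data_sketch(filename: str) -> Optional[pd.DataFrame]:
    # Return None when the file does not exist, otherwise let pandas parse the CSV.
    if not os.path.exists(filename):
        return None
    return pd.read_csv(filename)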

from expected import get_exercise_1_1

data_df = get_exercise_1_1()

Exercise 1.2: Perform PCA using scikit-learn (5 points).¶
The data that we just loaded consists of 6 dimensions and a label. However, only two of the dimensions contain data relevant for the task; the other dimensions contain Gaussian noise. In this task we want to extract the useful information from the dataset using Principal Component Analysis (PCA).

from sklearn.decomposition import PCA

@no_imports
@max_allowed_loops(0)
def transform_data_pca(
    data: np.ndarray, n_components: Optional[int] = None
) -> Tuple[np.ndarray, PCA]:
    """
    Perform PCA on the data and return the transformed data.

    Args:
        data: A numpy array containing the data to transform
        n_components: The number of components (dimensions) to keep (relevant argument
            for PCA). If it is set to `None`, all components are kept.

    Returns:
        A tuple containing the transformed data and the PCA instance
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # END YOUR CODE

tiny_data = np.array(
    [
        [0.1, 0.0],
        [0.2, 0.0],
        [0.3, 0.1],
    ]
)
tiny_expected = np.array(
    [
        [-0.10389606, -0.01779664],
        [-0.0157286, 0.02938915],
        [0.11962465, -0.01159251],
    ]
)
# apply pca for example data
tiny_result, tiny_pca = transform_data_pca(tiny_data)
# check return types
t.assertIsInstance(tiny_pca, PCA)
t.assertIsInstance(tiny_result, np.ndarray)
# in the tiny example, the first component is responsible for most of the variance
# therefore, the first component's explained variance is very close to 1
t.assertGreater(tiny_pca.explained_variance_ratio_[0], 0.95)
t.assertLess(tiny_pca.explained_variance_ratio_[1], 0.05)
np.testing.assert_array_almost_equal(tiny_expected, tiny_result)

# the labels should not be used for PCA
data_array = data_df.values[:, :-1]
transformed_data_array, pca = transform_data_pca(data_array)

# check that pca instance has been fitted
check_is_fitted(pca)
np.testing.assert_array_equal(data_array.shape, transformed_data_array.shape)

# check that n_components is respected
transformed_data_array_two_components, _ = transform_data_pca(
    data_array, n_components=2
)
np.testing.assert_array_equal(transformed_data_array_two_components.shape, (998, 2))

transform_data_pca.assert_no_imports()
transform_data_pca.assert_not_too_many_loops()
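As a sketch of how scikit-learn could be used here (illustrative only; transform_data_pca_sketch is a hypothetical name):

from typing import Optional, Tuple
import numpy as np
from sklearn.decomposition import PCA

def transform_data_pca_sketch(
    data: np.ndarray, n_components: Optional[int] = None
) -> Tuple[np.ndarray, PCA]:
    # PCA(n_components=None) keeps all components, as the task description requires.
    pca = PCA(n_components=n_components)
    transformed = pca.fit_transform(data)
    return transformed, pca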

from expected import get_exercise_1_2

transformed_data_array = get_exercise_1_2()

Exercise 1.3: Plotting the explained variance for each transformed dimension (10 points).¶
Plot the cumulative explained variance for each component.
Use the explained_variance_ratio_ member of the PCA instance.
Draw a line plot for the cumulative explained variance. The markers should be visible as circles.
Draw a red, dashed horizontal line at the threshold ratio.
The horizontal ticks should have a range from 1 up to the number of components.
The title of the x-axis should be Number of components.
The title of the y-axis should be Cumulative explained variance.
The title of the plot should be Cumulative explained variance against number of components kept.

import matplotlib.pyplot as plt

@no_imports
@max_allowed_loops(0)
def plot_pca_variance(pca: PCA, threshold: float = 0.95) -> None:
    """
    Plot the explained variance of the PCA. A line plot is drawn for the cumulative
    explained variance of the components. A dashed horizontal line is drawn for the
    threshold.

    Args:
        pca: The PCA instance to use to plot the explained variance
        threshold: The threshold for the explained variance to use for plotting
    """
    plt.figure(figsize=(8, 6))
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # END YOUR CODE

plot_pca_variance(pca)
plot_pca_variance.assert_no_imports()
plot_pca_variance.assert_not_too_many_loops()
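A possible structure for such a plot, assuming the requirements listed above (a sketch; plot_pca_variance_sketch is a hypothetical name):

import numpy as np
import matplotlib.pyplot as plt

def plot_pca_variance_sketch(pca, threshold: float = 0.95) -> None:
    # Cumulative sum of the per-component explained variance ratios.
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    components = np.arange(1, len(cumulative) + 1)
    plt.figure(figsize=(8, 6))
    plt.plot(components, cumulative, marker="o")         # line plot, circle markers
    plt.axhline(threshold, color="red", linestyle="--")  # dashed threshold line
    plt.xticks(components)                               # ticks from 1 to n_components
    plt.xlabel("Number of components")
    plt.ylabel("Cumulative explained variance")
    plt.title("Cumulative explained variance against number of components kept")
    plt.show()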

This is a plot that you can use as a reference¶

From the above plot we see that the first two components explain most (> 95%) of the variance; therefore, we can keep only the first two components after the transformation.

For the rest of the exercises we will use the transformed data and the labels. The following cell creates the two variables which contain the data and the labels.

data = transformed_data_array[:, :2]
labels = data_df[“label”].values

print(data.shape, labels.shape)

Exercise 1.4: Plotting the clusters of the transformed data (20 points).¶
Create a scatter plot of the data which visualizes the datapoints with known labels, the means of each cluster, and the unlabeled datapoints. Below you can find a scatter plot that you can use as a reference. Use the plt.plot function to create the plots, not plt.scatter.

Title:

Set according to the provided argument.
Font size: 25.

Legend:

Font size: 20.

Elements for which the label is not known (label == -1):

Marker shape should be a circle.
Marker color should be black.
Marker alpha should be 0.1.
The label should be “Unlabeled”.

Elements for which the label is known:

Marker shape should be a square.
Marker color should be unique for each cluster.
Marker alpha should be 0.75.
Marker size should be 50 (use s argument).
Markers should be drawn above all unlabeled points (use zorder).
The label of each cluster should be "Cluster {label}: {number_of_elements_in_cluster}". For example: "Cluster 0: 10".

The means of each cluster:

Marker shape should be a cross.
Marker color should be red.
Marker size should be 100 (use s argument).
Markers should be drawn above all other markers (use zorder).

@no_imports
@max_allowed_loops(1)
def plot_clusters(X: np.ndarray, y: np.ndarray, title: str = "") -> None:
    """
    Plot the data. Datapoints for which the cluster label is known are plotted
    differently compared to datapoints for which the label is unknown.

    The empirical mean of each cluster is also plotted.

    Args:
        X: The data to plot
        y: The cluster labels (if the cluster is not known the label is `-1`)
        title: The title of the plot
    """
    plt.figure(figsize=(12, 10))
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # END YOUR CODE

plot_clusters(data, labels, "Data distribution")
plot_clusters.assert_no_imports()
plot_clusters.assert_not_too_many_loops()
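One way to reconcile the requirements above (plt.plot for the unlabeled cloud, plt.scatter where the s and zorder arguments are needed) is sketched below; treat it as an illustrative draft, not the reference solution (plot_clusters_sketch is a hypothetical name):

import numpy as np
import matplotlib.pyplot as plt

def plot_clusters_sketch(X, y, title=""):
    plt.figure(figsize=(12, 10))
    unlabeled = y == -1
    # black, transparent circles for datapoints without a label
    plt.plot(X[unlabeled, 0], X[unlabeled, 1], "o",
             color="black", alpha=0.1, label="Unlabeled")
    for label in np.unique(y[~unlabeled]):  # the single allowed loop
        cluster = X[y == label]
        plt.scatter(cluster[:, 0], cluster[:, 1], marker="s", alpha=0.75,
                    s=50, zorder=2, label=f"Cluster {label}: {len(cluster)}")
        # red cross on the empirical cluster mean, drawn above everything else
        plt.scatter(*cluster.mean(axis=0), marker="x", color="red", s=100, zorder=3)
    plt.title(title, fontsize=25)
    plt.legend(fontsize=20)
    plt.show()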

This is a plot that you can use as a reference¶

Exercise 2: Initialization steps¶
Our goal is to assign a cluster to each datapoint. We will model the three clusters as three multivariate normal distributions; therefore, we need to estimate the parameters of each distribution. The algorithm we use is iterative and requires an initialization of the parameters. Since we model the distributions as multivariate normals, we need to specify three parameters: the mean of each cluster ($\Large{\mu}$), the covariance matrix of each Gaussian ($\Large{\Sigma}$), and the proportion of elements in each cluster ($\Large{\pi}$). In Exercise 3 we will implement the algorithm that estimates the correct parameters.
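In other words, the data is modeled as a Gaussian mixture; under this model the density of a datapoint $x$ is

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

where $K$ is the number of clusters, $\pi_k \geq 0$ and $\sum_{k} \pi_k = 1$.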

Exercise 2.1: Initial estimation of cluster means (7 points).¶
Calculate the empirical cluster mean for each cluster based on labeled data-points.

@no_imports
@max_allowed_loops(1)
def initialize_mus(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Initialize the means of the clusters. The empirical mean of each cluster
    is calculated and used as the initial mean. If a datapoint does not have a label
    assigned to it (`label == -1`) then it is not considered for the mean calculation.

    Args:
        X: The datapoints
        y: The cluster labels (if the cluster is not known the label is `-1`)

    Returns:
        The initial, empirical means
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # END YOUR CODE

    return mus

test_data = np.random.randn(100, 2)

# test case with the same label everywhere
test_labels = np.zeros(len(test_data), dtype=np.int64)
test_result = initialize_mus(test_data, test_labels)
np.testing.assert_array_equal(test_result.shape, (1, 2))
expected_result = np.mean(test_data, axis=0, keepdims=True)
np.testing.assert_array_equal(test_result, expected_result)

# test that values with no label are not used for the mean calculation
# create random data with non-label
random_data_to_add = np.random.rand(100, 2)
non_labels = np.full(len(random_data_to_add), -1)
test_data = np.concatenate((test_data, random_data_to_add))
test_labels = np.concatenate((test_labels, non_labels))

# calculate again (result should be the same)
new_result = initialize_mus(test_data, test_labels)
np.testing.assert_array_equal(new_result.shape, (1, 2))
np.testing.assert_array_almost_equal(new_result, test_result)

# add second cluster
new_cluster_data = np.random.rand(20, 2) + [2, 2]
new_cluster_labels = np.full(len(new_cluster_data), 1)
test_data = np.concatenate((test_data, new_cluster_data))
test_labels = np.concatenate((test_labels, new_cluster_labels))

two_clusters_result = initialize_mus(test_data, test_labels)
np.testing.assert_array_equal(two_clusters_result.shape, (2, 2))
# first cluster mean should stay the same
np.testing.assert_array_almost_equal(two_clusters_result[0], expected_result[0])
# second cluster mean should be the mean of the new cluster
np.testing.assert_array_almost_equal(
    two_clusters_result[1], np.mean(new_cluster_data, axis=0)
)

# test with shuffled data (results should stay the same)
shuffle_idx = np.random.rand(len(test_data)).argsort()
test_data = test_data[shuffle_idx]
test_labels = test_labels[shuffle_idx]
shuffle_result = initialize_mus(test_data, test_labels)
np.testing.assert_array_almost_equal(shuffle_result, two_clusters_result)

mus = initialize_mus(data, labels)
print(mus)
np.testing.assert_array_equal(mus.shape, (3, 2))

initialize_mus.assert_no_imports()
initialize_mus.assert_not_too_many_loops()
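A sketch of one way to compute these means within the one-loop budget (illustrative only; initialize_mus_sketch is a hypothetical name):

import numpy as np

def initialize_mus_sketch(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    cluster_ids = np.unique(y[y != -1])  # ignore unlabeled points
    mus = np.empty((len(cluster_ids), X.shape[1]))
    for i, c in enumerate(cluster_ids):  # the single allowed loop
        mus[i] = X[y == c].mean(axis=0)
    return mus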

from expected import get_exercise_2_1

mus = get_exercise_2_1()
number_of_clusters = len(mus)
print(f"Number of clusters: {number_of_clusters}")

Exercise 2.2: Initialization of covariance matrices (5 points).¶
Implement a function that initializes the covariance matrix for each cluster. All clusters are initialized with the same covariance matrix, which contains the initial value on its diagonal.

Initialize diagonal covariance matrices for each cluster.
The shape of the output should be (K, d, d) where $K$ is the number of clusters and $d$ is the dimensionality of the data.
Broadcasting hint: $(K, d, 1) \times (1, d, d) = (K, d, d)$.

@max_allowed_loops(0)
@no_imports
def initialize_sigmas(K: int, d: int, initial_value: float) -> np.ndarray:
    """
    Initialize the covariance matrix for each cluster. The initial covariance matrix
    for each cluster is a diagonal matrix with the diagonal elements equal to
    `initial_value`.

    Args:
        K: The number of clusters
        d: The dimension of the data
        initial_value: The initial value for the diagonal elements of the covariance
            matrix

    Returns:
        The initial covariance matrices, of shape `(K, d, d)`
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # END YOUR CODE
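Following the broadcasting hint above, a loop-free implementation could be sketched like this (illustrative only; initialize_sigmas_sketch is a hypothetical name):

import numpy as np

def initialize_sigmas_sketch(K: int, d: int, initial_value: float) -> np.ndarray:
    # (K, d, 1) * (1, d, d) broadcasts to (K, d, d): one scaled identity per cluster.
    return np.full((K, d, 1), initial_value) * np.eye(d)[None, :, :]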