Before you start working on the exercise¶
Use Python version 3.7 up to 3.9. Make sure not to use Python 3.10.
It is highly recommended to create a virtual environment for this course. You can find resources on how to create a virtual environment on the ISIS page of the course.
Make sure that no assertions fail or exceptions occur, otherwise points will be subtracted.
Use all the variables given to a function unless explicitly stated otherwise. If you are not using a variable you are doing something wrong.
Read the whole task description before starting with your solution.
After you submit the notebook more tests will be run on your code. The fact that no assertions fail on your computer locally does not guarantee that you completed the exercise correctly.
Please submit only the notebook file with its original name. If you do not submit an ipynb file you will fail the exercise.
Edit only between YOUR CODE HERE and END YOUR CODE.
Verify that no syntax errors are present in the file.
Before uploading your submission, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel\Restart) and then run all cells (in the menubar, select Cell\Run All).
import sys

# Accept Python 3.7, 3.8 and 3.9 only
if (3, 7) <= sys.version_info[:2] <= (3, 9):
    print("Correct Python version")
else:
    print(f"You are using a wrong version of Python: {'.'.join(map(str, sys.version_info[:3]))}")
Exercise Sheet 3: Advanced Numpy¶
In the third exercise sheet we will work on advanced numpy topics and their application to machine learning tasks. You will implement the complete data science pipeline, starting with data loading, plotting and data exploration, and finally implementing a machine learning model and applying it to the data.
For each exercise there will be a maximum number of loops allowed. If your function contains more loops than allowed, you will be notified during the function definition, and the function will automatically fail in the hidden tests.
For technical reasons the following functions are banned throughout the notebook.
np.vectorize
np.fromiter
np.fromfunction
np.apply_along_axis
If you use one of these functions in your submissions it will automatically fail. The use of np.sum is allowed.
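The loop limits and the banned functions are easier to respect once per-row work is expressed with broadcasting. The following is only a small illustrative sketch; the arrays `points` and `center` are made-up example data and are not part of the exercise.

```python
import numpy as np

# Made-up example data (not part of the exercise)
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
center = np.array([0.0, 0.0])

# Broadcasting subtracts `center` from every row at once, so no Python loop
# and no np.apply_along_axis / np.vectorize is needed.
distances = np.sqrt(np.sum((points - center) ** 2, axis=1))
print(distances)
```

The subtraction `points - center` works because a `(3, 2)` array and a `(2,)` array are compatible under numpy's broadcasting rules.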
# EXECUTE the setup cell !
from typing import Dict, List, Tuple, Optional
from unittest import TestCase
t = TestCase()
from minified import max_allowed_loops, no_imports
from IPython.display import Markdown as md
Exercise 1.1: ( 8 points )¶
Read the data from the file data.csv and save it in a dictionary. The letters in data.csv are the assigned labels and their corresponding datapoints. Each datapoint is two-dimensional and consists of the given x- and y-values. Return a dictionary with the letters/labels as keys. The value assigned to each key should be a list of x- and y-values.
Do not forget to cast the values to float.
Number of loops allowed in this exercise: 1
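One possible way to group lines by label with a single loop is sketched below. It assumes that every line has the layout `label,x,y`; the real `data.csv` may be laid out differently, so check the file first. The in-memory `fake_file` and the name `parsed` are stand-ins used for illustration only.

```python
import io

# Assumed layout: "label,x,y" per line (the real data.csv may differ).
fake_file = io.StringIO("A,0.8,0.9\nB,0.9,0.1\nA,0.2,0.3\nC,2,4\n")

parsed = {}
for line in fake_file:
    label, x, y = line.strip().split(",")
    # setdefault creates the list the first time a label is seen
    parsed.setdefault(label, []).append((float(x), float(y)))

print(parsed)
```

Using `setdefault` avoids a separate check for whether the key already exists, which keeps the body of the loop to two lines.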
@no_imports
@max_allowed_loops(1)
def read_from_file(file: str = "data.csv") -> Dict[str, List[Tuple[float, float]]]:
    """
    Opens a csv file and parses it line by line. Each line consists of a label and two
    data dimensions. The function returns a dictionary where each key is a label and
    the value is a list of all the datapoints that have that label. Each datapoint
    is represented by a pair (2-element tuple) of floats.

    Args:
        file (str, optional): The path to the file to open and parse. Defaults to
        "data.csv".

    Returns:
        Dict[str, List[Tuple[float, float]]]: The parsed contents of the csv file
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
tiny_result = read_from_file(file="tiny.csv")
print("tiny_result", tiny_result)
tiny_expected = {"A": [(0.8, 0.9), (0.2, 0.3)], "B": [(0.9, 0.1)], "C": [(2.0, 4.0)]}
t.assertEqual(tiny_result, tiny_expected)

D = read_from_file(file="data.csv")
print(f"Keys of D: {D.keys()}", end="\n\n")
for k, v in D.items():
    print(f"{len(v)} datapoints were assigned the label {k}")

# Test all types
t.assertIsInstance(D, dict)
for d in D:
    t.assertIsInstance(d, str)
    t.assertIsInstance(D[d], list)
    for el in D[d]:
        t.assertIsInstance(el, tuple)
        t.assertIsInstance(el[0], float)
        t.assertIsInstance(el[1], float)

letters = "MNU"
t.assertEqual(set(D.keys()), set(letters))
t.assertTrue(all(len(v) > 99 for v in D.values()))
read_from_file.assert_not_too_many_loops()
read_from_file.assert_no_imports()
Exercise 1.2: ( 5 Pts )¶
Use numpy to stack all of the $N$ datapoints from the dictionary into one matrix $X$, containing the data.
Additionally, create one array $y$ with the corresponding integer labels.
Each datapoint $x_i \in X, \> i = \overline{1..N}$ is of dimension $D=2$. The label assigned to a datapoint has to be a non-negative integer. Every letter-label should map to exactly one integer-label in $y$.
Mapping example: $A \rightarrow 0,\> C \rightarrow 1,\> K \rightarrow 2, …$ (The order of the keys/labels defines the numeric label. The first key is mapped to 0 and so on.)
Dataset $X$: $$\Large X \in \mathbb{R}^{(N, D)}$$
Labels $y$: $$\Large y \in \mathbb{N}^{(N,)} $$
Number of loops allowed in this exercise: 1 (for iterating over the keys of the dictionary)
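A minimal sketch of the stacking idea, using one loop over the dictionary keys, could look as follows. The names `toy`, `blocks`, `labels`, `X_toy` and `y_toy` are illustrative only, and this is just one of several ways to build the two arrays.

```python
import numpy as np

# Toy dictionary in the same shape as the output of Exercise 1.1
toy = {"B": [(0.0, 0.1)], "A": [(0.9, 0.7), (0.8, 0.3)]}

blocks, labels = [], []
for idx, (key, points) in enumerate(toy.items()):
    # One (N_k, 2) float block per key, in dictionary order
    blocks.append(np.array(points, dtype=np.float64))
    # One integer label per datapoint of this key
    labels.append(np.full(len(points), idx, dtype=np.int64))

X_toy = np.concatenate(blocks, axis=0)
y_toy = np.concatenate(labels)
print(X_toy.shape, y_toy)
```

Because `enumerate` follows the insertion order of the dictionary, the first key receives the label 0, the second key the label 1, and so on.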
import numpy as np

@no_imports
@max_allowed_loops(1)
def stack_data(
    D: Dict[str, List[Tuple[float, float]]]
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Convert a dictionary dataset into two arrays of data and labels. The dictionary
    keys represent the labels and the value mapped to each key is a list that
    contains all the datapoints belonging to that label. The output consists of two
    arrays: the first holds the datapoints in a single 2d array, the second is a
    vector of integers with the corresponding label for each datapoint. The order of
    the datapoints is preserved according to the order in the dictionary and the lists.
    The labels are converted from a string to a unique int.
    The datapoints are entered in the same order as the keys in `D`. First
    all the datapoints of the first key are entered, then the second and so on.
    Within one label the order also remains.

    Args:
        D (Dict[str, List[Tuple[float, float]]]): The dictionary that should be stacked.

    Returns:
        Tuple[np.ndarray, np.ndarray]: The two output arrays. The first is a
        float-matrix containing all the datapoints. The second is an int-vector
        containing the labels for each datapoint.
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
    return X, y
tiny_expected_X, tiny_expected_y = (
    np.array(
        [
            [0.0, 0.1],
            [0.9, 0.7],
            [0.8, 0.3],
        ]
    ),
    np.array([0, 1, 1]),
)
tiny_result_X, tiny_result_y = stack_data(
    {"B": [(0.0, 0.1)], "A": [(0.9, 0.7), (0.8, 0.3)]}
)
print(tiny_result_X, tiny_result_y)
np.testing.assert_allclose(tiny_expected_X, tiny_result_X)
np.testing.assert_allclose(tiny_expected_y, tiny_result_y)

X, y = stack_data(D)
print(X.shape, y.shape)
print(X.dtype, y.dtype)
expected_len = sum(len(x) for x in D.values())
print(f"Expected length for X, y: {expected_len}")
t.assertEqual(X.shape, (expected_len, 2))
t.assertEqual(y.shape, (expected_len,))
t.assertEqual(X.dtype, np.float64)
t.assertEqual(y.dtype, np.int64)
t.assertEqual(set(y), set(range(len(D))))
Exercise 1.3: ( 4 Pts )¶
Write a function that returns a list of all $K$ clusters $\mathcal{C}$. A cluster $C_k$ is composed of every datapoint $X_i$ that is assigned the label $k$. There are as many clusters $C_k$ as there are unique labels in $y$.
$$\Large{\mathcal{C} = \{ C_1, C_2, \cdots, C_k \},\quad k = \overline{1..K}}$$
$$\Large C_k \in \mathbb{R}^{(N_k, D)}$$
Number of loops allowed in this exercise: 1
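Selecting all rows that share a label can be done with a boolean mask. The sketch below shows the idea on toy data; the names `X_toy`, `y_toy` and `toy_clusters` are illustrative, and the comprehension is counted here as the one permitted loop, which is an assumption about how loops are counted.

```python
import numpy as np

X_toy = np.array([[0.8, 0.7], [0.0, 0.4], [0.3, 0.1]])
y_toy = np.array([0, 1, 0])

# `y_toy == label` is a boolean vector; indexing with it keeps matching rows.
# np.unique returns the sorted distinct labels.
toy_clusters = [X_toy[y_toy == label] for label in np.unique(y_toy)]

for cluster in toy_clusters:
    print(cluster)
```

Boolean indexing keeps the original row order inside each cluster, so the datapoints of label 0 appear in the order in which they occur in `X_toy`.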
@no_imports
@max_allowed_loops(1)
def get_clusters(X: np.ndarray, y: np.ndarray) -> List[np.ndarray]:
    """
    Receives a labeled dataset and splits the datapoints according to label

    Args:
        X (np.ndarray): The dataset
        y (np.ndarray): The label for each point in the dataset

    Returns:
        List[np.ndarray]: A list of arrays where the elements of each array
        are datapoints belonging to the label at that index.

    Example:
        >>> get_clusters(
                np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]]),
                np.array([0,1,0])
            )
        >>> [array([[0.8, 0.7],[0.3, 0.1]]),
             array([[0. , 0.4]])]
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
tiny_result = get_clusters(
    np.array([[0.8, 0.7], [0.0, 0.4], [0.3, 0.1]]),
    np.array([0, 1, 0]),
)
print(tiny_result)
tiny_expected = [
    np.array([[0.8, 0.7], [0.3, 0.1]]),
    np.array([[0.0, 0.4]]),
]
for r, e in zip(tiny_result, tiny_expected):
    np.testing.assert_allclose(r, e)

clusters = get_clusters(X, y)
# output is list
t.assertIsInstance(clusters, List)
t.assertEqual(len(letters), len(clusters))
# all elements are arrays
for el in clusters:
    t.assertIsInstance(el, np.ndarray)
t.assertEqual(sum(map(len, clusters)), len(X))
Exercise 1.4: ( 8 Pts )¶
Split the data $X$ into training and testing data.
Return a list of clusters for training and a list of clusters for testing.
Utilize the function train_test_idxs from utils to split the data.
The train-test ratio should be 80-20.
Use the function implemented in Exercise 1.3 get_clusters(X,y) to get the clusters.
Remember that when you split the dataset you need to keep the relationship between the data and the labels. Do not split the data and labels independently.
Number of loops allowed in this exercise: 0
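The key point is that the same index array must be applied to both the data and the labels. The sketch below illustrates this with `toy_train_test_idxs`, a hypothetical stand-in written for this example only; the real `train_test_idxs` from `utils` may behave differently, so read its docstring before relying on any detail shown here.

```python
import numpy as np

def toy_train_test_idxs(L, test_ratio, seed=0):
    """Hypothetical stand-in for utils.train_test_idxs (real behaviour may differ)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(L)
    n_test = int(round(L * test_ratio))
    return perm[n_test:], perm[:n_test]  # (train indices, test indices)

# Toy data in which each row encodes its own label, so alignment is easy to check
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_toy = np.column_stack([y_toy + 0.5, y_toy - 0.5])

train_idx, test_idx = toy_train_test_idxs(L=len(X_toy), test_ratio=0.2)

# Index data and labels with the SAME indices to keep them aligned
X_train, y_train = X_toy[train_idx], y_toy[train_idx]
X_test, y_test = X_toy[test_idx], y_toy[test_idx]
print(X_train.shape, X_test.shape)
```

Indexing with an integer array needs no loop, which is compatible with the limit of zero loops in this exercise.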
from utils import train_test_idxs

print("train_test_idxs specification:\n", train_test_idxs.__doc__)
train_indices, test_indices = train_test_idxs(L=20, test_ratio=0.3)
print(f"train_indices = {train_indices}")
print(f"test_indices = {test_indices}")
@no_imports
@max_allowed_loops(0)
def split(X: np.ndarray, y: np.ndarray) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    """
    Split the data into train and test sets. The training and test set are
    clustered by label using `get_clusters`. The size of the training set
    is 80% of the whole dataset

    Args:
        X (np.ndarray): The dataset (2d)
        y (np.ndarray): The label of each datapoint in the dataset `X` (1d)

    Returns:
        Tuple[List[np.ndarray], List[np.ndarray]]: The clustered training and
        test sets
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
    return tr_clusters, te_clusters
output = split(X, y)
tr_clusters, te_clusters = output
t.assertIsInstance(output, Tuple)
t.assertIsInstance(tr_clusters, List)
t.assertIsInstance(te_clusters, List)
t.assertEqual(len(tr_clusters), len(te_clusters))
t.assertEqual(len(tr_clusters), len(letters))
t.assertEqual(len(te_clusters), len(letters))
for el in tr_clusters + te_clusters:
    t.assertIsInstance(el, np.ndarray)

n_in_train = sum(map(len, tr_clusters))
n_in_test = sum(map(len, te_clusters))
t.assertEqual(n_in_train + n_in_test, len(X))
percent_train = n_in_train / len(X)
print(f"percent_train = {percent_train}")
t.assertGreaterEqual(percent_train, 0.79)
t.assertLessEqual(percent_train, 0.81)
Exercise 1.5: (5 Pts )¶
Compute the mean $\mu_k$ of each cluster $C_k$. Return a list of all cluster means $\mu$.
$$\Large{\mu = \{ \mu_1, \mu_2, \cdots, \mu_k \},\quad k = \overline{1..K}}$$
Number of elements in a cluster $k$:
$$\Large{N_k = | C_k |, \quad C_k \in \mathbb{R}^{(N_k, D)}}$$
The $k$-th cluster mean $\mu_k$:
$$\Large{ \mu_k = \frac{1}{N_k}\sum_{x_i \in C_k} x_i }$$
Number of loops allowed in this exercise: 1 (to iterate over the clusters)
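The formula for $\mu_k$ is an average over the rows of one cluster, which corresponds to a mean along axis 0. A small sketch on toy data follows; `toy_clusters` and `toy_means` are illustrative names only.

```python
import numpy as np

toy_clusters = [
    np.array([[0.2, 0.3], [0.1, 0.2]]),
    np.array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]]),
]

# axis=0 averages over the datapoints and keeps the D feature dimensions;
# np.stack then turns the K mean vectors into one (K, D) matrix.
toy_means = np.stack([c.mean(axis=0) for c in toy_clusters])
print(toy_means)
```

The clusters have different sizes, so they cannot be stacked into one array first; this is why one loop over the clusters is needed.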
@no_imports
@max_allowed_loops(1)
def calc_means(clusters: List[np.ndarray]) -> np.ndarray:
    """
    For a collection of clusters calculate the mean for each cluster

    Args:
        clusters (List[np.ndarray]): A list of 2d arrays

    Returns:
        np.ndarray: A matrix where each row represents a mean of a cluster

    Example:
        >>> tiny_clusters = [
                np.array([[0.2, 0.3], [0.1, 0.2]]),
                np.array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]]),
            ]
        >>> calc_means(tiny_clusters)
            array([[0.15, 0.25], [0.7, 0.7]])
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
tiny_clusters = [
    np.array([[0.2, 0.3], [0.1, 0.2]]),
    np.array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]]),
]
tiny_result = calc_means(tiny_clusters)
print(tiny_result, end="\n\n")
tiny_expected = np.array([[0.15, 0.25], [0.7, 0.7]])
np.testing.assert_allclose(tiny_result, tiny_expected)

means = calc_means(tr_clusters)
print(means)
t.assertIsInstance(means, np.ndarray)
t.assertEqual(means.shape, (len(letters), 2))
Exercise 2.1: Scatter plot of clusters ( 15 points )¶
Create a scatter plot of size 8×8.
Plot each datapoint of a cluster $x_{ik} \in C_k$ as dots with an alpha value of 0.6 and a label.
The plot-label should contain both the cluster’s letter-label as well as its integer-label.
Further, plot the cluster’s mean $\mu_k$ as a red cross of size 7. The plot should also have a label for each cluster’s mean, giving information on its exact coordinates.
The title of the plot should be ‘Scatter plot of the clusters’ in fontsize 20.
Label for the scatter plots example: A = 0
Label for the cluster means example (use LaTeX): _$\mu_A:$[1.23 0.56]_
If the mean of each cluster is not provided, use calc_means(clusters) to calculate the means.
Number of loops allowed in this exercise: 1 (for iteration over the clusters)
import matplotlib.pyplot as plt

%matplotlib inline

@no_imports
def plot_scatter_and_mean(
    clusters: List[np.ndarray],
    letters: List[str],
    means: Optional[List[np.ndarray]] = None,
) -> None:
    """
    Create a scatter plot visualizing each cluster and its mean

    Args:
        clusters (List[np.ndarray]): A list containing arrays representing
        each cluster
        letters (List[str]): The "name" of each cluster
        means (Optional[List[np.ndarray]]): The mean of each cluster. If not
        provided the mean of each cluster in `clusters` should be calculated and
        used.
    """
    assert len(letters) == len(clusters)
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE

plot_scatter_and_mean(tr_clusters, letters, means=None)
Exercise 2.2: (15 points)¶
To make it easier to visually analyse the differences between clusters, the data can be projected onto an axis. Plot a histogram for the projection onto the given axis.
The histogram should have 30 bins, be 50% transparent, and be labeled. The area under the histogram should be normalized so that it sums to 1 and represents a proper distribution; this can be done by setting the corresponding parameter. The width of each bar should be 4/5 of the bin width.
Create a plot of size 14×5.
Plot the mean of each cluster as a vertical, dashed, red line.
Label for the histograms example: A
The title of the plot should be dynamic, have a font size of 20 and explain the axis of the projection, e.g. “Projection to axis 0 histogramm plot” or “Projection to axis 1 histogramm plot”, depending on the axis.
Number of loops allowed in this exercise: 1 (to iterate over the clusters)
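To see what the normalization requirement means numerically, the sketch below uses `np.histogram` with `density=True` on made-up data and checks that the bar heights times the bin widths add up to 1. The array `values` is random stand-in data, not the course dataset. In matplotlib, `plt.hist` accepts a `density` parameter for the same normalization and an `rwidth` parameter for the relative bar width.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)  # stand-in for one projected cluster, e.g. cluster[:, axis]

# density=True rescales the counts so that the histogram integrates to 1
heights, edges = np.histogram(values, bins=30, density=True)

bin_width = edges[1] - edges[0]
area = np.sum(heights * np.diff(edges))
bar_width = 0.8 * bin_width  # 4/5 of the bin width

print(len(heights), area, bar_width)
```

Note that with `density=True` it is the area, not the sum of the bar heights, that equals 1.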
@no_imports
def plot_projection(
    clusters: List[np.ndarray], letters: List[str], means: np.ndarray, axis: int = 0
) -> None:
    """
    Plot a histogram of the dimension provided in `axis`

    Args:
        clusters (List[np.ndarray]): The clusters from which to create the histogram
        letters (List[str]): The string representation of each class
        means (np.ndarray): The mean of each class
        axis (int): The axis from which to create the histogram. Defaults to 0.
    """
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
plot_projection(tr_clusters, letters, means, axis=0)
Exercise 3.1: (8 points)¶
Compute the within cluster covariance $S_w$ to further analyse the distribution of the data in the clusters. Sum up the (unnormalized) covariance matrices of all clusters to obtain a single within cluster covariance matrix, as shown in the formula below. Covariance matrices describe the relationship between the x and y dimensions of the data.
$$\boxed{\Large{S_w = \sum_{k=1}^K \sum_{x_i \in C_k} (x_i - \mu_k)^{\top} (x_i - \mu_k), \quad S_w \in \mathbb{R}^{(D, D)}}}$$
Reminder: Data $\mathcal{C}$ is a set of clusters $C_k$, where $K$ is the total number of clusters.
$${\mathcal{C} = \{ C_1, C_2, \cdots, C_k \},\quad k = \overline{1..K}}$$
Number of elements in a cluster $k$: $${N_k = | C_k |, \quad C_k \in \mathbb{R}^{(N_k, D)}}$$
$k$-th cluster mean $\mu_k$: $${ \mu_k = \frac{1}{N_k}\sum_{x_i \in C_k} x_i }$$
Number of loops allowed in this exercise: 1 (to iterate over the clusters)
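The inner sum over the datapoints of one cluster can be written as a single matrix product: if the rows of a matrix are the centered datapoints $x_i - \mu_k$, then its transpose times itself equals the sum of the individual outer products. The sketch below shows this on toy data; `toy_clusters` and `S_toy` are illustrative names.

```python
import numpy as np

toy_clusters = [
    np.array([[0.2, 0.3], [0.1, 0.2]]),
    np.array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]]),
]

d = toy_clusters[0].shape[1]
S_toy = np.zeros((d, d))
for c in toy_clusters:
    centered = c - c.mean(axis=0)     # rows are (x_i - mu_k)
    S_toy += centered.T @ centered    # (D, N_k) @ (N_k, D) -> (D, D)

print(S_toy)
```

Only the loop over the clusters remains; the sum over the datapoints is absorbed into the matrix product.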
@no_imports
@max_allowed_loops(1)
def within_cluster_cov(clusters: List[np.ndarray]) -> np.ndarray:
    """
    Calculate the within class covariance for a collection of clusters

    Args:
        clusters (List[np.ndarray]): A list of clusters each consisting of
        an array of datapoints

    Returns:
        np.ndarray: The within cluster covariance

    Example:
        >>> within_cluster_cov(
                [array([[0.2, 0.3], [0.1, 0.2]]), array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]])]
            )
        >>> array([[0.025, 0.025],
                   [0.025, 0.085]])
    """
    d = clusters[0].shape[1]
    S_w = np.zeros((d, d))
    # YOUR CODE HERE
    raise NotImplementedError("Replace this line with your code")
    # YOUR CODE HERE
tiny_clusters = [
    np.array([[0.2, 0.3], [0.1, 0.2]]),
    np.array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]]),
]
tiny_expected = np.array([[0.025, 0.025], [0.025, 0.085]])
tiny_result = within_cluster_cov(tiny_clusters)
print(tiny_result)
np.testing.assert_allclose(tiny_expected, tiny_result)

S_w = within_cluster_cov(tr_clusters)
print(S_w)
t.assertIsInstance(S_w, np.ndarray)
t.assertEqual(S_w.shape, (2, 2))
# check if symmetric
np.testing.assert_allclose(S_w, S_w.T)
Exercise 3.2: ( 3 + 9 points )¶
To compute the between cluster covariance, the calculation of the mean of means is necessary. In the function calc_mean_of_means(clusters) you must reuse your function calc_means(clusters).
Mean of means: $$\Large{ \mu = \frac{1}{N}\sum_{C_i \in \mathcal{C}}{C_i}},\quad \text{where}\quad N = |\mathcal{C}|$$
The between cluster covariance describes the relation of the datapoints from one cluster to the other. It focuses on the differences rather than the similarities. Use the function calc_mean_of_means(clusters) in the function between_cluster_cov(clusters) to access the mean of means. You only have to implement the given formulas, and do not need to fully understand the underlying concept.
Between cluster covariance: $$\boxed{\Large{S_b = \sum_{k=1}^K N_k (\mu_k - \mu) (\mu_k - \mu)^{\top}}}$$
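The weighted sum of outer products in $S_b$ can also be written as one matrix product. The sketch below assumes that the mean of means is the plain average of the cluster means, which is how the name reads; check this against the formula and the tests before relying on it. All names (`toy_clusters`, `toy_means`, `mu_toy`, `sizes`, `S_b_toy`) are illustrative.

```python
import numpy as np

toy_clusters = [
    np.array([[0.2, 0.3], [0.1, 0.2]]),
    np.array([[0.8, 0.9], [0.7, 0.5], [0.6, 0.7]]),
]

# Per-cluster means (K, D) and cluster sizes (K,)
toy_means = np.stack([c.mean(axis=0) for c in toy_clusters])
sizes = np.array([len(c) for c in toy_clusters])

# Assumption: the mean of means is the unweighted average of the cluster means
mu_toy = toy_means.mean(axis=0)

diffs = toy_means - mu_toy                    # rows are (mu_k - mu)
# Scaling row k by N_k and multiplying gives sum_k N_k * outer(diff_k, diff_k)
S_b_toy = (diffs * sizes[:, None]).T @ diffs

print(S_b_toy)
```

The result is a symmetric $(D, D)$ matrix, just like $S_w$.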