F21_APS1070_Project_2
Project 2, APS1070 Fall 2021
Anomaly Detection Algorithm using a Gaussian Mixture Model [13 Marks]
Deadline: OCT 22, 9 PM
Academic Integrity
This project is individual – it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
Please fill out the following:
Name:
Student Number:
Part 1: Getting started [1.5 Marks]
We are going to work with a credit card fraud dataset. This dataset contains 28 key features, which are not
directly interpretable but contain meaningful information about the dataset.
Load the dataset from the CSV file using Pandas. The dataset is called creditcard.csv. Print out the first few columns of the dataset.
How many rows are there? _ [0.1]
What features in the dataset are present aside from the 28 main features? _ [0.1]
Which column contains the targets? [0.1]
To what do the target values correspond?_ [0.1]
In [ ]:
!pip install wget
In [ ]:
import wget
wget.download('https://github.com/aps1070-2019/datasets/raw/master/creditcard.tar.gz', 'creditcard.tar.gz')
In [ ]:
!tar -zxvf creditcard.tar.gz
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
df = pd.read_csv('creditcard.csv')
In [ ]:
### YOUR CODE HERE ###
What is the percentage of entries in the dataset for each class? _ [0.1]
Is this data considered balanced or unbalanced? Why is this the case?_ [0.1]
Why is balance/imbalance important? How might this class distribution affect, for example, a KNN classifier like the one we explored in Project 1? _ [0.2]
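As a hint, here is a minimal sketch of one way to check the class balance; it assumes the target column is named Class, so verify this against your own dataframe.
In [ ]:
# Sketch only: class counts and percentages (assumes the target column is 'Class')
counts = df['Class'].value_counts()
percentages = df['Class'].value_counts(normalize=True) * 100
print(counts)
print(percentages.round(3))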
In [ ]:
### YOUR CODE HERE ###
Next, split the dataset into a training (70%), validation (15%) and testing set (15%). Set the random state to 0. [0.2]
Make sure to separate out the column corresponding to the targets.
In [ ]:
### Split the data ###
X_train, X_val, X_test, y_train, y_val, y_test = 0, 0, 0, 0, 0, 0
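A minimal sketch of one way to obtain the 70/15/15 split with train_test_split, again assuming the target column is named Class; adapt it to your own naming.
In [ ]:
# Sketch only: separate features and target, then split 70/15/15 with random_state=0.
X = df.drop(columns=['Class'])
y = df['Class']

# First hold out 70% for training, then split the remaining 30% in half
# (15% validation, 15% test).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)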
Now, let’s take a look at the difference in distribution for some variables between fraudulent and non-fraudulent transactions:
In [ ]:
import matplotlib.gridspec as gridspec

features = [f for f in df.columns if 'V' in f]
nplots = np.size(features)
plt.figure(figsize=(15, 4*nplots))
gs = gridspec.GridSpec(nplots, 1)
for i, feat in enumerate(features):
    ax = plt.subplot(gs[i])
    sns.histplot(X_train[feat][y_train==1], stat="density", kde=True, color="blue", bins=50)
    sns.histplot(X_train[feat][y_train==0], stat="density", kde=True, color="red", bins=50)
    ax.legend(['fraudulent', 'non-fraudulent'], loc='best')
    ax.set_xlabel('')
    ax.set_title('Distribution of feature: ' + feat)
Explain how these graphs could provide meaningful information about anomaly detection using a Gaussian model. [0.5]
Part 2: Single feature model with one Gaussian distribution: [2.5 Marks]
We’ll start by making a prediction using a single feature of our dataset at a time.
Please note that we only use V features in our model.
a. Fitting regardless of class:
Fit a single Gaussian distribution on a single feature of the full training dataset (both classes) using sklearn.mixture.GaussianMixture when n_components=1.
Compute AUC (Area under the ROC Curve) based on sklearn.mixture.GaussianMixture.score_samples on both the full training set and validation set (including both classes).
Repeat the above steps for each of the features and present your findings in a table.
Find the best 3 features to distinguish fraudulent transactions from non-fraudulent transactions based on the AUC of the validation set. [0.2]
Make a prediction based on the model's scores: if score_samples is lower than a threshold, we consider that transaction a fraud. Find an optimal threshold that maximizes the F1 score on the validation set for each of those 3 features separately. (Do not check every possible value for the threshold; come up with a faster way! A sketch of one possible approach appears after Part 2b's list.) Compute the F1 score using sklearn.metrics.f1_score. [0.5]
Report the complexity of your method (Big O notation) for determining the optimal threshold. [0.5]
b. Fitting based on class:
Pick 3 features that had the best AUC in Part 2a.
Compute AUC and F1 score when you fit a Gaussian only on non-fraudulent transactions (instead of all the transactions).
Compare your results from parts 2a and 2b (AUC and F1 score) in a table. [0.8]
Are these results different or similar? Why? [0.5]
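A minimal sketch of the Part 2 workflow for a single, arbitrarily chosen feature (V1 is only an illustration): fit a one-component Gaussian, score the validation set, compute AUC, and search thresholds efficiently by evaluating only the cut-points produced by precision_recall_curve. This is one possible approach, not the required solution.
In [ ]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score, precision_recall_curve

feat = 'V1'  # illustrative feature only

# Part 2a variant: fit on both classes; for Part 2b, fit on X_train[y_train == 0] instead.
gm = GaussianMixture(n_components=1, random_state=0)
gm.fit(X_train[[feat]])

# score_samples returns the log-likelihood; low likelihood should indicate fraud,
# so the negated score is used as a "fraud score" for AUC.
scores_val = gm.score_samples(X_val[[feat]])
auc_val = roc_auc_score(y_val, -scores_val)

# Threshold search: precision_recall_curve evaluates every distinct cut-point
# in a single sorted pass rather than trying arbitrary values one by one.
prec, rec, thr = precision_recall_curve(y_val, -scores_val)
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
best = f1.argmax()
best_threshold = -thr[best]  # back in score_samples units: flag fraud when score_samples <= best_threshold

print(f"val AUC={auc_val:.3f}, best val F1={f1[best]:.3f}, threshold={best_threshold:.3f}")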
In [ ]:
### YOUR CODE HERE ###
Part 3: Multiple feature model with one Gaussian distribution: [1 Mark]
This part is similar to Part 2, but we will pick multiple features and visually set the number of components.
a. 2D plot:
1. Pick two features (say, f1 and f2).
2. Scatter plot (plt.scatter) those features on a figure (f1 on the x-axis and f2 on the y-axis).
3. Color the data points based on their class (non-fraudulent blue and fraudulent red).
4. Based on your plot, decide how many Gaussian components (n_components) you need to fit the data (focus on valid transactions). Explain. [0.25]
5. Fit your Gaussian model on all the data points.
6. Compute AUC on both the training and validation sets.
7. Pick 3 new pairs of features and repeat steps 2 to 6. [0.25]
8. For each pair, find a threshold to maximize your validation set F1 score. [0.25]
9. For each pair, plot a figure similar to step 3 and put a circle around outliers based on your threshold (use the code of the similar figure in the tutorial). [0.25]
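A minimal sketch of the steps above for one illustrative pair (V10 and V14 are arbitrary choices, and n_components=2 is only a placeholder for whatever you decide from the plot):
In [ ]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score

pair = ['V10', 'V14']  # illustrative pair only

# Steps 2-3: colored scatter plot of the two classes.
plt.figure(figsize=(6, 6))
plt.scatter(X_train[pair[0]][y_train == 0], X_train[pair[1]][y_train == 0],
            s=2, c='blue', label='non-fraudulent')
plt.scatter(X_train[pair[0]][y_train == 1], X_train[pair[1]][y_train == 1],
            s=2, c='red', label='fraudulent')
plt.xlabel(pair[0]); plt.ylabel(pair[1]); plt.legend(); plt.show()

# Steps 5-6: fit on all points with the visually chosen number of components,
# then compute AUC from the negated log-likelihood as in Part 2.
gm = GaussianMixture(n_components=2, random_state=0).fit(X_train[pair])
print('train AUC:', roc_auc_score(y_train, -gm.score_samples(X_train[pair])))
print('val AUC:  ', roc_auc_score(y_val, -gm.score_samples(X_val[pair])))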
In [ ]:
### YOUR CODE HERE ###
Part 4: Single feature model with two Gaussian distributions. [2 Marks]
Now we will use two different distributions for fraudulent and non-fraudulent transactions.
Fit a Gaussian distribution ($G_1$) on a feature of non-fraudulent transactions using sklearn.mixture.GaussianMixture when n_components=1. [0.25]
Fit another Gaussian distribution ($G_2$) on the same feature but for fraudulent transactions using sklearn.mixture.GaussianMixture when n_components=1. [0.25]
Compute the score samples ($S$) for both $G_1$ and $G_2$ on the validation set to get $S_1$ and $S_2$, respectively. [0.25]
Find an optimal $c$ (a real number) that maximizes the validation set F1 score for a model such that if $S_1 < c \times S_2$, the transaction is classified as a fraud. For example, if $c=1$ we could say that if $S_2$ is greater than $S_1$ ($S_1 < S_2$), then the transaction is a fraud (the transaction belongs to the $G_2$ distribution, which represents fraudulent transactions). To start, consider $c$ in $[0,10]$ with steps of 0.1; you can change this window in your experiments if needed. [0.25]
Repeat the steps above for all the features. What is the best F1 Score that you get for training and validation? Which feature and what c? [1]
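A minimal sketch of this two-Gaussian rule for one illustrative feature (V14 is an arbitrary choice) with a coarse grid over c:
In [ ]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import f1_score

feat = 'V14'  # illustrative feature only

# G1 on non-fraudulent transactions, G2 on fraudulent transactions.
g1 = GaussianMixture(n_components=1, random_state=0).fit(X_train.loc[y_train == 0, [feat]])
g2 = GaussianMixture(n_components=1, random_state=0).fit(X_train.loc[y_train == 1, [feat]])

s1 = g1.score_samples(X_val[[feat]])
s2 = g2.score_samples(X_val[[feat]])

best_c, best_f1 = None, -1.0
for c in np.arange(0, 10.1, 0.1):          # c in [0, 10] with steps of 0.1
    pred = (s1 < c * s2).astype(int)       # classify as fraud when S1 < c * S2
    score = f1_score(y_val, pred)
    if score > best_f1:
        best_c, best_f1 = c, score
print(f"best c = {best_c:.1f}, validation F1 = {best_f1:.3f}")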
In [ ]:
### YOUR CODE HERE ###
Part 5: Multivariate and Mixture of Gaussians Distribution [4 Marks]
We now want to build an outlier detection model that performs well in terms of F1 score. To design your model, you can benefit from:
No restrictions on the number of features - use as few or as many as you want! (multivariate).
To fit your model, you can take advantage of the Gaussian mixture model, where you can set the number of components (see the sklearn.mixture.GaussianMixture documentation for help).
You can choose to fit your Gaussians on non-fraudulent transactions or to both classes.
It is up to you how to design your model. Try at least 10 different models and report the AUC for both training and validation sets (if applicable) and the best F1 score for both training and validation sets for each model. What kind of model works better? How many features are best (and which ones)? How many Gaussians? How many components? Summarize your findings with tables or plots. [4]
HINT !
You might want to try a two-Gaussian model with multiple features, a single component for the valid transactions, and multiple components for the fraudulent ones! Why does it make sense to have multiple components for fraudulent transactions?
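A minimal sketch of one candidate design along the lines of the hint; the feature subset and component counts below are illustrative assumptions, not the answer.
In [ ]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import f1_score

feats = ['V10', 'V12', 'V14', 'V17']  # hypothetical feature subset

# One component for valid transactions, several for fraudulent ones.
g_valid = GaussianMixture(n_components=1, random_state=0).fit(X_train.loc[y_train == 0, feats])
g_fraud = GaussianMixture(n_components=3, random_state=0).fit(X_train.loc[y_train == 1, feats])

s_valid = g_valid.score_samples(X_val[feats])
s_fraud = g_fraud.score_samples(X_val[feats])

# Reuse the Part 4 rule: flag fraud when the valid-transaction likelihood is
# low relative to the fraud likelihood.
best_c, best_f1 = None, -1.0
for c in np.arange(0, 10.1, 0.1):
    score = f1_score(y_val, (s_valid < c * s_fraud).astype(int))
    if score > best_f1:
        best_c, best_f1 = c, score
print(f"best c = {best_c:.1f}, validation F1 = {best_f1:.3f}")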
In [ ]:
### YOUR CODE HERE ###
Part 6: Evaluating performance on test set: [1 Mark]
Which model worked best? Pick your best model among all the models and apply it to your test set. Report the F1 score, precision, and recall on the test set. [1]
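A minimal sketch of the final evaluation, assuming (purely for illustration) that the best model is the g_valid/g_fraud pair with threshold best_c and feature list feats from the Part 5 sketch; substitute your own model's predictions.
In [ ]:
from sklearn.metrics import f1_score, precision_score, recall_score

# Predictions of the (assumed) best model on the held-out test set.
y_test_pred = (g_valid.score_samples(X_test[feats]) <
               best_c * g_fraud.score_samples(X_test[feats])).astype(int)

print('F1:       ', f1_score(y_test, y_test_pred))
print('Precision:', precision_score(y_test, y_test_pred))
print('Recall:   ', recall_score(y_test, y_test_pred))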
In [ ]:
### YOUR CODE HERE ###
Part 7: Is Gaussian the only useful distribution? [1 Mark]
Search for other distributions that could be used to model the data. How popular are they? Is there a specific situation where a distribution works better? How can we find a suitable distribution to model our data? Do not forget to include your references.