dp-1600-checkpoint
I. Data Cleaning and Preprocessing (for dataset A)
In [1]:
import scipy.io as sio
import pandas as pd
import numpy as np
from scipy import stats
In [2]:
arr = sio.loadmat('DataA.mat')
In [3]:
fea = arr['fea']
In [4]:
fea.shape
Out[4]:
(19000, 81)
In [5]:
fea
Out[5]:
array([[-153.,  414.,  939., ...,  -29.,   36.,   24.],
       [-150.,  420.,  939., ...,  -31.,   47.,    3.],
       [-160.,  432.,  941., ...,  -38.,   20.,    0.],
       ...,
       [  nan,   nan,   nan, ...,   nan,   nan,   nan],
       [  nan,   nan,   nan, ...,   nan,   nan,   nan],
       [  nan,   nan,   nan, ...,   nan,   nan,   nan]])
In [7]:
df = pd.DataFrame(fea);
1. Detect any problems that need to be fixed in dataset A. Report such problems.
In [8]:
df[df.isnull().any(axis=1)].shape
Out[8]:
(19000, 81)
In [9]:
df[df.isnull().all(axis=1)].shape
Out[9]:
(773, 81)
In [31]:
# drop the fully empty rows, fill the remaining NaNs with column means,
# then count how many values look like outliers (|z-score| > 4)
dfAfterRemoveEmptyRows = df.loc[~df.isnull().all(axis=1)]
processed = dfAfterRemoveEmptyRows.fillna(df.mean())
(np.abs(stats.zscore(processed)) > 4).sum()
Out[31]:
2778
The dataset has 19000 rows and 81 columns. Every row has at least one missing value, and 773 rows are missing all 81 values. In addition, after removing the empty rows and mean-filling the remaining gaps, 2778 values lie more than 4 standard deviations from their column mean, so the data also contains outliers.
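A quick way to see where the missing values concentrate (a sketch; it only summarizes what the checks above already found):

df.isnull().sum().sort_values(ascending=False).head(10)   # columns with the most missing values
df.isnull().sum(axis=1).value_counts().head()             # how many missing values rows typically have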
2. Fix the detected problems using some of the methods discussed in class.
I remove the 773 rows whose 81 values are all missing and fill each remaining missing value with the mean of its column.
In [29]:
dfAfterRemoveEmptyRows = df.loc[~df.isnull().all(axis=1)]
processed = dfAfterRemoveEmptyRows.fillna(df.mean())
processed
# rows whose values all lie within 3 standard deviations (outlier-free subset, not kept here)
processed[(np.abs(stats.zscore(processed)) < 3).all(axis=1)]
# number of rows that still contain at least one value with |z-score| > 4
(np.abs(stats.zscore(processed)) > 4).any(axis=1).sum()
Out[29]:
1675
3. Normalize the data using min-max and z-score normalization. Plot histograms of features 9 and 24; compare and comment on the differences before and after normalization. For both features, plot the auto-correlation before and after normalization, and report and discuss the observations.
In [118]:
from sklearn import preprocessing
minMaxScaler = preprocessing.MinMaxScaler()
minMaxScaledDf = pd.DataFrame(minMaxScaler.fit_transform(processed))
zscoreScaler = preprocessing.StandardScaler()
zscoreDf = pd.DataFrame(zscoreScaler.fit_transform(processed))
# equivalent one-liner: zscoreDf = pd.DataFrame(preprocessing.scale(processed))
minMaxScaledDf
Out[118]:
          0         1         2  ...        78        79        80
0  0.557580  0.374778  0.610278  ...  0.487539  0.405682  0.346596
1  0.558230  0.376551  0.610278  ...  0.487403  0.406439  0.344229
2  0.556062  0.380095  0.610668  ...  0.486926  0.404582  0.343891
...
18227 rows × 81 columns (full min-max scaled table truncated; every value lies in [0, 1])
In [120]:
import matplotlib.pyplot as plt

for featureInd in (9, 24):
    processed.iloc[:, featureInd - 1].hist()
    plt.title('Feature %d histogram before normalization' % featureInd)
    plt.show()

    plt.figure()
    minMaxScaledDf.iloc[:, featureInd - 1].hist()
    plt.title('Feature %d histogram after min-max normalization' % featureInd)
    plt.show()

    plt.figure()
    zscoreDf.iloc[:, featureInd - 1].hist()
    plt.title('Feature %d histogram after z-score normalization' % featureInd)
    plt.show()
We can see that min-max normalization rescales each feature into the range [0, 1], while z-score normalization gives each feature a mean of 0 and a standard deviation of 1. The shape of each histogram is unchanged; only the horizontal scale differs.
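As a quick numeric check of these properties (a sketch; it only confirms what the scalers guarantee by construction):

print(minMaxScaledDf.min().min(), minMaxScaledDf.max().max())    # 0.0 and 1.0
print(zscoreDf.mean().abs().max(), zscoreDf.std(ddof=0).mean())  # ~0 and ~1 per column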
In [121]:
from pandas.plotting import autocorrelation_plot

for featureInd in (9, 24):
    autocorrelation_plot(processed.iloc[:, featureInd - 1])
    plt.title('Feature %d auto-correlation plot before normalization' % featureInd)
    plt.show()

    autocorrelation_plot(minMaxScaledDf.iloc[:, featureInd - 1])
    plt.title('Feature %d auto-correlation plot after min-max normalization' % featureInd)
    plt.show()

    autocorrelation_plot(zscoreDf.iloc[:, featureInd - 1])
    plt.title('Feature %d auto-correlation plot after z-score normalization' % featureInd)
    plt.show()
The auto-correlation plots look the same before and after either normalization, because both normalizations are linear transformations of each feature and auto-correlation is invariant to shifting and scaling. Comparing the two features, feature 24 behaves more like random noise than feature 9, since its auto-correlation values stay closer to 0 across the lags.
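The same comparison can be made numerically with the lag-1 auto-correlation (a sketch using pandas' Series.autocorr); running it on the normalized columns would give identical numbers, since auto-correlation is unchanged by linear rescaling:

for featureInd in (9, 24):
    print(featureInd, processed.iloc[:, featureInd - 1].autocorr(lag=1))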
II. Feature Extraction (for dataset B)
1. Use PCA as a dimensionality reduction technique to the data, compute the eigenvectors and eigenvalues.
Read the data
In [122]:
arr = sio.loadmat('DataB.mat')
feaDf = pd.DataFrame(arr[‘fea’]);
gndDf = pd.DataFrame(arr[‘gnd’]);
In [123]:
from sklearn.decomposition import PCA
pca = PCA()
trans = pca.fit(feaDf).transform(feaDf)
trans = pd.DataFrame(trans)
trans
Out[123]:
             0           1           2  ...       781       782       783
0 -1069.166304 -513.973184 -139.243261  ... -0.533611 -0.683888 -0.238770
1 -1099.176077 -570.842223  -67.311779  ...  0.092560 -1.472755  0.647717
2  -673.201385 -167.377150  480.988638  ...  0.563870  0.318705  1.407347
...
2066 rows × 784 columns (full table of PCA-transformed coordinates truncated)
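The eigenvalues and eigenvectors that question 1 asks for are available on the fitted PCA object (standard scikit-learn attributes):

pca.explained_variance_   # eigenvalues of the data covariance matrix, in decreasing order
pca.components_           # the corresponding eigenvectors, one per row (784 entries each)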
2. Plot a 2 dimensional representation of the data points based on the first and second principal components. Explain the results versus the known classes (display data points of each class with a different color).
In [124]:
colors = ['navy', 'turquoise', 'darkorange', 'red', 'green']
classes = [0, 1, 2, 3, 4]
classesStr = ['class ' + str(i) for i in classes]
for color, i, class_name in zip(colors, classes, classesStr):
    t = trans[gndDf.iloc[:, 0] == i]
    plt.scatter(t.iloc[:, 0], t.iloc[:, 1], color=color, alpha=.8,
                label=class_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('2 dimensional representation with first and second principal components')
plt.show()
We can see that the classes have relatively clear boundaries in the plot, because PCA orders its components by the amount of variance they capture and the first two components capture the largest share of it.
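How much of the total variance the first two components retain can be read off the fitted model (a sketch; the exact value depends on DataB):

pca.explained_variance_ratio_[:2].sum()   # fraction of total variance captured by PC1 and PC2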
3. Repeat step 2 for the 5th and 6th components. Comment on the result.
In [125]:
for color, i, class_name in zip(colors, classes, classesStr):
    t = trans[gndDf.iloc[:, 0] == i]
    plt.scatter(t.iloc[:, 4], t.iloc[:, 5], color=color, alpha=.8,
                label=class_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('2 dimensional representation with the 5th and 6th principal components')
plt.show()
Unlike the previous plot based on the first two components, the classes do not separate clearly here: the 5th and 6th components account for much less variance than the first two, so they carry less of the structure that distinguishes the classes.
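The corresponding check for these two components (a sketch):

pca.explained_variance_ratio_[4:6].sum()   # fraction of total variance in PC5 and PC6; compare with the PC1/PC2 value above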
4. Use the Naive Bayes classifier to classify 8 sets of dimensionality reduced data (using the first 2, 4, 10, 30, 60, 200, 500, and all 784 PCA components). Plot the classification error for the 8 sets against the retained variance (rm from lect3:slide22) of each case.
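Here the retained variance for m components is taken to be r_m = (λ_1 + … + λ_m) / (λ_1 + … + λ_784), i.e. the fraction of the total eigenvalue (explained variance) sum kept by the first m components; the sums over pca.explained_variance_ in the loop below compute exactly this ratio.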
In [126]:
from sklearn.naive_bayes import GaussianNB

class_errors = []
retainedVars = []
for n in [2, 4, 10, 30, 60, 200, 500, 784]:
    gnb = GaussianNB()
    y_pred = gnb.fit(trans.iloc[:, :n], gndDf.iloc[:, 0]).predict(trans.iloc[:, :n])
    class_error = sum(y_pred != gndDf.iloc[:, 0]) * 1.0 / y_pred.shape[0]
    class_errors.append(class_error)
    retainedVars.append(sum(pca.explained_variance_[:n]) / sum(pca.explained_variance_))
plt.plot(retainedVars, class_errors)
plt.xlabel('retained variance')
plt.ylabel('classification error')
plt.title('Naive Bayes Classification Error')
plt.show()
5. As the class labels are already known, you can use the Linear Discriminant Analysis (LDA) to reduce the dimensionality, plot the data points using the first 2 LDA components (display data points of each class with a different color). Explain the results obtained in terms of the known classes. Compare with the results obtained by using PCA.
In [127]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
transLda = lda.fit(feaDf, gndDf.iloc[:, 0]).transform(feaDf)
transLda = pd.DataFrame(transLda)

colors = ['navy', 'turquoise', 'darkorange', 'red', 'green']
classes = [0, 1, 2, 3, 4]
classesStr = ['class ' + str(i) for i in classes]
for color, i, class_name in zip(colors, classes, classesStr):
    t = transLda[gndDf.iloc[:, 0] == i]
    plt.scatter(t.iloc[:, 0], t.iloc[:, 1], color=color, alpha=.8,
                label=class_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('2 dimensional representation with the first 2 LDA components')
plt.show()
LDA finds components that maximize the separation between the known classes (between-class scatter relative to within-class scatter), whereas PCA maximizes the variance of the data as a whole without using the class labels. As a result, the class separation in this plot is clearer than the one obtained with PCA.
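One way to quantify the separation visible in the plot (a sketch; it groups the 2-D LDA coordinates by the known labels):

transLda.groupby(gndDf.iloc[:, 0].values).mean()   # centroid of each class in the LDA plane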