dp-1600

I. Data Cleaning and Preprocessing (for dataset A)

In [1]:

import scipy.io as sio
import pandas as pd
import numpy as np
from scipy import stats

In [2]:

arr = sio.loadmat('DataA.mat')

In [3]:

fea = arr['fea']

In [4]:

fea.shape

Out[4]:

(19000, 81)

In [5]:

fea

Out[5]:

array([[-153., 414., 939., …, -29., 36., 24.],
[-150., 420., 939., …, -31., 47., 3.],
[-160., 432., 941., …, -38., 20., 0.],
…,
[ nan, nan, nan, …, nan, nan, nan],
[ nan, nan, nan, …, nan, nan, nan],
[ nan, nan, nan, …, nan, nan, nan]])

In [6]:

df = pd.DataFrame(fea);

1. Detect any problems that need to be fixed in dataset A. Report such problems.

In [7]:

df[df.isnull().any(axis=1)].shape

Out[7]:

(19000, 81)

In [8]:

df[df.isnull().all(axis=1)].shape

Out[8]:

(773, 81)

In [9]:

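# Drop the rows in which every value is missing, then mean-impute the remaining gaps.
dt1 = df.loc[~df.isnull().all(axis=1)]
dt2 = dt1.fillna(dt1.mean())
# Count the values lying more than 4 standard deviations from their column mean.
(np.abs(stats.zscore(dt2.to_numpy())) > 4).sum()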

Out[9]:

2778

The dataset has 19000 rows and 81 columns. Every row has at least one missing value, and 773 rows are missing all 81 values. After dropping those rows and mean-imputing the remaining gaps, 2778 values lie more than 4 standard deviations from their column mean and can be considered outliers.
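The checks above can be collected into one helper. This is a minimal sketch rather than the notebook's own code: the function name `summarize_problems`, the argument `z_thresh`, and the toy data are all assumptions made for illustration. The z-scores use the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd


def summarize_problems(df, z_thresh=4.0):
    """Count missing-value and outlier problems in a numeric DataFrame.

    Hypothetical helper: mirrors the steps in the cells above.
    """
    is_missing = df.isnull()
    rows_any_missing = int(is_missing.any(axis=1).sum())
    rows_all_missing = int(is_missing.all(axis=1).sum())

    # Drop rows that carry no information, then mean-impute what is left
    # so that every cell has a z-score.
    kept = df.loc[~is_missing.all(axis=1)]
    filled = kept.fillna(kept.mean())

    # Column-wise z-scores with the population standard deviation.
    z = (filled - filled.mean()) / filled.std(ddof=0)
    outliers = int((z.abs() > z_thresh).sum().sum())

    return {
        "rows_any_missing": rows_any_missing,
        "rows_all_missing": rows_all_missing,
        "outliers": outliers,
    }


# Toy data (assumed): 500 rows of noise with one planted outlier,
# one isolated gap, and two rows that are missing entirely.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(500, 3)))
toy.iloc[10, 1] = 50.0
toy.iloc[20, 0] = np.nan
toy.iloc[-2:, :] = np.nan

summary = summarize_problems(toy)
print(summary)
```

Reporting the all-missing rows separately from the rows with any gap matters here, because the two cases call for different fixes: the former can only be dropped, while the latter can be imputed.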

2. Fix the detected problems using some of the methods discussed in class.
I remove the 773 rows whose 81 values are all missing and fill the other missing values with the mean of their column. Outliers are then replaced with the column mean as well, computed without the outliers.

In [10]:

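# Drop the rows in which every value is missing, then mean-impute the remaining gaps.
dfAfterRemoveEmptyRows = df.loc[~df.isnull().all(axis=1)]
processed = dfAfterRemoveEmptyRows.fillna(df.mean())

# Flag the values lying more than 4 standard deviations from their column mean.
outliers = np.abs(stats.zscore(processed.to_numpy())) > 4

# Blank out the flagged values, then fill them with the mean of the remaining values.
processed = processed.mask(outliers)
processed = processed.fillna(processed.mean())
processed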

Out[10]:

0 1 2 3 4 5 6 7 8 9 … 71 72 73 74 75 76 77 78 79 80
0 -153.000000 414.000000 939.000000 -161.000000 1007.000000 99.000000 -210.000000 948.000000 333.000000 -19.000000 … 655.0 -316.0 -302.0 -617.0 -955.0 -264.0 23.0 -29.0 36.0 24.0
1 -150.000000 420.000000 939.000000 -177.000000 1008.000000 103.000000 -207.000000 939.000000 316.000000 9.000000 … 655.0 -309.0 -304.0 -619.0 -955.0 -265.0 19.0 -31.0 47.0 3.0
2 -160.000000 432.000000 941.000000 -162.000000 982.000000 98.000000 -198.000000 936.000000 315.000000 -10.000000 … 655.0 -302.0 -308.0 -621.0 -966.0 -270.0 10.0 -38.0 20.0 0.0
3 -171.000000 432.000000 911.000000 -174.000000 999.000000 115.000000 -187.000000 918.000000 338.000000 34.000000 … 655.0 -293.0 -312.0 -622.0 -964.0 -269.0 14.0 -51.0 33.0 -1.0
4 -171.000000 698.264485 929.000000 -189.000000 1004.000000 104.000000 -198.000000 939.000000 350.000000 60.000000 … 655.0 -284.0 -318.0 -624.0 -966.0 -262.0 24.0 -40.0 1.0 4.0
5 -171.000000 432.000000 924.000000 -179.000000 1011.000000 85.000000 -204.000000 945.000000 336.000000 94.000000 … 655.0 -274.0 -323.0 -626.0 -969.0 -267.0 27.0 -36.0 32.0 9.0
6 -169.000000 429.000000 949.000000 -175.000000 1007.000000 102.000000 -188.000000 914.000000 322.000000 154.000000 … 655.0 -263.0 -331.0 -627.0 -975.0 -273.0 17.0 -27.0 28.0 3.0
7 -160.000000 423.000000 927.000000 -195.000000 996.000000 123.000000 -213.000000 925.000000 302.000000 128.000000 … 655.0 -251.0 -337.0 -628.0 -955.0 -275.0 8.0 -40.0 22.0 32.0
8 -163.000000 432.000000 929.000000 -178.000000 994.000000 101.000000 -186.000000 946.000000 296.000000 166.000000 … 654.0 -239.0 -343.0 -630.0 -967.0 -267.0 15.0 -34.0 -7.0 15.0
9 -156.000000 415.000000 936.000000 -186.000000 1014.000000 111.000000 -195.000000 960.000000 280.000000 202.000000 … 653.0 -228.0 -351.0 -631.0 -964.0 -264.0 7.0 -29.0 6.0 15.0
10 -153.000000 413.000000 923.000000 -187.000000 993.000000 91.000000 -193.000000 970.000000 282.000000 233.000000 … 651.0 -215.0 -357.0 -634.0 -953.0 -261.0 19.0 -32.0 14.0 46.0
11 -168.000000 412.000000 904.000000 -194.000000 989.000000 115.000000 -198.000000 960.000000 269.000000 267.000000 … 647.0 -203.0 -363.0 -639.0 -953.0 -270.0 22.0 -51.0 10.0 58.0
12 -166.000000 442.000000 926.000000 -191.000000 1001.000000 114.000000 -199.000000 950.000000 296.000000 360.000000 … 642.0 -192.0 -370.0 -643.0 -963.0 -275.0 15.0 -47.0 -12.0 91.0
13 -162.000000 447.000000 920.000000 -218.000000 1000.000000 110.000000 -235.000000 948.000000 256.000000 339.000000 … 635.0 -181.0 -375.0 -651.0 -957.0 -270.0 15.0 -38.0 3.0 101.0
14 -184.000000 442.000000 941.000000 -237.000000 992.000000 144.000000 -238.000000 940.000000 244.000000 390.000000 … 626.0 -172.0 -379.0 -660.0 -940.0 -262.0 20.0 -72.0 1.0 112.0
15 -157.000000 427.000000 925.000000 -245.000000 986.000000 127.000000 -228.000000 932.000000 217.000000 410.000000 … 615.0 -162.0 -384.0 -669.0 -950.0 -259.0 7.0 -75.0 30.0 144.0
16 -158.000000 427.000000 905.000000 -218.000000 990.000000 111.000000 -216.000000 984.000000 285.000000 454.000000 … 603.0 -150.0 -389.0 -680.0 -964.0 -241.0 5.0 -109.0 86.0 157.0
17 -153.000000 451.000000 889.000000 -260.000000 967.000000 112.000000 -213.000000 1001.000000 253.000000 513.000000 … 590.0 -140.0 -395.0 -690.0 -969.0 -243.0 13.0 -91.0 77.0 164.0
18 -150.000000 443.000000 928.000000 -243.000000 968.000000 130.000000 -242.000000 959.000000 309.000000 586.000000 … 575.0 -130.0 -403.0 -700.0 -979.0 -238.0 17.0 -76.0 80.0 90.0
19 -151.000000 442.000000 930.000000 -260.000000 991.000000 92.000000 -248.000000 969.000000 304.000000 563.000000 … 561.0 -118.0 -410.0 -710.0 -975.0 -234.0 15.0 -59.0 93.0 86.0
20 -153.000000 462.000000 940.000000 -253.000000 977.000000 135.000000 -242.000000 979.000000 263.000000 609.000000 … 546.0 -106.0 -413.0 -721.0 -964.0 -234.0 0.0 -38.0 100.0 68.0
21 -170.000000 459.000000 927.000000 -247.000000 959.000000 133.000000 -245.000000 935.000000 252.000000 560.000000 … 531.0 -95.0 -419.0 -730.0 -962.0 -230.0 -23.0 -34.0 134.0 74.0
22 -148.000000 453.000000 923.000000 -274.000000 992.000000 142.000000 -258.000000 952.000000 229.000000 652.000000 … 514.0 -84.0 -426.0 -740.0 -954.0 -218.0 -44.0 -36.0 178.0 67.0
23 -154.000000 449.000000 928.000000 -274.000000 948.000000 115.000000 -261.000000 952.000000 236.000000 740.000000 … 497.0 -71.0 -434.0 -748.0 -951.0 -214.0 -44.0 -70.0 225.0 96.0
24 -165.000000 462.000000 918.000000 -302.000000 974.000000 129.000000 -266.000000 945.000000 232.000000 607.000000 … 481.0 -53.0 -445.0 -753.0 -982.0 -211.0 -10.0 -42.0 248.0 55.0
25 -187.000000 479.000000 926.000000 -304.000000 966.000000 122.000000 -277.000000 927.000000 228.000000 646.000000 … 465.0 -35.0 -451.0 -761.0 -1005.0 -230.0 5.0 -74.0 161.0 -1.0
26 -169.000000 473.000000 947.000000 -305.000000 965.000000 123.000000 -283.000000 942.000000 235.000000 668.000000 … 451.0 -17.0 -455.0 -768.0 -992.0 -267.0 -18.0 -7.0 122.0 -8.0
27 -172.000000 482.000000 941.000000 -314.000000 946.000000 163.000000 -264.000000 907.000000 242.000000 764.000000 … 437.0 2.0 -460.0 -773.0 -999.0 -251.0 -24.0 35.0 101.0 -8.0
28 -203.000000 486.000000 917.000000 -314.000000 972.000000 145.000000 -284.000000 932.000000 244.000000 637.000000 … 424.0 19.0 -464.0 -777.0 -1007.0 -235.0 -32.0 69.0 61.0 -43.0
29 -200.000000 491.000000 898.000000 -312.000000 978.000000 123.000000 -276.000000 950.000000 227.000000 661.000000 … 411.0 41.0 -476.0 -776.0 -959.0 -221.0 -41.0 -16.0 14.0 -19.0
… … … … … … … … … … … … … … … … … … … … … …
18197 -132.812384 698.264485 597.541402 -102.000000 967.000000 73.000000 5.000000 1111.000000 -167.000000 150.000000 … -167.0 -611.0 78.0 770.0 -1001.0 -42.0 30.0 93.0 30.0 -1.0
18198 -132.812384 698.264485 597.541402 -106.000000 1130.000000 85.000000 0.000000 901.005394 0.000000 131.000000 … -166.0 -610.0 79.0 770.0 -1016.0 -43.0 24.0 55.0 51.0 -23.0
18199 -132.812384 698.264485 597.541402 0.000000 911.128977 0.000000 61.974363 899.313498 81.650478 142.000000 … -167.0 -611.0 80.0 770.0 -1005.0 -51.0 31.0 -20.0 30.0 -18.0
18200 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 127.000000 … -166.0 -611.0 79.0 770.0 -1000.0 -59.0 38.0 -37.0 24.0 7.0
18201 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 153.000000 … -165.0 -611.0 79.0 770.0 -1002.0 -51.0 25.0 -26.0 -3.0 19.0
18202 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 0.000000 … -164.0 -611.0 79.0 770.0 -996.0 -49.0 11.0 -25.0 -18.0 13.0
18203 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -162.0 -611.0 80.0 770.0 -994.0 -57.0 8.0 -40.0 -14.0 43.0
18204 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -161.0 -612.0 80.0 770.0 -996.0 -57.0 12.0 22.0 -18.0 61.0
18205 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -159.0 -612.0 80.0 771.0 -1001.0 -38.0 14.0 41.0 8.0 105.0
18206 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -612.0 80.0 771.0 -1002.0 -31.0 24.0 -21.0 -20.0 107.0
18207 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -612.0 79.0 771.0 -999.0 -39.0 13.0 -40.0 -22.0 84.0
18208 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -611.0 79.0 772.0 -993.0 -42.0 12.0 -23.0 -8.0 91.0
18209 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -611.0 79.0 772.0 -994.0 -43.0 13.0 -16.0 -22.0 40.0
18210 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -611.0 78.0 772.0 -989.0 -40.0 12.0 -1.0 -24.0 31.0
18211 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -611.0 78.0 772.0 -998.0 -33.0 1.0 7.0 5.0 31.0
18212 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -157.0 -611.0 78.0 772.0 -1012.0 -31.0 7.0 37.0 30.0 25.0
18213 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -157.0 -610.0 79.0 772.0 -1013.0 -19.0 25.0 59.0 15.0 28.0
18214 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -610.0 80.0 772.0 -998.0 -21.0 26.0 7.0 -18.0 60.0
18215 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -610.0 79.0 772.0 -994.0 -26.0 21.0 -40.0 -4.0 44.0
18216 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -159.0 -611.0 78.0 772.0 -1002.0 -28.0 5.0 -63.0 -43.0 74.0
18217 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -159.0 -611.0 77.0 772.0 -1004.0 -28.0 11.0 -63.0 -14.0 77.0
18218 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -610.0 77.0 772.0 -999.0 -36.0 17.0 -31.0 -18.0 68.0
18219 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -158.0 -610.0 77.0 772.0 -997.0 -28.0 17.0 -27.0 -48.0 64.0
18220 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -157.0 -611.0 78.0 772.0 -997.0 -26.0 7.0 -16.0 -37.0 90.0
18221 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -157.0 -611.0 78.0 772.0 -999.0 -16.0 12.0 -25.0 -20.0 56.0
18222 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -157.0 -611.0 78.0 772.0 -997.0 -27.0 17.0 -50.0 -15.0 32.0
18223 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -156.0 -611.0 78.0 772.0 -991.0 -33.0 19.0 -21.0 -2.0 66.0
18224 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -156.0 -611.0 77.0 772.0 -993.0 -29.0 17.0 -46.0 7.0 42.0
18225 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -156.0 -611.0 75.0 772.0 -1007.0 -26.0 12.0 -48.0 -6.0 43.0
18226 -132.812384 698.264485 597.541402 -307.128462 909.548077 -32.760824 61.974363 899.313498 81.650478 356.638752 … -157.0 -611.0 74.0 772.0 -1005.0 -26.0 16.0 -33.0 -1.0 39.0

18227 rows × 81 columns
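To see the effect of this cleaning step in isolation, the same three stages can be run on a small synthetic frame. This is only a sketch under stated assumptions: the function name `clean`, the `z_thresh` argument, and the toy data are hypothetical and are not part of dataset A.

```python
import numpy as np
import pandas as pd


def clean(df, z_thresh=4.0):
    """Drop all-missing rows, mean-impute gaps, then mean-impute outliers."""
    kept = df.loc[~df.isnull().all(axis=1)]
    filled = kept.fillna(kept.mean())

    # Column-wise z-scores with the population standard deviation (ddof=0).
    z = (filled - filled.mean()) / filled.std(ddof=0)

    # mask() turns flagged cells into NaN; fillna() then writes the mean of
    # the remaining (non-outlier) values of each column into them.
    masked = filled.mask(z.abs() > z_thresh)
    return masked.fillna(masked.mean())


# Toy data (assumed): noise with one planted outlier, one gap,
# and two rows that are missing entirely.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(500, 3)))
toy.iloc[10, 1] = 50.0
toy.iloc[20, 0] = np.nan
toy.iloc[-2:, :] = np.nan

cleaned = clean(toy)
print(cleaned.shape)         # two rows fewer than the input
print(cleaned.iloc[10, 1])   # the planted 50.0 is now close to the column mean
```

Note that the replacement mean is computed after the outliers are blanked out, so the extreme values do not pull the imputed value towards themselves.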

3. Normalize the data using min-max and z-score normalization. Plot histograms of features 9 and 24; compare and comment on the differences before and after normalization. For both features, plot the auto-correlation before and after normalization, and report and discuss observations.

In [11]:

from sklearn import preprocessing

# Min-max normalization: rescale every column to the range [0, 1].
minMaxScaler = preprocessing.MinMaxScaler()
minMaxScaledDf = pd.DataFrame(minMaxScaler.fit_transform(processed))

# z-score normalization: zero mean and unit standard deviation per column.
zscoreScaler = preprocessing.StandardScaler()
zscoreDf = pd.DataFrame(zscoreScaler.fit_transform(processed))

minMaxScaledDf

Out[11]:

0 1 2 3 4 5 6 7 8 9 … 71 72 73 74 75 76 77 78 79 80
0 0.490170 0.392157 0.610702 0.614894 0.568449 0.535752 0.406460 0.531290 0.665351 0.362937 … 0.845426 0.329853 0.306536 0.185073 0.455176 0.399637 0.456195 0.500889 0.507312 0.500826
1 0.491510 0.394221 0.610702 0.603546 0.569100 0.537855 0.407718 0.525543 0.657895 0.373702 … 0.845426 0.334152 0.305229 0.184065 0.455176 0.399274 0.454121 0.500566 0.510237 0.495047
2 0.487042 0.398349 0.611343 0.614184 0.552151 0.535226 0.411493 0.523627 0.657456 0.366398 … 0.845426 0.338452 0.302614 0.183056 0.448814 0.397459 0.449456 0.499434 0.503058 0.494221
3 0.482127 0.398349 0.601730 0.605674 0.563233 0.544164 0.416107 0.512133 0.667544 0.383314 … 0.845426 0.343980 0.300000 0.182552 0.449971 0.397822 0.451529 0.497333 0.506514 0.493946
4 0.482127 0.489943 0.607498 0.595035 0.566493 0.538381 0.411493 0.525543 0.672807 0.393310 … 0.845426 0.349509 0.296078 0.181543 0.448814 0.400363 0.456713 0.499111 0.498006 0.495322
5 0.482127 0.398349 0.605896 0.602128 0.571056 0.528391 0.408977 0.529374 0.666667 0.406382 … 0.845426 0.355651 0.292810 0.180535 0.447079 0.398548 0.458269 0.499758 0.506248 0.496698
6 0.483021 0.397317 0.613906 0.604965 0.568449 0.537329 0.415688 0.509579 0.660526 0.429450 … 0.845426 0.362408 0.287582 0.180030 0.443609 0.396370 0.453084 0.501212 0.505185 0.495047
7 0.487042 0.395253 0.606857 0.590780 0.561278 0.548370 0.405201 0.516603 0.651754 0.419454 … 0.845426 0.369779 0.283660 0.179526 0.455176 0.395644 0.448419 0.499111 0.503589 0.503027
8 0.485702 0.398349 0.607498 0.602837 0.559974 0.536803 0.416527 0.530013 0.649123 0.434064 … 0.844900 0.377150 0.279739 0.178517 0.448236 0.398548 0.452048 0.500081 0.495879 0.498349
9 0.488829 0.392501 0.609740 0.597163 0.573012 0.542061 0.412752 0.538953 0.642105 0.447905 … 0.844374 0.383907 0.274510 0.178013 0.449971 0.399637 0.447900 0.500889 0.499335 0.498349
10 0.490170 0.391813 0.605575 0.596454 0.559322 0.531546 0.413591 0.545338 0.642982 0.459823 … 0.843323 0.391892 0.270588 0.176500 0.456333 0.400726 0.454121 0.500404 0.501462 0.506879
11 0.483467 0.391469 0.599487 0.591489 0.556714 0.544164 0.411493 0.538953 0.637281 0.472895 … 0.841220 0.399263 0.266667 0.173979 0.456333 0.397459 0.455677 0.497333 0.500399 0.510182
12 0.484361 0.401789 0.606536 0.593617 0.564537 0.543638 0.411074 0.532567 0.649123 0.508651 … 0.838591 0.406020 0.262092 0.171962 0.450549 0.395644 0.452048 0.497980 0.494549 0.519263
13 0.486148 0.403509 0.604614 0.574468 0.563885 0.541535 0.395973 0.531290 0.631579 0.500577 … 0.834911 0.412776 0.258824 0.167927 0.454020 0.397459 0.452048 0.499434 0.498538 0.522014
14 0.476318 0.401789 0.611343 0.560993 0.558670 0.559411 0.394715 0.526181 0.626316 0.520185 … 0.830179 0.418305 0.256209 0.163389 0.463852 0.400363 0.454640 0.493939 0.498006 0.525041
15 0.488382 0.396629 0.606216 0.555319 0.554759 0.550473 0.398909 0.521073 0.614474 0.527874 … 0.824395 0.424447 0.252941 0.158850 0.458068 0.401452 0.447900 0.493454 0.505717 0.533847
16 0.487936 0.396629 0.599808 0.574468 0.557366 0.542061 0.403943 0.554278 0.644298 0.544790 … 0.818086 0.431818 0.249673 0.153303 0.449971 0.407985 0.446864 0.487959 0.520606 0.537424
17 0.490170 0.404885 0.594681 0.544681 0.542373 0.542587 0.405201 0.565134 0.630263 0.567474 … 0.811251 0.437961 0.245752 0.148260 0.447079 0.407260 0.451011 0.490868 0.518213 0.539351
18 0.491510 0.402133 0.607177 0.556738 0.543025 0.552050 0.393037 0.538314 0.654825 0.595540 … 0.803365 0.444103 0.240523 0.143217 0.441296 0.409074 0.453084 0.493292 0.519011 0.518987
19 0.491063 0.401789 0.607818 0.544681 0.558018 0.532072 0.390520 0.544700 0.652632 0.586697 … 0.796004 0.451474 0.235948 0.138174 0.443609 0.410526 0.452048 0.496040 0.522467 0.517887
20 0.490170 0.408669 0.611022 0.549645 0.548892 0.554679 0.393037 0.551086 0.634649 0.604383 … 0.788118 0.458845 0.233987 0.132627 0.449971 0.410526 0.444272 0.499434 0.524329 0.512933
21 0.482574 0.407637 0.606857 0.553901 0.537158 0.553628 0.391779 0.522989 0.629825 0.585544 … 0.780231 0.465602 0.230065 0.128089 0.451128 0.411978 0.432348 0.500081 0.533369 0.514584
22 0.492404 0.405573 0.605575 0.534752 0.558670 0.558360 0.386326 0.533844 0.619737 0.620915 … 0.771293 0.472359 0.225490 0.123046 0.455755 0.416334 0.421462 0.499758 0.545068 0.512658
23 0.489723 0.404197 0.607177 0.534752 0.529987 0.544164 0.385067 0.533844 0.622807 0.654748 … 0.762355 0.480344 0.220261 0.119012 0.457490 0.417786 0.421462 0.494262 0.557564 0.520638
24 0.484808 0.408669 0.603973 0.514894 0.546936 0.551525 0.382970 0.529374 0.621053 0.603614 … 0.753943 0.491400 0.213072 0.116490 0.439560 0.418875 0.439088 0.498788 0.563680 0.509356
25 0.474978 0.414517 0.606536 0.513475 0.541721 0.547844 0.378356 0.517880 0.619298 0.618608 … 0.745531 0.502457 0.209150 0.112456 0.426258 0.411978 0.446864 0.493616 0.540548 0.493946
26 0.483021 0.412453 0.613265 0.512766 0.541069 0.548370 0.375839 0.527458 0.622368 0.627067 … 0.738170 0.513514 0.206536 0.108926 0.433777 0.398548 0.434940 0.504445 0.530178 0.492020
27 0.481680 0.415549 0.611343 0.506383 0.528683 0.569401 0.383809 0.505109 0.625439 0.663975 … 0.730810 0.525184 0.203268 0.106404 0.429728 0.404356 0.431830 0.511233 0.524595 0.492020
28 0.467828 0.416925 0.603653 0.506383 0.545632 0.559937 0.375419 0.521073 0.626316 0.615148 … 0.723975 0.535627 0.200654 0.104387 0.425101 0.410163 0.427683 0.516729 0.513959 0.482389
29 0.469169 0.418645 0.597565 0.507801 0.549544 0.548370 0.378775 0.532567 0.618860 0.624375 … 0.717140 0.549140 0.192810 0.104892 0.452863 0.415245 0.423017 0.502990 0.501462 0.488993
… … … … … … … … … … … … … … … … … … … … … …
18197 0.499190 0.489943 0.501295 0.656738 0.542373 0.522082 0.496644 0.635377 0.446053 0.427912 … 0.413249 0.148649 0.554902 0.884518 0.428571 0.480218 0.459824 0.520608 0.505717 0.493946
18198 0.499190 0.489943 0.501295 0.653901 0.648631 0.528391 0.494547 0.501281 0.519298 0.420607 … 0.413775 0.149263 0.555556 0.884518 0.419896 0.479855 0.456713 0.514466 0.511300 0.487892
18199 0.499190 0.489943 0.501295 0.729078 0.505951 0.483701 0.520543 0.500200 0.555110 0.424837 … 0.413249 0.148649 0.556209 0.884518 0.426258 0.476951 0.460342 0.502344 0.505717 0.489268
18200 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.419070 … 0.413775 0.148649 0.555556 0.884518 0.429150 0.474047 0.463971 0.499596 0.504121 0.496147
18201 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.429066 … 0.414301 0.148649 0.555556 0.884518 0.427993 0.476951 0.457232 0.501374 0.496942 0.499450
18202 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.370242 … 0.414826 0.148649 0.555556 0.884518 0.431463 0.477677 0.449974 0.501535 0.492954 0.497799
18203 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.415878 0.148649 0.556209 0.884518 0.432620 0.474773 0.448419 0.499111 0.494018 0.506054
18204 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.416404 0.148034 0.556209 0.884518 0.431463 0.474773 0.450492 0.509132 0.492954 0.511007
18205 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417455 0.148034 0.556209 0.885023 0.428571 0.481670 0.451529 0.512203 0.499867 0.523115
18206 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.148034 0.556209 0.885023 0.427993 0.484211 0.456713 0.502182 0.492422 0.523665
18207 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.148034 0.555556 0.885023 0.429728 0.481307 0.451011 0.499111 0.491890 0.517336
18208 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.148649 0.555556 0.885527 0.433198 0.480218 0.450492 0.501859 0.495613 0.519263
18209 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.148649 0.555556 0.885527 0.432620 0.479855 0.451011 0.502990 0.491890 0.505228
18210 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.148649 0.554902 0.885527 0.435512 0.480944 0.450492 0.505415 0.491359 0.502752
18211 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.148649 0.554902 0.885527 0.430307 0.483485 0.444790 0.506708 0.499069 0.502752
18212 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.418507 0.148649 0.554902 0.885527 0.422209 0.484211 0.447900 0.511556 0.505717 0.501101
18213 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.418507 0.149263 0.555556 0.885527 0.421631 0.488566 0.457232 0.515112 0.501728 0.501926
18214 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.149263 0.556209 0.885527 0.430307 0.487840 0.457750 0.506708 0.492954 0.510732
18215 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.149263 0.555556 0.885527 0.432620 0.486025 0.455158 0.499111 0.496676 0.506329
18216 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417455 0.148649 0.554902 0.885527 0.427993 0.485299 0.446864 0.495394 0.486307 0.514584
18217 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417455 0.148649 0.554248 0.885527 0.426836 0.485299 0.449974 0.495394 0.494018 0.515410
18218 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.149263 0.554248 0.885527 0.429728 0.482396 0.453084 0.500566 0.492954 0.512933
18219 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.417981 0.149263 0.554248 0.885527 0.430885 0.485299 0.453084 0.501212 0.484977 0.511833
18220 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.418507 0.148649 0.554902 0.885527 0.430885 0.486025 0.447900 0.502990 0.487902 0.518987
18221 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.418507 0.148649 0.554902 0.885527 0.429728 0.489655 0.450492 0.501535 0.492422 0.509631
18222 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.418507 0.148649 0.554902 0.885527 0.430885 0.485662 0.453084 0.497495 0.493752 0.503027
18223 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.419033 0.148649 0.554902 0.885527 0.434355 0.483485 0.454121 0.502182 0.497208 0.512383
18224 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.419033 0.148649 0.554248 0.885527 0.433198 0.484936 0.453084 0.498141 0.499601 0.505779
18225 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.419033 0.148649 0.552941 0.885527 0.425101 0.486025 0.450492 0.497818 0.496145 0.506054
18226 0.499190 0.489943 0.501295 0.511256 0.504921 0.466477 0.520543 0.500200 0.555110 0.507358 … 0.418507 0.148649 0.552288 0.885527 0.426258 0.486025 0.452566 0.500242 0.497474 0.504953

18227 rows × 81 columns

In [12]:

import matplotlib.pyplot as plt

# Features are numbered from 1, so feature k is stored in column k - 1.
for featureInd in (9, 24):
    processed.iloc[:, featureInd - 1].hist()
    plt.title('Feature %d histogram before normalization' % featureInd)
    plt.show()

    plt.figure()
    minMaxScaledDf.iloc[:, featureInd - 1].hist()
    plt.title('Feature %d histogram after min-max normalization' % featureInd)
    plt.show()

    plt.figure()
    zscoreDf.iloc[:, featureInd - 1].hist()
    plt.title('Feature %d histogram after z-score normalization' % featureInd)
    plt.show()

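Min-max normalization rescales the values to the range 0 to 1, while z-score normalization shifts the mean to 0 and scales the standard deviation to 1. Both are linear rescalings, so the shape of each histogram is unchanged; only the scale of the horizontal axis differs.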
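Written out, the two normalizations are x' = (x - min) / (max - min) and z = (x - mean) / std. The sketch below applies both by hand to a small made-up vector (the values are an assumption chosen for easy arithmetic) and checks the properties described above. It uses the population standard deviation, as `StandardScaler` does.

```python
import numpy as np

# Made-up sample, chosen so the arithmetic is easy to follow.
x = np.array([2.0, 4.0, 4.0, 6.0, 10.0])

# Min-max normalization: smallest value maps to 0, largest to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# z-score normalization with the population standard deviation (ddof=0).
z_score = (x - x.mean()) / x.std()

print(min_max)
print(z_score.mean(), z_score.std())

# Both transforms are linear with a positive slope, so rescaling the
# z-scores to [0, 1] recovers the min-max values: relative spacing, and
# therefore histogram shape, is the same in both.
z_rescaled = (z_score - z_score.min()) / (z_score.max() - z_score.min())
print(np.allclose(z_rescaled, min_max))
```

For this vector the min-max values are 0, 0.25, 0.25, 0.5 and 1, since the range is 10 - 2 = 8.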

In [13]:

from pandas.plotting import autocorrelation_plot

# Features are numbered from 1, so feature k is stored in column k - 1.
for featureInd in (9, 24):
    autocorrelation_plot(processed.iloc[:, featureInd - 1])
    plt.title('Feature %d auto-correlation plot before normalization' % featureInd)
    plt.show()

    autocorrelation_plot(minMaxScaledDf.iloc[:, featureInd - 1])
    plt.title('Feature %d auto-correlation plot after min-max normalization' % featureInd)
    plt.show()

    autocorrelation_plot(zscoreDf.iloc[:, featureInd - 1])
    plt.title('Feature %d auto-correlation plot after z-score normalization' % featureInd)
    plt.show()

The plots show that feature 24 is closer to random than feature 9, because its autocorrelation values lie nearer to 0.
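Two points behind this reading can be checked numerically. First, the sample autocorrelation subtracts the mean and divides by the variance, so it is unaffected by shifting or positively rescaling a series; min-max and z-score normalization should therefore leave it unchanged. Second, values near 0 at every lag are what white noise produces. The sketch below is illustrative only: the `autocorr` helper and both synthetic series are assumptions, not the notebook's data.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation at one lag: c(lag) / c(0)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    c0 = np.sum(d * d) / n
    c_lag = np.sum(d[: n - lag] * d[lag:]) / n
    return c_lag / c0


rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=1000))  # random walk: strongly autocorrelated
noise = rng.normal(size=1000)              # white noise: no structure

# The two normalizations, applied by hand.
min_max = (series - series.min()) / (series.max() - series.min())
z_score = (series - series.mean()) / series.std()

lags = range(1, 50)
raw_acf = np.array([autocorr(series, h) for h in lags])
mm_acf = np.array([autocorr(min_max, h) for h in lags])
z_acf = np.array([autocorr(z_score, h) for h in lags])
noise_acf = np.array([autocorr(noise, h) for h in lags])

print(np.allclose(raw_acf, mm_acf), np.allclose(raw_acf, z_acf))
print(raw_acf[0], np.abs(noise_acf).max())
```

Under these assumptions the three curves for the random walk coincide, which is why any visible difference between features in the plots reflects the data rather than the normalization.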

II. Feature Extraction (for dataset B)

1. Use PCA as a dimensionality reduction technique to the data, compute the eigenvectors and eigenvalues.

Read the data

In [14]:

arr = sio.loadmat('DataB.mat')
feaDf = pd.DataFrame(arr['fea'])
gndDf = pd.DataFrame(arr['gnd'])
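For reference, the eigenvectors and eigenvalues that PCA relies on can be computed directly from the covariance matrix. The sketch below is a minimal illustration on random data (the matrix `X` is a hypothetical stand-in, not dataset B). In scikit-learn, a fitted `PCA` object exposes the same quantities as `components_` (one eigenvector per row) and `explained_variance_` (the eigenvalues), although individual eigenvectors may differ in sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a feature matrix: 200 samples, 5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Centre each feature.
Xc = X - X.mean(axis=0)

# 2. Sample covariance matrix (features x features, n - 1 denominator).
cov = np.cov(Xc, rowvar=False)

# 3. Eigen-decomposition. eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so sort them into descending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

# 4. Project the centred data onto the principal directions.
scores = Xc @ eigenvectors

# Share of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
print(eigenvalues)
print(explained_ratio)
```

Each eigenvalue equals the variance of the data along its eigenvector, which is why sorting by eigenvalue orders the components by how much variance they retain.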

In [15]:

from sklearn.decomposition import PCA
pca = PCA()
trans = pca.fit(feaDf).transform(feaDf)
trans = pd.DataFrame(trans)
trans

Out[15]:

0 1 2 3 4 5 6 7 8 9 … 774 775 776 777 778 779 780 781 782 783
0 -1069.166304 -513.973184 -139.243261 878.387704 387.873484 -335.304982 -189.857033 -312.580046 -111.445704 -125.322693 … -1.866970e+00 -0.084891 0.020108 -1.687018 0.576924 -0.338576 -0.041519 -0.533611 -0.683888 -0.238770
1 -1099.176077 -570.842223 -67.311779 839.381070 345.573249 -530.737220 -516.056930 -73.720660 -9.525282 -329.657997 … -6.898612e-01 0.453774 0.425908 0.357020 0.328161 1.778184 -0.232896 0.092560 -1.472755 0.647717
2 -673.201385 -167.377150 480.988638 83.823068 1036.833666 76.531663 -184.553283 -311.406005 122.135438 444.768606 … -1.616355e+00 0.075662 1.281379 -0.806874 -0.852442 1.466010 0.901360 0.563870 0.318705 1.407347
3 -1010.903339 -187.044145 506.352247 426.446929 901.897549 73.661148 -316.674873 -617.909186 217.415113 135.694758 … -1.582237e-01 -1.132823 -1.522555 -0.132399 0.735122 -1.769569 -0.331854 -0.490687 0.869740 0.179703
4 -1692.970822 -633.369398 -521.943052 367.356716 -6.919257 -601.851221 -515.146878 325.773643 -262.282233 321.254330 … 2.484220e-01 -0.019291 0.059462 -1.723846 0.548581 -0.418587 0.177183 -0.689386 -0.309217 -0.016248
5 -1341.694310 -536.770010 -578.489504 246.425866 -577.933576 140.191582 -698.380687 -144.262094 -486.957581 314.085981 … 3.222658e-01 -2.201212 -0.003722 0.368970 0.345119 1.186437 -0.132119 0.063865 -0.541003 1.614036
6 -1217.832623 -521.312900 -116.355397 240.552050 680.648563 -501.999433 -913.376790 18.253232 -33.072870 98.394066 … 5.214346e-01 1.008699 0.564041 0.085898 -0.202696 -0.671090 0.122620 -0.738915 -0.000874 0.145653
7 -226.761138 -457.324861 -284.321899 128.615747 -1017.001492 -87.023369 385.875869 32.041779 183.274818 -29.290437 … 1.047898e-01 -0.174483 1.777861 -0.356244 -0.498399 0.475913 -0.368573 -0.347439 -0.579300 -0.607482
8 -1219.222302 -479.274443 8.018715 -521.169429 722.007449 195.983302 -449.664821 -658.803486 -78.579125 555.612551 … -7.058616e-01 0.521618 0.213487 -1.125360 -0.158788 -0.212024 -0.549318 0.846626 -0.427703 0.233331
9 -900.753262 -548.126694 30.258637 46.963719 761.675828 -435.396163 -319.884310 -405.824072 225.253721 406.694260 … -7.490606e-02 -0.272587 1.063569 -1.476583 0.757261 1.472265 -0.302958 0.973383 0.662131 1.881623
10 -1119.444673 -430.540486 -4.091402 150.580695 1045.936651 -171.263259 -6.094476 -381.340636 16.295049 487.386055 … -4.710084e-01 -0.239910 1.340985 -0.046804 -0.900036 -1.693041 -1.518728 -0.795529 -1.255490 0.255158
11 -1132.061141 -688.199951 -534.771525 -8.874299 -914.504627 85.645282 -83.685058 -47.217069 -137.197284 248.664230 … 7.205271e-01 0.925305 -0.811268 0.315662 0.271964 -0.632907 0.927693 0.221857 0.990271 0.121421
12 -1283.397863 -396.151025 187.789686 40.171936 787.148009 63.436551 -65.596336 -775.291061 -320.577712 328.230419 … -1.777156e-01 -0.382525 -1.624642 1.565395 0.852080 0.908979 -1.297958 0.301793 1.354985 -0.085096
13 -431.797862 -454.763166 -364.204768 166.199142 -1016.920930 280.651043 59.963980 230.539038 165.544323 -26.678097 … 8.305265e-01 0.604595 0.111757 -0.406335 -0.416129 0.158928 0.705899 0.239840 0.286835 -0.034404
14 -1368.131762 -641.515614 -732.070197 619.218875 -244.318899 -49.976603 -702.191339 207.956919 -256.934176 -546.515302 … -6.548304e-01 0.282450 0.412267 1.090183 0.079337 0.011269 -0.282651 -0.051128 0.802942 0.268547
15 -784.250675 -419.946331 -589.587363 470.526795 -85.742394 -123.237859 23.330085 -537.007303 -126.336548 332.516210 … 1.391425e-01 -0.365961 -1.186444 -0.549488 -0.086847 -0.128104 0.903247 1.348424 -1.642303 0.503692
16 -1455.821769 -519.039116 109.383460 961.047935 647.104030 -564.693429 -462.231542 57.661025 74.874971 -423.976737 … 4.267662e-01 0.742334 0.588549 -0.518502 0.615122 -0.527216 0.562501 0.802088 -0.589751 -0.310724
17 -1406.742684 -1010.360551 -236.296620 302.702769 -1052.810187 99.689728 -45.212342 411.919088 -381.223891 -281.797719 … 1.319060e+00 -0.010289 0.673363 0.252073 0.227497 -1.407226 0.722688 0.169789 -0.250250 -0.530935
18 -809.569532 -355.676642 -553.061747 739.156447 -342.576671 -298.188245 -150.300975 337.326239 -204.026927 -697.361484 … -1.111147e+00 0.314282 0.451578 -0.263198 -0.284767 0.563432 0.453698 -0.228681 1.077237 -0.722921
19 -1009.720480 -500.758620 148.698733 -206.740212 942.652635 179.051437 -228.104703 -743.129840 -68.426814 349.120365 … 4.569068e-01 -0.500428 1.008004 -0.122772 -0.236029 -0.298242 0.531056 0.684323 -0.494935 0.597317
20 -1256.737747 -136.282075 4.767105 401.171633 111.170531 20.126463 -60.401593 -835.939615 -133.735891 108.127751 … -1.958109e-01 0.131444 0.544740 0.151017 0.592460 -0.282234 -0.572113 -0.210930 0.276628 -1.238902
21 -1097.738331 -791.366616 -662.612802 416.147079 -787.374203 -393.630293 -93.053146 802.644413 -187.720560 291.655381 … -6.627231e-01 -0.811848 1.419320 0.121333 -0.286140 -0.397351 0.966862 -1.184955 0.884866 -0.701912
22 -1237.245760 -1011.960330 -384.654272 69.297448 -1141.412452 115.369106 99.890392 534.396101 -155.357462 -128.501289 … -4.123383e-01 0.881303 0.117137 0.500842 0.908992 -0.309411 0.530647 -1.708357 -1.409054 -0.486977
23 -337.006350 -166.084273 -519.665873 130.820076 -403.671160 -213.609312 144.069046 -656.301161 -230.664488 536.355335 … 3.703482e-02 -0.809039 0.710208 -0.477101 -0.461453 0.130055 -0.794128 -0.242733 0.514527 1.470082
24 -622.921863 -177.985457 -414.160463 -401.750020 2.516656 192.630114 -265.215019 -840.935255 82.878453 675.829133 … 5.356892e-02 -0.620902 -0.795742 -0.786351 -1.517372 -0.256620 -0.223543 -1.535969 -0.298616 -1.064983
25 -1058.479796 -679.243567 -567.416761 49.160913 -884.302279 17.978966 -185.622034 235.783387 -370.680238 -13.668915 … 4.232970e-01 -0.161277 1.130549 -0.797970 1.364715 -1.146128 0.305532 -0.714718 0.808577 -1.161876
26 -170.420127 205.716202 -134.305883 -28.439419 90.064382 81.864711 448.946720 -712.161197 -128.578934 -9.634653 … 6.317750e-01 0.744909 -0.761755 1.505177 -0.792133 -0.538672 -0.472992 -0.965063 0.305972 1.582343
27 -937.063555 -120.700099 -159.267905 29.237288 -114.534026 -366.615037 296.334181 -433.097567 -314.105184 178.210137 … 2.932601e-01 0.123222 -0.123823 -0.561385 -0.483566 0.902943 0.240104 -1.721182 -1.308013 0.787594
28 -232.650631 -283.345693 -543.440591 257.837187 -835.285216 -363.820625 320.100815 371.850520 72.632553 116.720545 … 5.947035e-01 0.489138 0.135854 0.377050 -0.088552 -1.708275 -0.113474 -0.242878 -0.449452 -0.756640
29 -899.559652 -780.081817 -483.711259 169.414276 -1077.021039 151.471500 111.863190 468.177092 -249.953559 -95.826977 … -1.674006e+00 -0.119436 -0.854108 -0.077645 1.172576 -0.003654 -0.582229 0.366777 -0.396206 -0.394839
… … … … … … … … … … … … … … … … … … … … … …
2036 224.500021 717.688537 -195.952977 -499.323792 -267.286036 -368.337387 -89.032554 -181.573433 -347.223541 -510.053937 … 6.726383e-02 -0.202020 -0.312123 1.105970 -0.206732 0.108251 -0.842430 0.523495 -0.107648 -0.336346
2037 -435.315514 1060.086518 -252.240653 -784.207808 17.898867 -426.553471 96.977868 456.511082 381.245755 8.698889 … 1.970480e-02 0.112328 -1.068654 -0.055607 -0.149124 0.520174 0.150830 1.034539 0.438916 0.351792
2038 238.402523 821.431429 -296.668932 72.066998 302.033654 -169.441235 256.897351 -41.170547 -210.214010 325.530664 … -9.770797e-01 1.148939 -2.066749 0.544490 -0.854579 -1.099545 0.776323 0.455741 0.945283 -2.146525
2039 479.799398 304.242620 -530.251819 -263.518464 -18.481148 -269.480921 310.804690 -239.886632 -109.217451 -437.991055 … -7.576343e-02 -0.415398 -0.902072 -0.711554 0.828067 -0.049272 -0.621797 -0.561703 -0.781354 -0.586075
2040 -85.398938 899.484460 -359.165698 -478.939958 30.973590 -421.339300 339.919452 323.074595 201.267312 -77.545571 … 1.549309e+00 -1.805956 0.851106 -0.820759 0.520234 1.442292 -0.333977 -0.059679 0.935662 -0.216852
2041 -975.896906 891.969722 -658.015458 -273.613739 -529.811171 -94.804905 -506.695143 702.175432 632.401578 -159.268337 … 1.413062e+00 1.833163 0.717416 0.773201 -1.264493 1.683655 0.559883 -1.457644 -0.459947 -0.372735
2042 -626.495241 838.788408 -639.081008 212.654848 -340.914705 80.270632 142.579529 303.414827 151.562446 -281.652763 … 4.894657e-01 0.865822 0.359183 -0.426559 -0.453866 0.468343 0.161468 2.170382 0.645651 0.817388
2043 258.028617 579.227378 -335.781308 -127.985693 149.696587 -147.257143 -44.205310 -408.000293 -113.630554 -386.437055 … -5.367552e-01 -0.552712 0.880173 0.040852 0.800064 0.183633 -0.217447 -0.116964 0.737840 -0.348819
2044 252.991333 361.106673 -45.994668 -275.875656 123.818071 46.475332 -175.237808 -95.136679 500.235924 -264.320536 … 1.005870e-01 -2.250669 0.806874 -1.290289 0.086770 1.047588 -0.848870 1.283855 1.267557 1.054427
2045 -43.933473 596.114295 -360.973687 -150.061260 152.710758 44.882070 -320.612956 -406.928686 60.535033 -410.485002 … -9.133447e-01 -0.161106 -0.257986 -0.576165 -0.714516 0.778803 -1.023963 -1.503650 0.094508 -0.331217
2046 8.716199 710.432025 -23.709776 -105.468206 -499.729993 -119.021552 141.187623 -197.607646 -66.204958 -517.241369 … 9.633770e-01 -1.157743 -0.772906 -1.253108 2.021171 -1.480527 -1.078181 0.298750 0.205029 0.034573
2047 432.475888 548.321160 -174.616090 99.150459 -3.700452 -295.330003 -290.907181 -243.868960 -344.828150 43.844256 … 4.301049e-01 0.841631 2.026202 -0.539848 1.312115 -0.349677 -1.298490 0.301358 -0.927979 0.103058
2048 -164.177255 959.226629 -239.753319 -546.282908 -434.398609 -269.636934 177.483358 9.427421 130.772654 -108.527800 … 7.257674e-01 0.135984 1.329049 0.403042 -2.722151 -0.649012 1.308447 0.930628 0.785000 -0.567963
2049 19.806668 944.896452 -408.627564 -141.635179 282.084948 -27.051497 315.504355 -208.362093 -476.937131 -10.460458 … -9.347001e-01 0.067785 1.434866 0.927551 0.308521 1.481449 0.374957 -0.488117 -0.770377 0.061573
2050 260.650271 726.328779 -374.089981 326.390047 -251.397107 -172.193151 183.226086 -97.919999 -167.813291 256.801816 … 3.585923e-01 0.430714 -1.042078 0.456595 0.547368 -0.264048 -0.083985 0.502801 -0.227676 0.312044
2051 337.640613 568.111744 -358.456523 -207.139564 -262.826142 -492.078627 232.162057 -133.308352 7.725171 -443.847330 … 9.243614e-01 1.042489 1.283304 0.953035 0.530451 -0.932297 1.773926 1.435049 -0.678990 0.364603
2052 111.635685 782.005670 -337.485075 -404.216222 158.940143 -295.307423 130.147134 -42.810337 -651.072148 -255.777814 … 3.471015e-01 0.741243 0.358033 -0.518499 -0.947758 0.838668 1.031081 -0.240294 -0.656091 -0.037359
2053 575.789733 321.964107 193.327055 -4.403837 546.809983 104.206169 -729.361741 -38.931700 190.821083 309.340733 … -8.631813e-01 -0.189686 -0.567396 -0.692963 0.928563 -1.084058 0.752349 1.116688 -0.872614 -1.054835
2054 -56.731706 983.452324 -101.364217 -575.686798 191.904316 -464.280977 -101.488108 -121.022730 -528.247937 -399.415584 … 3.581617e-01 2.287442 0.293284 -0.684196 0.547065 -0.151886 0.388458 -0.212971 0.073012 0.848820
2055 391.142965 529.109716 -360.665581 34.201602 227.983142 -82.175911 -262.703110 -301.288813 77.284174 -387.278903 … -5.823842e-07 0.139710 0.748517 -1.735751 -1.033248 0.042045 -0.536040 -1.827261 0.504767 -0.127187
2056 222.286099 790.011945 -279.594895 -583.068983 -311.732657 -212.255003 241.094219 3.825699 386.429062 -173.589168 … 5.083759e-02 -0.382053 0.831296 0.080013 0.510445 0.422860 -0.379502 -0.065517 0.320467 -1.119466
2057 432.180822 776.933460 49.309403 -306.113951 153.785454 -115.321890 -310.977746 -362.917298 -359.629838 -390.862098 … 9.127987e-01 -0.150100 -1.416913 -1.180253 -0.216539 -1.985620 -0.154041 -0.090064 0.542056 -0.034879
2058 -141.001824 566.194886 -347.955180 69.169126 -373.710178 241.262334 167.023778 -555.740141 458.383106 119.538288 … -2.816329e-01 0.386191 -1.443055 0.216804 0.515054 0.060603 0.978413 0.059415 -0.653913 -0.633914
2059 -467.623465 718.782394 -152.832548 -135.848772 344.854140 76.905198 -440.266244 -330.318037 216.443552 -299.199686 … -2.566683e-02 0.130000 -0.893729 -1.067120 0.430176 -0.448090 -0.033895 0.100113 -0.050692 -1.174197
2060 -514.275826 1049.482202 -256.796813 -906.542691 -121.841814 -276.421567 -9.396248 266.499936 40.353150 -444.488705 … -8.330565e-01 -0.300031 0.141213 -0.250310 -0.635696 0.332390 0.351759 0.119501 0.471480 0.221765
2061 24.355662 742.490057 -467.186538 -255.047804 -157.302478 -239.561076 281.447738 -379.925546 4.622007 -6.770294 … -3.704237e-01 1.090093 1.351715 0.539548 -0.619355 -0.829358 -0.392200 1.028506 -1.373923 0.116358
2062 -48.768593 734.458335 -334.353122 -496.721083 117.983630 -467.477372 293.985916 278.772911 206.289920 -273.060815 … 1.748901e+00 -1.171978 1.029203 -0.349139 -0.750915 -0.464229 -0.226356 0.255568 -0.746874 -0.033951
2063 -131.021601 866.607035 -397.861565 -248.089962 45.492451 -93.904547 425.965974 -9.764928 -422.312250 5.656109 … 1.476022e+00 1.406747 -0.057995 -0.510126 0.272517 0.230043 -0.197592 -1.201246 0.144962 -0.737190
2064 262.141229 652.777351 -347.602739 72.427962 -80.070774 -164.182531 -48.367170 -300.056899 -268.817282 -106.874623 … -1.011525e+00 0.900132 -0.416246 -0.330704 0.245400 -0.554359 -0.735933 -0.625400 -1.717483 -0.719203
2065 480.891094 432.743142 18.124027 -364.053056 502.566777 428.276701 -406.951773 -60.115665 373.761359 -233.606997 … -4.350459e-02 0.444487 -0.265254 -0.102027 0.201297 -0.839082 -0.131206 -0.089335 -0.189487 -0.019911

2066 rows × 784 columns

2. Plot a 2 dimensional representation of the data points based on the first and second principal components. Explain the results versus the known classes (display data points of each class with a different color).

In [16]:

colors = ['navy', 'turquoise', 'darkorange', 'red', 'green']
classes = [0, 1, 2, 3, 4]
classesStr = ['class ' + str(i) for i in classes]
for color, i, class_name in zip(colors, classes, classesStr):
    # rows of the PCA-transformed data that belong to class i
    t = trans[gndDf.iloc[:, 0] == i]
    # x axis: 1st principal component, y axis: 2nd principal component
    plt.scatter(t.iloc[:, 0], t.iloc[:, 1], color=color, alpha=.8,
                label=class_name)

plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('2 dimensional representation with first and second principal components')
plt.show()

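The classes show relatively clear boundaries in this plot. The first two principal components capture the largest share of the variance in the data, and here that variance appears to coincide largely with the differences between the classes.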

3. Repeat step 2 for the 5th and 6th components. Comment on the result.

In [17]:

for color, i, class_name in zip(colors, classes, classesStr):
    # rows of the PCA-transformed data that belong to class i
    t = trans[gndDf.iloc[:, 0] == i]
    # x axis: 5th principal component, y axis: 6th principal component
    plt.scatter(t.iloc[:, 4], t.iloc[:, 5], color=color, alpha=.8,
                label=class_name)

plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('2 dimensional representation with 5th and 6th principal components')
plt.show()

Unlike the previous plot of the first two components, the classes are not clearly separated here, because the 5th and 6th components together account for less variance than the first two.
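The variance argument can be checked numerically by comparing the share of variance captured by each pair of components. The following is a minimal sketch on synthetic data, not on the dataset used in this notebook; the helper `pair_variance_ratio` and the names `X_demo` and `pca_demo` are hypothetical and introduced only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA


def pair_variance_ratio(fitted_pca, i, j):
    """Fraction of the total variance captured by components i and j (0-based)."""
    ratios = fitted_pca.explained_variance_ratio_
    return float(ratios[i] + ratios[j])


rng = np.random.RandomState(0)
# Synthetic stand-in: 200 samples, 10 features with decreasing spread.
scales = np.array([10, 8, 6, 5, 4, 3, 2, 1.5, 1, 0.5])
X_demo = rng.randn(200, 10) * scales

pca_demo = PCA().fit(X_demo)
first_pair = pair_variance_ratio(pca_demo, 0, 1)  # 1st and 2nd components
later_pair = pair_variance_ratio(pca_demo, 4, 5)  # 5th and 6th components
print(first_pair, later_pair)
```

Because scikit-learn sorts the components by decreasing explained variance, the first pair can never capture less than a later pair. On the real data, the same two calls on the fitted `pca` object would show how large the gap is.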

4. Use the Naive Bayes classifier to classify 8 sets of dimensionality reduced data (using the first 2, 4, 10, 30, 60, 200, 500, and all 784 PCA components). Plot the classification error for the 8 sets against the retained variance (rm from lect3:slide22) of each case.

In [18]:

from sklearn.naive_bayes import GaussianNB

class_errors = []
retainedVars = []
y = gndDf.iloc[:, 0]
for n in [2, 4, 10, 30, 60, 200, 500, 784]:
    # first n principal components; .iloc excludes the end position,
    # whereas the deprecated .ix sliced by label and returned n + 1 columns
    X_n = trans.iloc[:, :n]
    gnb = GaussianNB()
    y_pred = gnb.fit(X_n, y).predict(X_n)
    # fraction of misclassified samples, measured on the training data
    class_error = np.mean(y_pred != y)
    class_errors.append(class_error)
    # retained variance: variance of the first n components over the total
    retainedVars.append(sum(pca.explained_variance_[:n]) / sum(pca.explained_variance_))

plt.plot(retainedVars, class_errors)
plt.xlabel('retained variance')
plt.ylabel('classification error')
plt.title('Naive Bayes Classification Error')
plt.show()
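The retained variance used on the x axis is the sum of the m largest eigenvalues divided by the sum of all eigenvalues. This is assumed to match the r_m of the lecture slide, which is not reproduced here. A minimal sketch with made-up eigenvalues follows; the function name `retained_variance` is hypothetical.

```python
import numpy as np


def retained_variance(eigenvalues, ms):
    """Return r_m = (sum of the m largest eigenvalues) / (sum of all) for each m in ms."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending order
    cumulative = np.cumsum(ev) / ev.sum()
    return [float(cumulative[m - 1]) for m in ms]


# Made-up eigenvalues, only to illustrate the computation.
demo = retained_variance([4.0, 3.0, 2.0, 1.0], [1, 2, 4])
print(demo)
```

Computing the cumulative sum once avoids re-summing the eigenvalues for every m, which is what the loop in the cell above does.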

5. As the class labels are already known, you can use the Linear Discriminant Analysis (LDA) to reduce the dimensionality, plot the data points using the first 2 LDA components (display data points of each class with a different color). Explain the results obtained in terms of the known classes. Compare with the results obtained by using PCA.

In [19]:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# fit LDA on the labelled data and project onto the first 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2)
transLda = lda.fit(feaDf, gndDf.iloc[:, 0]).transform(feaDf)
transLda = pd.DataFrame(transLda)
colors = ['navy', 'turquoise', 'darkorange', 'red', 'green']
classes = [0, 1, 2, 3, 4]
classesStr = ['class ' + str(i) for i in classes]
for color, i, class_name in zip(colors, classes, classesStr):
    # rows of the LDA-transformed data that belong to class i
    t = transLda[gndDf.iloc[:, 0] == i]
    plt.scatter(t.iloc[:, 0], t.iloc[:, 1], color=color, alpha=.8,
                label=class_name)

plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('2 dimensional representation with the first 2 LDA components')
plt.show()

LDA uses the class labels and seeks components that maximize the separation between the known classes (between-class variance relative to within-class variance), so the classes are separated more clearly in this plot than in the PCA plot. PCA ignores the labels and only seeks components that maximize the variance of the data as a whole.
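The difference between the two criteria can be illustrated on a small synthetic example in which the direction of largest variance carries no class information. This is a sketch under stated assumptions, not a result from the dataset above; `fisher_ratio`, `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def fisher_ratio(z, y):
    """Between-class over within-class sum of squares of a 1-D projection."""
    z = np.asarray(z).ravel()
    overall = z.mean()
    between = 0.0
    within = 0.0
    for c in np.unique(y):
        zc = z[y == c]
        between += len(zc) * (zc.mean() - overall) ** 2
        within += ((zc - zc.mean()) ** 2).sum()
    return between / within


rng = np.random.RandomState(0)
n = 300
y_demo = np.repeat([0, 1], n)
# Feature 0: large variance, no class information.
# Feature 1: small variance, carries the class difference.
X_demo = np.column_stack([
    rng.randn(2 * n) * 10.0,
    rng.randn(2 * n) * 0.5 + y_demo * 2.0,
])

z_pca = PCA(n_components=1).fit_transform(X_demo)
z_lda = LinearDiscriminantAnalysis(n_components=1).fit(X_demo, y_demo).transform(X_demo)

ratio_pca = fisher_ratio(z_pca, y_demo)
ratio_lda = fisher_ratio(z_lda, y_demo)
print(ratio_pca, ratio_lda)
```

In this construction PCA's first component follows the noisy feature, so its projection barely separates the classes, while the LDA projection follows the feature that carries the class shift.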
