PART I: Age regression from gray matter masks
This part of the coursework is about age regression from gray matter masks which have been extracted from brain MRI scans.
Each voxel in a gray matter mask is one feature. Because the number of voxels is huge, you first need to implement a dimensionality reduction using PCA; the reduced data can then be used to train a model for age regression.
Read the descriptions and code carefully and look out for the cells marked with 'TASK'.
The following cell contains helper code to obtain filenames and for reading age information for each subject from a spreadsheet.
In [ ]:
import os
import re
import numpy
import xlrd
import SimpleITK as sitk

# Retrieve the list of patients
data_dir = './data/graymatter'
imageNames = sorted(next(os.walk(data_dir))[2])  # File names in the top level of data_dir

# Read the spreadsheet to retrieve the age information for each subject
ages = []
csvfilename = './data/meta/IXI.xls'
workbook = xlrd.open_workbook(csvfilename)
sheet = workbook.sheet_by_index(0)
idCells = sheet.col_slice(colx=0, start_rowx=1, end_rowx=None)    # subject IDs, from row 1 onwards
ageCells = sheet.col_slice(colx=11, start_rowx=1, end_rowx=None)  # ages, from row 1 onwards
# Map each subject ID to the age stored in the same row
idAgeDic = dict((idCell.value, ageCell.value) for idCell, ageCell in zip(idCells, ageCells))
This cell defines a function for reading gray matter masks and corresponding age labels.
In [ ]:
def readImagesAndLabels(imagenames):
    ImgArray = []
    LblArray = []
    for ImageName in imagenames:
        # Extract the numeric subject ID that follows 'wc1IXI' in the file name
        regexp_result = re.search(r'wc1IXI\d+', ImageName)
        subjectId = int(regexp_result.group().split('wc1IXI')[1])
        LblArray.append(idAgeDic[subjectId])
        # Loading the image
        fullImageName = data_dir + '/' + ImageName
        inImage = sitk.ReadImage(fullImageName)
        inArray = sitk.GetArrayFromImage(inImage)
        ImgArray.append(inArray.flatten())
        # Debug information
        if 0:
            print('subjectName: {0}'.format(ImageName))
            print('subjectId: {0}'.format(subjectId))
            print('subjectAge: {0}\n'.format(idAgeDic[subjectId]))
    # Create numpy arrays - training data
    ImgArray = numpy.array(ImgArray, dtype=numpy.uint8)    # 2D array - [nSubjects, Zdim x Ydim x Xdim]
    LblArray = numpy.array(LblArray, dtype=numpy.float32)  # 1D array - [nSubjects]
    return ImgArray, LblArray
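The ID lookup inside the function above can be tried in isolation. The following is a minimal sketch, assuming a made-up file name and a made-up age table; `subject_id_from_filename`, `example_name` and `example_ages` are hypothetical names introduced only for this illustration.

```python
import re

def subject_id_from_filename(image_name):
    """Return the integer subject ID embedded in a 'wc1IXI<digits>' file name."""
    match = re.search(r'wc1IXI(\d+)', image_name)
    if match is None:
        raise ValueError('No subject ID found in {0}'.format(image_name))
    return int(match.group(1))

# Hypothetical file name and age table, for illustration only.
example_name = 'wc1IXI012-example.nii.gz'
example_ages = {12.0: 38.8}  # xlrd returns numeric cells as floats

subject_id = subject_id_from_filename(example_name)
print(subject_id)                # 12
print(example_ages[subject_id])  # 38.8, because int 12 and float 12.0 are the same dict key
```

Raising an error on a non-matching name is a design choice of this sketch; the function above would instead fail with an AttributeError on `regexp_result.group()`.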
TASK 1.1: Dimensionality reduction
In the next cell you are asked to implement a dimensionality reduction using PCA from sklearn's decomposition module. The principal components should be learned from the training data only, and then used to perform a dimensionality reduction for both training and testing data.
Check out http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
In [ ]:
from sklearn import decomposition

def pcaReduction(trainingData, testingData):
    # Perform dimensionality reduction on the images using Principal Component Analysis
    # ADD CODE HERE
    return trainingData_reduced, testingData_reduced
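As a hedged illustration of the fit-on-training, transform-both pattern described for this task, the sketch below runs sklearn's PCA on random stand-in data rather than on the gray matter masks. The array sizes and `n_components=5` are arbitrary assumptions, not recommended settings.

```python
import numpy
from sklearn import decomposition

# Synthetic stand-in data: 20 training and 10 testing "images" of 500 voxels each.
rng = numpy.random.RandomState(0)
train = rng.rand(20, 500)
test = rng.rand(10, 500)

# n_components=5 is an arbitrary choice for this sketch.
pca = decomposition.PCA(n_components=5)
train_reduced = pca.fit_transform(train)  # learn the components from training data only
test_reduced = pca.transform(test)        # project testing data onto the same components

print(train_reduced.shape)  # (20, 5)
print(test_reduced.shape)   # (10, 5)
```

Fitting on the training data alone keeps information from the testing subjects out of the learned components.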
TASK 1.2: Training a model for age regression
In the next cell you are asked to implement a function that takes input data and corresponding labels and trains a regression model. It is up to you to choose a suitable method from the many that are provided in sklearn.
Check out http://scikit-learn.org/stable/supervised_learning.html
In [ ]:
def trainRegressor(data, labels):
    # ADD CODE HERE
    return model
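The choice of method is left open. Purely as an example of the sklearn estimator interface, the sketch below fits a ridge regression to synthetic data; `Ridge` and `alpha=1.0` are assumptions made for illustration, and other regressors from sklearn follow the same `fit` pattern.

```python
import numpy
from sklearn.linear_model import Ridge

# Synthetic stand-in data: 30 subjects with 5 reduced features each.
rng = numpy.random.RandomState(0)
features = rng.rand(30, 5)
ages = 20.0 + 50.0 * features[:, 0]  # synthetic "ages" driven by the first feature

# Ridge regression is one possible choice, not the prescribed one.
model = Ridge(alpha=1.0)
model.fit(features, ages)

print(model.coef_.shape)  # (5,) - one weight per feature
```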
TASK 1.3: Apply the learned model on new data
In the next cell you are asked to implement a function that takes data and a learned regression model as input, applies the model to the data, and returns the predicted labels.
In [ ]:
def applyRegressor(data, model):
    # ADD CODE HERE
    return labels
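Fitted sklearn regressors expose a `predict` method, so applying a model usually comes down to a single call. The sketch below assumes a `LinearRegression` fitted on four exactly linear toy points; `apply_model` is a hypothetical helper, not the required solution.

```python
import numpy
from sklearn.linear_model import LinearRegression

def apply_model(data, model):
    """Return the labels predicted by an already fitted sklearn regressor."""
    return model.predict(data)

# Toy training data following y = 10 + 10 * x exactly.
train_x = numpy.array([[0.0], [1.0], [2.0], [3.0]])
train_y = numpy.array([10.0, 20.0, 30.0, 40.0])
model = LinearRegression().fit(train_x, train_y)

# Unseen inputs: the predictions should be close to 50 and 60.
new_x = numpy.array([[4.0], [5.0]])
predicted = apply_model(new_x, model)
print(predicted.shape)  # (2,)
```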
The following cell implements an evaluation function that takes an array of true age labels and an array of predicted age labels, and assesses prediction quality by computing the mean absolute error and the root mean squared error. It can optionally plot the true vs. predicted labels.
In [ ]:
import matplotlib.pyplot as plt

def evaluate(labels_true, labels_predicted, plot=False):
    if plot:
        %pylab inline
        plt.figure(figsize=(6,6))
        plt.scatter(labels_true, labels_predicted)
        plt.plot([0, 100], [0, 100], '--k', linewidth=3)  # identity line: perfect prediction
        plt.axis('tight'); plt.xlabel('True age', fontsize=15); plt.ylabel('Predicted age', fontsize=15)
        plt.tick_params(axis='both', which='major', labelsize=15); plt.grid(True); plt.show()
    # Age prediction errors
    prediction_errors = labels_true - labels_predicted
    # Mean absolute error
    mean_error = numpy.mean(numpy.abs(prediction_errors))
    print('Mean absolute error is {0}'.format(mean_error))
    # Root mean squared error
    root_mean_squared_error = numpy.sqrt(numpy.mean(numpy.power(prediction_errors, 2)))
    print('Root mean squared error is {0}'.format(root_mean_squared_error))
    return prediction_errors
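To make the two error measures in the evaluation function concrete, here is a small worked example on four made-up age pairs. The numbers are invented for illustration and are not results from the data set.

```python
import numpy

# Made-up true and predicted ages for four subjects.
labels_true = numpy.array([30.0, 40.0, 50.0, 60.0])
labels_predicted = numpy.array([32.0, 37.0, 50.0, 65.0])

errors = labels_true - labels_predicted     # [-2, 3, 0, -5]
mae = numpy.mean(numpy.abs(errors))         # (2 + 3 + 0 + 5) / 4 = 2.5
rmse = numpy.sqrt(numpy.mean(errors ** 2))  # sqrt((4 + 9 + 0 + 25) / 4) = sqrt(9.5)

print(mae)   # 2.5
print(rmse)  # about 3.08
```

The root mean squared error is never smaller than the mean absolute error, and the gap grows when a few subjects have large errors.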
The next cell prepares the data for a very simple experiment where the images are split half/half into two sets, one for training and one for testing.
In [ ]:
# Preload data and split half/half into training and testing
images, labels = readImagesAndLabels(imageNames)
trainingImages = images[0::2]  # even-indexed subjects
trainingLabels = labels[0::2]
testingImages = images[1::2]   # odd-indexed subjects
testingLabels = labels[1::2]
print('Number of training images is {0}'.format(len(trainingImages)))
print('Number of testing images is {0}'.format(len(testingImages)))
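The interleaved slicing used for the split can be checked on a tiny array. This sketch uses seven made-up labels to show how an odd number of subjects is divided.

```python
import numpy

# Seven made-up labels, indexed 0..6.
labels = numpy.arange(7)

training = labels[0::2]  # every second element starting at index 0
testing = labels[1::2]   # every second element starting at index 1

print(training)  # [0 2 4 6]
print(testing)   # [1 3 5]
```

With an odd number of subjects the training set receives one more subject than the testing set.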
TASK 1.4: Simple experiment
In the next four cells you are asked to set up and execute a simple experiment using the above training and testing images. You need four steps: 1) dimensionality reduction, 2) train a regressor, 3) apply the regressor to the test data, 4) evaluate the prediction quality.
In [ ]:
# 1) Dimensionality reduction
# ADD CODE HERE
In [ ]:
# 2) Train a model
# ADD CODE HERE
In [ ]:
# 3) Test the model
# ADD CODE HERE
In [ ]:
# 4) Evaluate predictions
# ADD CODE HERE
TASK 1.5: Cross validation using k-folds
In the next cell you are asked to implement a k-fold cross validation such that every subject is used once for testing and prediction errors can be computed for all subjects.
In [ ]:
from sklearn.model_selection import KFold

def kfold_cross_validation(n_folds, imgs, lbls):
    kf = KFold(n_splits=n_folds)
    predictions = numpy.array([])
    for foldId, (trainIds, testIds) in enumerate(kf.split(range(0, len(imgs)))):
        print('Fold: {0}/{1}'.format(foldId+1, n_folds))
        # ADD CODE HERE
        predictions = numpy.concatenate((predictions, testingLabels_predicted))
    return predictions
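The sketch below illustrates, on synthetic data, how a k-fold loop ends up with one prediction per subject. The PCA and `Ridge` steps inside the loop are assumptions chosen for illustration; the point is the index bookkeeping, not the model.

```python
import numpy
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in data: 12 subjects with 50 features each.
rng = numpy.random.RandomState(0)
imgs = rng.rand(12, 50)
lbls = 20.0 + 60.0 * rng.rand(12)

kf = KFold(n_splits=3)
predictions = numpy.array([])
tested = []
for train_ids, test_ids in kf.split(imgs):
    # Learn PCA and the regressor on the training fold only.
    pca = PCA(n_components=3).fit(imgs[train_ids])
    model = Ridge().fit(pca.transform(imgs[train_ids]), lbls[train_ids])
    # Predict the held-out fold and append in fold order.
    fold_predictions = model.predict(pca.transform(imgs[test_ids]))
    predictions = numpy.concatenate((predictions, fold_predictions))
    tested.extend(test_ids.tolist())

print(tested == list(range(12)))  # True
print(predictions.shape)          # (12,)
```

Because `KFold` does not shuffle by default, the test folds are consecutive blocks, so the concatenated predictions line up with the original label order. With `shuffle=True` the predictions would have to be reordered before comparing them with the labels.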
The following cell runs a 2-fold cross validation and computes errors for all subjects.
In [ ]:
predictions = kfold_cross_validation(2, images, labels)
errors = evaluate(labels, predictions, True)
TASK 1.6 (optional): Training size vs prediction error
In the next cell you are asked to explore how the prediction error changes with the number of training subjects. One way to do this is to successively increase the size of the image set and run a k-fold cross validation on each set.
In [ ]:
# Preload training and testing data
nImages = len(imageNames)
imageSetSize = numpy.linspace(0.1, 1, 5)  # fractions of the full image set
plotList_nTrainImages = []
plotList_errors = []
for perc in imageSetSize:
    folds = 2
    nImg = int(round(nImages * perc))
    nTrainImg = int(round(nImg - nImg // folds))  # images left for training in each fold
    print('Number of training images is {0}'.format(nTrainImg))
    # ADD CODE HERE
    plotList_nTrainImages.append(nTrainImg)
    plotList_errors.append(errors)
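One way to sketch this exploration on synthetic data is shown below. It swaps the hand-written fold loop for sklearn's `cross_val_predict` with a PCA plus `Ridge` pipeline, and summarises each subset by its mean absolute error; all of these are assumptions made for illustration, not the expected solution.

```python
import numpy
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 40 subjects with 50 features each.
rng = numpy.random.RandomState(0)
imgs = rng.rand(40, 50)
lbls = 20.0 + 60.0 * imgs[:, 0]

folds = 2
sizes = []
errors_per_size = []
for perc in numpy.linspace(0.25, 1, 4):
    n_img = int(round(len(imgs) * perc))
    n_train = n_img - n_img // folds
    # The pipeline refits PCA and the regressor inside every fold.
    pipeline = make_pipeline(PCA(n_components=3), Ridge())
    predicted = cross_val_predict(pipeline, imgs[:n_img], lbls[:n_img],
                                  cv=KFold(n_splits=folds))
    sizes.append(n_train)
    errors_per_size.append(numpy.mean(numpy.abs(lbls[:n_img] - predicted)))

print(sizes)  # [5, 10, 15, 20]
```

Storing one scalar error per subset keeps the two lists the same length, which is what a line plot of error against training size needs.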
In [ ]:
%pylab inline
plt.figure(figsize=(6,4))
plt.plot(plotList_nTrainImages, plotList_errors, 'b-', marker='o', markersize=10)
plt.xlabel('Number of training images', fontsize=15); plt.ylabel('Error (age)', fontsize=15)
plt.tick_params(axis='both', which='major', labelsize=15); plt.grid(True); plt.show()