Programming MATLAB
Paul Cotofrei
information management institute master of science in finance
2017
Categorical Arrays
Starting from version R2013, MATLAB introduced new data types
Categorical Arrays
Tables
Map containers
Time Series
Date and Time arrays
Categorical arrays: data type to store data with values from a finite set of discrete categories.
The categories can have a natural order, but it is not required.
Provides efficient storage and convenient manipulation of nonnumeric data, while also maintaining meaningful names for the values
Create categorical arrays
By default, categorical arrays contain categories that have no mathematical ordering.
Example: the set of categories {'dog', 'cat', 'bird'}
But it is possible to create an ordinal categorical array, whose categories have a meaningful mathematical ordering.
For example, the set of size categories {'small', 'medium', 'large'} has the mathematical ordering small < medium < large.
Generate categorical array from cell array of strings
» labels = {'MA','ME','CT','VT','ME','NH','VT','MA','NH','CT','RI'};
» state = categorical(labels);
» categories(state) % List the discrete categories in the variable state
'CT'
'MA'
'ME'
'NH'
'RI'
'VT'
% The categories are listed in alphabetical order.
Generate categorical arrays
Generate ordinal categorical array from cell array of strings
% 1-by-8 cell array of strings containing the sizes of eight objects
» allsizes = {'medium', 'large', 'small', 'small', 'medium', 'large', 'medium', 'small'};
% cell array specifying the set of categories
» valueset = {'small','medium','large'};
% ordinal categorical array
» sizeorder = categorical(allsizes, valueset, 'Ordinal', true)
sizeorder =
medium large small small medium large medium small
» categories(sizeorder)
'small'
'medium'
'large'
% for an ordinal categorical array, the first category specified is the smallest and the last category is the largest
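Because the categories of an ordinal array are ordered, relational operators and functions such as max and sort follow the category order rather than the alphabetical one. The following is a minimal sketch that rebuilds the array from the slide; the names isOrd, aboveSmall, largest and sorted are introduced here only for illustration.

```matlab
% Minimal sketch: comparisons on an ordinal categorical array.
allsizes = {'medium','large','small','small','medium','large','medium','small'};
valueset = {'small','medium','large'};
sizeorder = categorical(allsizes, valueset, 'Ordinal', true);

isOrd      = isordinal(sizeorder);   % true for an ordinal categorical array
aboveSmall = sizeorder > 'small';    % logical mask: elements that are medium or large
largest    = max(sizeorder);         % largest category present in the data
sorted     = sort(sizeorder);        % sorted by category order: small, medium, large
```

For a non-ordinal categorical array only == and ~= are available; the inequality above would raise an error.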
By default, the name of a category is the name of the value (string), but it is possible to specify the category names
» A = {'r' 'b' 'g'; 'g' 'r' 'b'}
A =
'r' 'b' 'g'
'g' 'r' 'b'
» B = categorical(A, {'b', 'r', 'g'}, {'blue', 'red', 'green'})
B =
red blue green
green red blue
» categories(B)
'blue'
'red'
'green'
Generate categorical arrays
Generate categorical array from numerical array
» A = randi([1, 3], 1, 4)
A =
3 1 2 1
» valueset = 1:3;
» catnames = {'un', 'deux', 'trois'};
» B = categorical(A, valueset, catnames)
B =
trois un deux un
Generate ordinal categorical array from numerical array
» A = randi([1, 3], 1, 5)
A =
2 3 1 2 1
» valueset = 1:3;
» catnames = {'child', 'adult', 'senior'};
» B = categorical(A, valueset, catnames, 'Ordinal', true)
B =
adult senior child adult child
» categories(B)
'child'
'adult'
'senior'
Generate categorical array
Function discretize(X, edges, 'categorical', categoryNames):
Creates a categorical array where each bin is a category
The jth bin contains element X(i) if edges(j) ≤ X(i) < edges(j + 1) for 1 ≤ j < N, where N is the number of bins and length(edges) = N + 1.
The last bin contains both edges, such that edges(N) ≤ X(i) ≤ edges(N + 1).
The category names are set with the strings from the cell array categoryNames, whose length must be equal to the number of bins.
» x = rand(100,1)*50;
» catnames = {'small','medium','large'};
» binnedData = discretize(x, [0 15 35 50], 'categorical', catnames);
» summary(binnedData); % print the number of elements in each category (counts vary because x is random)
small 30
medium 35
large 35
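The edge rule above can be checked on a few hand-picked values. This is a minimal sketch with hypothetical data (the vector x below is chosen to sit exactly on the edges, plus one value outside all bins); it is not the random example from the slide.

```matlab
% Minimal sketch: bins are left-inclusive, except the last bin,
% which also includes its right edge.
x = [0 14.9 15 35 50 60];                 % 60 lies outside every bin
catnames = {'small','medium','large'};
binned = discretize(x, [0 15 35 50], 'categorical', catnames);

counts  = countcats(binned);              % elements per category: small, medium, large
outside = isundefined(binned);            % true only for the out-of-range value 60
```

Here 15 falls in 'medium' and both 35 and 50 fall in 'large', so the counts are 2, 1 and 2; the value 60 gets no category.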
Access Data Using Categorical Arrays
Select data by category
Select elements from particular categories: for categorical arrays, use the logical operators == or ~= to select data that is in, or not in, a particular category. To select data in a particular group of categories, use the ismember function. For ordinal categorical arrays, use the inequalities >, >=, <, or <= to find data in categories above or below a particular category.
Delete data that is in a particular category: use logical operators to include or exclude data from particular categories.
Find elements that are not in a defined category: categorical arrays indicate which elements do not belong to a defined category by <undefined>. Use the isundefined function to find them.
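The three operations listed above can be sketched on a small array. The data and variable names (pets, isCat, dogOrBird, noCategory) are hypothetical and serve only as an illustration.

```matlab
% Minimal sketch: select, find undefined, and delete by category.
% 'fish' is not in the value set, so that element has no category.
pets = categorical({'dog','cat','bird','cat','fish'}, {'dog','cat','bird'});

isCat      = pets == 'cat';                   % elements in the category 'cat'
dogOrBird  = ismember(pets, {'dog','bird'});  % elements in a group of categories
noCategory = isundefined(pets);               % elements without a defined category

pets(isCat) = [];                             % delete all elements in 'cat'
```

After the deletion the category 'cat' still exists in categories(pets); only the elements are removed.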
Example
load patients % Load sample data gathered from 100 patients
whos
Name                        Size     Bytes  Class     Attributes
Age                         100x1      800  double
Diastolic                   100x1      800  double
Gender                      100x1    12212  cell
Height                      100x1      800  double
LastName                    100x1    12416  cell
Location                    100x1    15008  cell
SelfAssessedHealthStatus    100x1    12340  cell
Smoker                      100x1      100  logical
Systolic                    100x1      800  double
Weight                      100x1      800  double
% Create Categorical Arrays from Cell Arrays of Strings
Gender = categorical(Gender); % two genders
Location = categorical(Location); % three locations
% Search for Members of a Single Category
any(Location=='Rampart General Hospital'); % true if any patients were observed at the location 'Rampart General Hospital'
% Search for Members of a Group of Categories
% logical vector for the patients observed at County General Hospital or VA Hospital
VA_CountyGenIndex = ismember(Location,{'County General Hospital','VA Hospital'});
% select the LastName of the patients observed at either County General Hospital or VA Hospital
VA_CountyGenPatients = LastName(VA_CountyGenIndex);
Example (cont.)
% Select Elements in a Particular Category to Plot
% Use the summary function to print a summary containing the category names and the number of elements in each category
summary(Location)
County General Hospital 39
St. Mary's Medical Center 24
VA Hospital 37
summary(Gender)
Female 53
Male 47
figure()
histogram(Age(Gender=='Female'))
title('Age of Female Patients')
Tables
Table: a new data type suitable for holding heterogeneous data and metadata.
useful for mixed-type tabular data that is stored as columns in a text file or in a spreadsheet.
convenient containers for collecting and organizing related data variables and for viewing and summarizing data.
Tables consist of rows and column-oriented variables.
Each variable in a table can have a different data type, but all variables must have the same number of rows.
Typical use for a table: store experimental data, where rows represent different observations and columns represent different measured variables.
Create and View Table
Create a table from workspace variables
load patients % Load sample data gathered from 100 patients
whos
Name                        Size     Bytes  Class     Attributes
Age                         100x1      800  double
Diastolic                   100x1      800  double
Gender                      100x1    12212  cell
Height                      100x1      800  double
LastName                    100x1    12416  cell
Location                    100x1    15008  cell
SelfAssessedHealthStatus    100x1    12340  cell
Smoker                      100x1      100  logical
Systolic                    100x1      800  double
Weight                      100x1      800  double
To create a table, use the function table
T = table(Gender,Smoker,Height,Weight);
T(1:5,:)
ans =
    Gender      Smoker    Height    Weight
    ________    ______    ______    ______
    'Male'      true      71        176
    'Male'      false     69        163
    'Female'    false     64        131
    'Female'    false     67        133
    'Female'    false     64        119
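The same construction can be tried without the patients data set. Below is a minimal sketch with three hypothetical observations; the variables Name, Age and Smoker are made up for the illustration.

```matlab
% Minimal sketch: build a table from workspace variables of different types.
Name   = {'Ann'; 'Bob'; 'Cid'};   % cell array of strings
Age    = [38; 43; 29];            % double
Smoker = [true; false; false];    % logical

T = table(Name, Age, Smoker);     % variable names are taken from the workspace names

nRows    = height(T);             % number of rows
nVars    = width(T);              % number of variables (columns)
varNames = T.Properties.VariableNames;
ages     = T.Age;                 % dot notation returns the data of one variable
```

All three inputs have three rows; passing inputs with different numbers of rows makes table raise an error.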
Create and View Table
Use the function readtable to read data from a comma-delimited file or a spreadsheet and create a table.
readtable reads all the columns that are in a file.
T2 = readtable('patients.dat');
T2(1:5,:)
Use Import Tool
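To try readtable without depending on a particular file, one option is to write a small table to a temporary comma-delimited file and read it back. This is only a sketch: the file name comes from tempname, and the variables Label and Value are hypothetical.

```matlab
% Minimal sketch: round trip through a comma-delimited text file.
fname = [tempname '.csv'];                 % temporary file name
Tout  = table({'north'; 'south'}, [1; 2], 'VariableNames', {'Label','Value'});
writetable(Tout, fname);                   % header line plus two data rows

Tin = readtable(fname);                    % column types are detected from the file content
delete(fname);                             % remove the temporary file
```

The first line of the file supplies the variable names, so Tin has the variables Label (text) and Value (numeric).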
Table Manipulation
Add a new column: use T.varname notation
Generate a new column of patients' ID and add it to table T
T.ID = randi(1e4,100,1);
T(1:5,:)
ans =
    Gender      Smoker    Height    Weight    ID
    ________    ______    ______    ______    ____
    'Male'      true      71        176       8148
    'Male'      false     69        163       9058
    'Female'    false     64        131       1270
    'Female'    false     67        133       9134
    'Female'    false     64        119       6324
All the variables you assign to a table must have the same number of rows!
View the data type, description, units, and other descriptive statistics for each variable: summary
summary(T);
Variables:
Gender: 100×1 cell string
Smoker: 100×1 logical
Values:
true 34
false 66
...
Table Manipulation
Create a new, smaller table containing the first five rows of T and the variables (i.e. columns) from the second to the last
We can use numeric indexing within parentheses to specify rows and variables
Tnew = T(1:5, 2:end)
Tnew =
    Smoker    Height    Weight    ID
    ______    ______    ______    ____
    true      71        176       8148
    false     69        163       9058
    false     64        131       1270
    false     67        133       9134
    false     64        119       6324
Access Data by Row and Variable Names
To add row names to a table: assign the RowNames property
Set the row names using the variable LastName
T.Properties.RowNames = LastName;
T(1:5,:)
ans =
                Gender      Smoker    Height    Weight    ID
                ________    ______    ______    ______    ____
    Smith       'Male'      true      71        176       8148
    Johnson     'Male'      false     69        163       9058
    Williams    'Female'    false     64        131       1270
    Jones       'Female'    false     67        133       9134
    Brown       'Female'    false     64        119       6324
Select all the data for the patients with the last names 'Smith' and 'Johnson'
Tnew = T({'Smith','Johnson'},:)
Select the height and weight of the patient named 'Johnson'
Tnew = T('Johnson',{'Height','Weight'})
Tnew =
               Height    Weight
               ______    ______
    Johnson    69        163
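Indexing with parentheses, as above, always returns a table. To get the underlying values instead, curly braces can be used with the same row and variable names. A minimal sketch on a hypothetical three-row table:

```matlab
% Minimal sketch: parentheses return a table, curly braces return the data.
T = table([71; 69; 64], [176; 163; 131], ...
          'VariableNames', {'Height','Weight'}, ...
          'RowNames', {'Smith','Johnson','Williams'});

sub = T('Johnson', {'Height','Weight'});   % 1-by-2 table
val = T{'Johnson', {'Height','Weight'}};   % 1-by-2 numeric array
w   = T{'Smith', 'Weight'};                % single numeric value
```

Curly braces require that the selected variables can be concatenated into one array, which holds here because both are numeric.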
Add Table Rows
load patients;
T = table(LastName,Gender,Age,Height,Weight,Smoker);
size(T)
ans =
100 6
% new table from file
T2 = readtable('morePatients.txt'); % comma-delimited file with four new patients
Tnew = [T; T2];
size(Tnew)
ans =
104 6
% new table from cell array
cellPatients = {'LastName','Gender','Age','Height','Weight','Smoker'; ...
    'Edwards','Male',42,70,158,0; 'Falk','Female',28,62,125,1};
T2 = cell2table(cellPatients(2:end,:)); % create table from rows 2:end of cell array
T2.Properties.VariableNames = cellPatients(1,:); % set column names for table T2
Tnew = [Tnew; T2];
size(Tnew)
ans =
106 6
% new table from structure
structPatients.LastName = 'George';
structPatients.Gender = 'Male';
structPatients.Age = 45;
structPatients.Height = 76;
structPatients.Weight = 182;
structPatients.Smoker = 1;
Tnew = [Tnew; struct2table(structPatients)]; % create table from structPatients
size(Tnew)
ans =
107 6
Delete Table Rows
Omit duplicate rows
Tnew = unique(Tnew);
Delete rows by row number
Tnew([18,20,21],:) = []; % delete rows 18, 20 and 21
Delete rows by row name
Tnew.Properties.RowNames = Tnew.LastName;
Tnew('Smith',:) = [];
Search for rows to delete
toDelete = Tnew.Age<30;
Tnew(toDelete,:) = [];
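The deletion steps can be followed on a table small enough to check by hand. This is a minimal sketch with four hypothetical rows, one of them a duplicate.

```matlab
% Minimal sketch: remove duplicates, then delete by row name and by condition.
T = table({'Smith'; 'Jones'; 'Jones'; 'Brown'}, [38; 25; 25; 49], ...
          'VariableNames', {'LastName','Age'});

T = unique(T);                        % drops the repeated 'Jones' row and sorts the rows
T.Properties.RowNames = T.LastName;   % row names must be unique, hence unique() first
T('Smith', :) = [];                   % delete by row name
T(T.Age < 30, :) = [];                % delete every row that satisfies a condition
```

Only the row for 'Brown' remains: 'Smith' is removed by name and 'Jones' by the age condition.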
Clean Messy and Missing Data
Consider the following comma-separated text file, messy.csv
A,B,C,D,E
afe1,3,yes,3,3
egh3,.,no,7,7
dbo8,5,no,5,5
oii4,5,yes,5,5
abk6,563,,563,563
oks9,23,yes,23,23
wba3,,yes,NaN,14
adw3,22,no,22,22
poj2,-99,yes,-99,-99
bas8,23,no,23,23
gry5,NA,yes,NaN,21
Different missing data indicators in messy.csv: empty string ('' - lines 6, 8), period (. - line 3), NA (line 12), NaN (lines 8, 12), -99 (line 10).
T = readtable('messy.csv', 'TreatAsEmpty',{'.','NA'}) % create the table and specify the strings to be treated as empty values
T =
    A         B      C        D      E
    ______    ___    _____    ___    ___
    'afe1'    3      'yes'    3      3
    'egh3'    NaN    'no'     7      7
    'dbo8'    5      'no'     5      5
    'oii4'    5      'yes'    5      5
    'abk6'    563    ''       563    563
    'oks9'    23     'yes'    23     23
    'wba3'    NaN    'yes'    NaN    14
    'adw3'    22     'no'     22     22
    'poj2'    -99    'yes'    -99    -99
    'bas8'    23     'no'     23     23
    'gry5'    NaN    'yes'    NaN    21
Clean Messy and Missing Data
The strings '.' and 'NA' found in numerical columns were replaced with NaN (Not a Number)
TreatAsEmpty only applies to numeric columns in the file and cannot handle numeric literals, such as '-99'
Display the subset of rows from the table, T, that have at least one missing value.
TF = ismissing(T,{'' '.' 'NA' NaN -99});
T(any(TF,2),:) % the logical function any() is applied on the rows of TF
ans =
    A         B      C        D      E
    ______    ___    _____    ___    ___
    'egh3'    NaN    'no'     7      7
    'abk6'    563    ''       563    563
    'wba3'    NaN    'yes'    NaN    14
    'poj2'    -99    'yes'    -99    -99
    'gry5'    NaN    'yes'    NaN    21
Replace missing value indicators (-99) by NaN
T = standardizeMissing(T,-99)
Create a new table, T2, that contains only the complete rows, i.e. those without missing data.
TF = ismissing(T);
T2 = T(~any(TF,2),:)
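The cleaning steps do not depend on reading a file. Below is a minimal sketch that builds a small table in memory (a hypothetical four-row subset of the values in messy.csv, with three of the five columns) and applies the same calls.

```matlab
% Minimal sketch: standardize the -99 indicator, then keep complete rows.
T = table({'afe1'; 'egh3'; 'poj2'; 'abk6'}, [3; NaN; -99; 563], ...
          {'yes'; 'no'; 'yes'; ''}, 'VariableNames', {'A','B','C'});

T  = standardizeMissing(T, -99);   % -99 in the numeric variable B becomes NaN
TF = ismissing(T);                 % by default NaN and '' are treated as missing
T2 = T(~any(TF, 2), :);            % rows without any missing value
```

TF has one row per table row and one column per variable, so any(TF,2) flags every row with at least one missing entry; only the 'afe1' row survives.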
Add and Delete Table Variables
load patients
T = table(Age,Gender,Smoker);
T1 = table(Height,Weight,Systolic,Diastolic);
T = [T T1]; % add variables to the table T by horizontally concatenating it with T1
T(1:2,:)
ans =
    Age    Gender    Smoker    Height    Weight    Systolic    Diastolic
    ___    ______    ______    ______    ______    ________    _________
    38     'Male'    true      71        176       124         93
    43     'Male'    false     69        163       109         77
T.BloodPressure = [T.Systolic T.Diastolic]; % create a new variable for blood pressure as a horizontal concatenation of the two variables Systolic and Diastolic
T(:,{'Systolic','Diastolic'}) = []; % delete variables by name
T(1:2,:)
ans =
    Age    Gender    Smoker    Height    Weight    BloodPressure
    ___    ______    ______    ______    ______    _____________
    38     'Male'    true      71        176       124     93
    43     'Male'    false     69        163       109     77
T.BMI = (T.Weight*0.453592)./(T.Height*0.0254).^2; % Add a new variable, BMI, in the table, T, to contain the body mass index for each patient. BMI is a function of height and weight.
T(1:2,:)
ans =
    Age    Gender    Smoker    Height    Weight    BloodPressure    BMI
    ___    ______    ______    ______    ______    _____________    ______
    38     'Male'    true      71        176       124     93       24.547
    43     'Male'    false     69        163       109     77       24.071
Grouping Variables To Split Data
Split-Apply-Combine workflow: split data into groups, apply a function to each group, and combine the results
To split data variables into groups, use grouping variables
Grouping variables: variables used to group, or categorize, values in other variables.
data variables: the variables that contain observations
a grouping variable must have a value corresponding to each value in the data variables
data values belong to the same group when the corresponding values in the grouping variable are the same
The Split-Apply-Combine Workflow
Step 1. Select grouping variables
Step 2. Split data variables into groups
Step 3. Apply functions to the groups
Step 4. Combine the results
The findgroups function returns a vector of group numbers that define groups based on the unique values in the grouping variables
The splitapply function uses the group numbers to split the data into groups efficiently before applying a function.
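The four steps can be traced on a data set small enough to verify by hand. This is a minimal sketch; the grouping variable grp and the data variable values are hypothetical.

```matlab
% Minimal sketch of split-apply-combine on five observations.
grp    = {'a'; 'b'; 'a'; 'b'; 'a'};   % step 1: grouping variable
values = [1; 10; 2; 20; 3];           % data variable

[G, names] = findgroups(grp);         % step 2: group numbers, one per observation
groupMeans = splitapply(@mean, values, G);   % step 3: apply mean to each group
result     = table(names, groupMeans);       % step 4: combine the results
```

G is [1;2;1;2;1] and names lists the groups in sorted order, so the first mean (2) belongs to 'a' and the second (15) to 'b'.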
Example
load patients
% Convert Gender and SelfAssessedHealthStatus to categorical arrays
Gender = categorical(Gender);
SelfAssessedHealthStatus = categorical(SelfAssessedHealthStatus);
% Split the patients into nonsmokers and smokers using the Smoker variable. Calculate the mean weight for each group.
[G,smoker] = findgroups(Smoker);
meanWeight = splitapply(@mean,Weight,G)
meanWeight =
149.9091
161.9412
% Split the patient weights by both gender and status as a smoker and calculate the mean weights
G = findgroups(Gender,Smoker);
meanWeight = splitapply(@mean,Weight,G)
meanWeight =
130.3250
130.9231
180.0385
181.1429
% Summarize the four groups and their mean weights in a table
[G,gender,smoker] = findgroups(Gender,Smoker);
T = table(gender,smoker,meanWeight)
T =
    gender    smoker    meanWeight
    ------    ------    ----------
    Female    false     130.32
    Female    true      130.92
    Male      false     180.04
    Male      true      181.14
Example
% Calculate body mass index (BMI) for the four groups of patients.
% Define a function that takes Height and Weight as its two input arguments, and that calculates BMI.
meanBMIfcn = @(h,w)mean((w ./ (h.^2)) * 703);
BMI = splitapply(meanBMIfcn, Height, Weight, G)
BMI =
21.6721
21.6686
26.5775
26.4584
% Calculate the fraction of patients who report their health as either Poor or Fair.
% First, use splitapply to count the number of patients in each group.
% Then, count only those patients who report their health as either Poor or Fair, using logical indexing on S and G.
% From these two sets of counts, calculate the fraction for each group.
[G,gender,smoker] = findgroups(Gender,Smoker);
S = SelfAssessedHealthStatus;
I = ismember(S,{'Poor','Fair'});
numPatients = splitapply(@numel,S,G);
numPF = splitapply(@numel,S(I),G(I));
numPF./numPatients
ans =
0.2500
0.3846
0.3077
0.1429
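The same kind of fraction can also be obtained in one step by averaging the logical vector within each group, since the mean of a logical vector is the share of true values. This is an alternative sketch, not the method used on the slide, and it runs on hypothetical data (G, S and fracPF below are made up for the illustration).

```matlab
% Minimal sketch: fraction of 'Poor' or 'Fair' per group via the mean of a logical vector.
G = [1; 1; 1; 2; 2];                                        % hypothetical group numbers
S = categorical({'Poor'; 'Good'; 'Fair'; 'Good'; 'Excellent'});
I = ismember(S, {'Poor','Fair'});                           % true where status is Poor or Fair

fracPF = splitapply(@mean, I, G);                           % share of true values in each group
```

In this example group 2 contains no 'Poor' or 'Fair' entries and simply gets the fraction 0, whereas the count-based version indexes G with I and so would not produce an entry for such a group.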