
Programming MATLAB
Paul Cotofrei
information management institute master of science in finance
2017

Categorical Arrays
- Starting from version R2013, MATLAB introduced new data types
  - Categorical Arrays
  - Tables
  - Map containers
  - Time Series
  - Date and Time arrays
- Categorical arrays: data type to store data with values from a finite set of discrete categories.
- The categories can have a natural order, but it is not required.
- Provides efficient storage and convenient manipulation of nonnumeric data, while also maintaining meaningful names for the values.

Create categorical arrays
- By default, categorical arrays contain categories that have no mathematical ordering.
  - Example: the set of categories {'dog', 'cat', 'bird'}
- But it is possible to create an ordinal categorical array, whose categories have a meaningful mathematical ordering.
  - For example, the set of size categories {'small', 'medium', 'large'} has the mathematical ordering small < medium < large.
- Generate a categorical array from a cell array of strings
    >> labels = {'MA','ME','CT','VT','ME','NH','VT','MA','NH','CT','RI'};
    >> state = categorical(labels);
    >> categories(state) % List the discrete categories in the variable state
        'CT'
        'MA'
        'ME'
        'NH'
        'RI'
        'VT'
    % The categories are listed in alphabetical order.

Generate categorical arrays
- Generate an ordinal categorical array from a cell array of strings
    % 1-by-8 cell array of strings containing the sizes of eight objects
    >> allsizes = {'medium', 'large', 'small', 'small', 'medium', 'large', 'medium', 'small'};
    % cell array specifying the set of categories
    >> valueset = {'small','medium','large'};
    % ordinal categorical array
    >> sizeorder = categorical(allsizes, valueset, 'Ordinal', true)
    sizeorder =
        medium large small small medium large medium small
    >> categories(sizeorder)
        'small'
        'medium'
        'large'
    % for an ordinal categorical array, the first category specified is the smallest and the last category is the largest
- By default, the name of a category is the name of the value (string), but it is possible to specify the category names
    >> A = {'r' 'b' 'g'; 'g' 'r' 'b'}
    A =
        'r' 'b' 'g'
        'g' 'r' 'b'
    >> B = categorical(A, {'b', 'r', 'g'}, {'blue', 'red', 'green'})
    B =
        red blue green
        green red blue
    >> categories(B)
        'blue'
        'red'
        'green'

Generate categorical arrays
- Generate a categorical array from a numerical array
    >> A = randi([1, 3], 1, 4)
    A =
        3 1 2 1
    >> valueset = 1:3;
    >> catnames = {'un', 'deux', 'trois'};
    >> B = categorical(A, valueset, catnames)
    B =
        trois un deux un
- Generate an ordinal categorical array from a numerical array
    >> A = randi([1, 3], 1, 5)
    A =
        2 3 1 2 1
    >> valueset = 1:3;
    >> catnames = {'child', 'adult', 'senior'};
    >> B = categorical(A, valueset, catnames, 'Ordinal', true)
    B =
        adult senior child adult child
    >> categories(B)
        'child'
        'adult'
        'senior'

Generate categorical array
- Function discretize(X, edges, 'categorical', categoryNames):
  - Creates a categorical array where each bin is a category
  - The jth bin contains element X(i) if edges(j) ≤ X(i) < edges(j + 1) for 1 ≤ j < N, where N is the number of bins and length(edges) = N + 1. The last bin contains both edges, such that edges(N) ≤ X(i) ≤ edges(N + 1).
  - The category names are set with the strings from the cell array categoryNames, whose length must be equal to the number of bins.
    >> x = rand(100,1)*50;
    >> catnames = {'small','medium','large'};
    >> binnedData = discretize(x, [0 15 35 50], 'categorical', catnames);
    >> summary(binnedData) % print the number of elements in each category
        small 30
        medium 35
        large 35

Access Data Using Categorical Arrays
- Select data by category
  - Select elements from particular categories: for categorical arrays, use the logical operators == or ~= to select data that is in, or not in, a particular category. To select data in a particular group of categories, use the ismember function. For ordinal categorical arrays, use the inequalities >, >=, <, or <= to find data in categories above or below a particular category.
  - Delete data that is in a particular category: use logical operators to include or exclude data from particular categories.
  - Find elements that are not in a defined category: categorical arrays mark elements that do not belong to a defined category as <undefined>. Use the isundefined function to find observations without a defined value.
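
The selection rules above can be sketched on a small ordinal array. The data and the variable names (raw, sz, kept and so on) are assumptions made for this illustration only; they are not part of the patients data set used in the slides.

```matlab
% Illustrative data: 'huge' is not in the value set, so it becomes <undefined>
raw      = {'small','large','medium','huge','small','large'};
valueset = {'small','medium','large'};
sz       = categorical(raw, valueset, 'Ordinal', true);

isSmall    = (sz == 'small');                   % elements in one category
inGroup    = ismember(sz, {'small','medium'});  % elements in a group of categories
aboveSmall = (sz > 'small');                    % ordinal comparison
missingIdx = isundefined(sz);                   % elements without a defined category

% Delete the elements of one category, then drop the undefined ones
kept = sz(sz ~= 'small');
kept = kept(~isundefined(kept));
```

Comparisons with == and > return false for undefined elements, which is why the undefined entry has to be removed in a separate step with isundefined.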

Example
load patients % Load sample data gathered from 100 patients
whos
  Name                        Size     Bytes    Class      Attributes
  Age                         100×1      800    double
  Diastolic                   100×1      800    double
  Gender                      100×1    12212    cell
  Height                      100×1      800    double
  LastName                    100×1    12416    cell
  Location                    100×1    15008    cell
  SelfAssessedHealthStatus    100×1    12340    cell
  Smoker                      100×1      100    logical
  Systolic                    100×1      800    double
  Weight                      100×1      800    double
% Create Categorical Arrays from Cell Arrays of Strings
Gender = categorical(Gender); % two genders
Location = categorical(Location); % three locations
% Search for Members of a Single Category
any(Location == 'Rampart General Hospital'); % true if any patients were observed at the location 'Rampart General Hospital'
% Search for Members of a Group of Categories
% logical vector for the patients observed at County General Hospital or VA Hospital
VA_CountyGenIndex = ismember(Location, {'County General Hospital','VA Hospital'});
% select the LastName of the patients observed at either County General Hospital or VA Hospital
VA_CountyGenPatients = LastName(VA_CountyGenIndex);

Example (cont.)
% Select Elements in a Particular Category to Plot
% Use the summary function to print a summary containing the category names and the number of elements in each category
summary(Location)
    County General Hospital      39
    St. Mary's Medical Center    24
    VA Hospital                  37
summary(Gender)
    Female    53
    Male      47
figure()
histogram(Age(Gender == 'Female'))
title('Age of Female Patients')

Tables
- Table: a new data type suitable for holding heterogeneous data and metadata.
  - useful for mixed-type tabular data that is stored as columns in a text file or in a spreadsheet.
  - convenient containers for collecting and organizing related data variables and for viewing and summarizing data.
- Tables consist of rows and column-oriented variables.
- Each variable in a table can have a different data type, but all variables must have the same number of rows.
- Typical use for a table: store experimental data, where rows represent different observations and columns represent different measured variables.
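
A minimal sketch of the points above: one table holding variables of four different types, all with the same number of rows. The names and values are made up for this illustration.

```matlab
% Three hypothetical observations with variables of different types
Name   = {'Ann'; 'Bob'; 'Cy'};                 % cell array of strings
Age    = [34; 51; 28];                         % double
Member = [true; false; true];                  % logical
Level  = categorical({'low'; 'high'; 'low'});  % categorical

T = table(Name, Age, Member, Level);

size(T)          % rows (observations) by variables (columns)
class(T.Level)   % each variable keeps its own data type
```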

Create and View Table
- Create a table from workspace variables
    load patients % Load sample data gathered from 100 patients
    whos
      Name                        Size     Bytes    Class      Attributes
      Age                         100×1      800    double
      Diastolic                   100×1      800    double
      Gender                      100×1    12212    cell
      Height                      100×1      800    double
      LastName                    100×1    12416    cell
      Location                    100×1    15008    cell
      SelfAssessedHealthStatus    100×1    12340    cell
      Smoker                      100×1      100    logical
      Systolic                    100×1      800    double
      Weight                      100×1      800    double
- To create a table, use the function table
    T = table(Gender,Smoker,Height,Weight);
    T(1:5,:)
    ans =
        Gender      Smoker    Height    Weight
        ________    ______    ______    ______
        'Male'      true      71        176
        'Male'      false     69        163
        'Female'    false     64        131
        'Female'    false     67        133
        'Female'    false     64        119

Create and View Table
- Use the function readtable to read data from a comma-delimited file or a spreadsheet and create a table.
- readtable reads all the columns that are in a file.
    T2 = readtable('patients.dat');
    T2(1:5,:)
- Use the Import Tool
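
A self-contained way to try readtable without the course files is to write a small table to a temporary comma-delimited file and read it back. The file name and the table contents below are assumptions for this sketch.

```matlab
% Write a small illustrative table to a temporary comma-delimited file
fname = [tempname '.csv'];      % hypothetical temporary file
Tout  = table({'Ann'; 'Bob'}, [34; 51], 'VariableNames', {'Name', 'Age'});
writetable(Tout, fname);

% Read it back; the first line of the file supplies the variable names
Tin = readtable(fname);
delete(fname);                  % clean up the temporary file

disp(Tin.Properties.VariableNames)
disp(class(Tin.Age))
```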

Table Manipulation
- Add a new column: use the T.varname notation
  - Generate a new column of patients' IDs and add it to table T
    T.ID = randi(1e4,100,1);
    T(1:5,:)
    ans =
        Gender      Smoker    Height    Weight    ID
        ________    ______    ______    ______    ____
        'Male'      true      71        176       8148
        'Male'      false     69        163       9058
        'Female'    false     64        131       1270
        'Female'    false     67        133       9134
        'Female'    false     64        119       6324
- All the variables you assign to a table must have the same number of rows!
- View the data type, description, units, and other descriptive statistics for each variable: summary
    summary(T);
    Variables:
        Gender: 100×1 cell string
        Smoker: 100×1 logical
            Values:
                true     34
                false    66
        ........
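
The row-count rule can be checked directly. The sketch below uses a small made-up table; the variable names Bad and rowsMismatchDetected are hypothetical and exist only for this illustration.

```matlab
% Small illustrative table with three rows
T = table([71; 69; 64], [176; 163; 131], 'VariableNames', {'Height', 'Weight'});

% Dot notation adds a variable when the row counts agree
T.ID = [101; 102; 103];

% Assigning a variable with a different number of rows raises an error
try
    T.Bad = [1; 2];
    rowsMismatchDetected = false;
catch
    rowsMismatchDetected = true;
end

summary(T)    % prints the type and descriptive statistics of each variable
```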

Table Manipulation
- Create a new, smaller table containing the first five rows of T and the variables (i.e. columns) from the second to the last
  - We can use numeric indexing within parentheses to specify rows and variables
    Tnew = T(1:5, 2:end)
    Tnew =
        Smoker    Height    Weight    ID
        ______    ______    ______    ____
        true      71        176       8148
        false     69        163       9058
        false     64        131       1270
        false     67        133       9134
        false     64        119       6324
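
As a small sketch of how parenthesis indexing behaves, the example below builds a made-up table and compares it with dot notation. The values and the names sub, heavy and w are assumptions for this illustration.

```matlab
% Illustrative table (values are made up for this sketch)
T = table([71; 69; 64], [176; 163; 131], [true; false; false], ...
          'VariableNames', {'Height', 'Weight', 'Smoker'});

sub   = T(1:2, 2:end);         % rows 1-2, variables 2 to last: still a table
heavy = T(T.Weight > 150, :);  % logical row index built from one variable
w     = T.Weight;              % dot notation returns the variable's contents

class(sub)    % 'table'
class(w)      % 'double'
```

Parentheses always return a table, even for a single row or variable, while dot notation returns the underlying array.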

Access Data by Row and Variable Names
- To add row names to a table: assign the RowNames property
  - Set the row names using the variable LastName
    T.Properties.RowNames = LastName;
    T(1:5,:)
    ans =
                    Gender      Smoker    Height    Weight    ID
                    ________    ______    ______    ______    ____
        Smith       'Male'      true      71        176       8148
        Johnson     'Male'      false     69        163       9058
        Williams    'Female'    false     64        131       1270
        Jones       'Female'    false     67        133       9134
        Brown       'Female'    false     64        119       6324
- Select all the data for the patients with the last names 'Smith' and 'Johnson'
    Tnew = T({'Smith','Johnson'},:)
- Select the height and weight of the patient named 'Johnson'
    Tnew = T('Johnson',{'Height','Weight'})
    Tnew =
                   Height    Weight
                   ______    ______
        Johnson    69        163
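
The same name-based indexing can be tried on a three-row table without loading the patients data. The values below are copied from the first rows shown above; the names twoRows and hw are assumptions for this sketch.

```matlab
T = table([71; 69; 64], [176; 163; 131], 'VariableNames', {'Height', 'Weight'});
T.Properties.RowNames = {'Smith'; 'Johnson'; 'Williams'};  % row names must be unique

twoRows = T({'Smith', 'Johnson'}, :);          % all variables for two named rows
hw      = T('Johnson', {'Height', 'Weight'});  % one named row, two named variables
```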

Add Table Rows
load patients;
T = table(LastName,Gender,Age,Height,Weight,Smoker);
size(T)
ans =
    100 6
% new table from file
T2 = readtable('morePatients.txt'); % comma-delimited file with four new patients
Tnew = [T; T2];
size(Tnew)
ans =
    104 6
% new table from cell array
cellPatients = {'LastName','Gender','Age','Height','Weight','Smoker'; ...
    'Edwards','Male',42,70,158,0; 'Falk','Female',28,62,125,1};
T2 = cell2table(cellPatients(2:end,:)); % create table from rows 2:end of cell array
T2.Properties.VariableNames = cellPatients(1,:); % set column names for table T2
Tnew = [Tnew; T2];
size(Tnew)
ans =
    106 6
% new table from structure
structPatients.LastName = 'George';
structPatients.Gender = 'Male';
structPatients.Age = 45;
structPatients.Height = 76;
structPatients.Weight = 182;
structPatients.Smoker = 1;
Tnew = [Tnew; struct2table(structPatients)]; % create table from structPatients
size(Tnew)
ans =
    107 6
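
Because morePatients.txt is not included here, the sketch below repeats the cell-array and structure steps on a two-variable table with made-up values. It assumes, as in the example above, that the tables being stacked have the same variable names.

```matlab
% Small illustrative table
T = table({'Smith'; 'Jones'}, [38; 43], 'VariableNames', {'LastName', 'Age'});

% Rows from a cell array: convert, reusing the variable names of T
C  = {'Edwards', 42; 'Falk', 28};
T2 = cell2table(C, 'VariableNames', T.Properties.VariableNames);
Tnew = [T; T2];

% One row from a scalar structure whose field names match the variable names
s.LastName = 'George';
s.Age      = 45;
Tnew = [Tnew; struct2table(s)];

size(Tnew)
```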

Delete Table Rows
- Omit duplicate rows
    Tnew = unique(Tnew);
- Delete rows by row number
    Tnew([18,20,21],:) = []; % delete rows 18, 20 and 21
- Delete rows by row name
    Tnew.Properties.RowNames = Tnew.LastName;
    Tnew('Smith',:) = [];
- Search for rows to delete
    toDelete = Tnew.Age < 30;
    Tnew(toDelete,:) = [];

Clean Messy and Missing Data
- Consider the following comma-separated text file, messy.csv
    A,B,C,D,E
    afe1,3,yes,3,3
    egh3,.,no,7,7
    dbo8,5,no,5,5
    oii4,5,yes,5,5
    abk6,563,,563,563
    oks9,23,yes,23,23
    wba3,,yes,NaN,14
    adw3,22,no,22,22
    poj2,-99,yes,-99,-99
    bas8,23,no,23,23
    gry5,NA,yes,NaN,21
- Different missing data indicators in messy.csv: empty string ('' on lines 6, 8), period (. on line 3), NA (line 12), NaN (lines 8, 12), -99 (line 10).
    T = readtable('messy.csv', 'TreatAsEmpty', {'.','NA'}) % create the table and specify the strings to be treated as empty values
    T =
        A         B      C        D      E
        ______    ___    _____    ___    ___
        'afe1'      3    'yes'      3      3
        'egh3'    NaN    'no'       7      7
        'dbo8'      5    'no'       5      5
        'oii4'      5    'yes'      5      5
        'abk6'    563    ''       563    563
        'oks9'     23    'yes'     23     23
        'wba3'    NaN    'yes'    NaN     14
        'adw3'     22    'no'      22     22
        'poj2'    -99    'yes'    -99    -99
        'bas8'     23    'no'      23     23
        'gry5'    NaN    'yes'    NaN     21

Clean Messy and Missing Data
- The strings '.' and 'NA' found in numerical columns were replaced with NaN (Not a Number)
- TreatAsEmpty only applies to numeric columns in the file and cannot handle numeric literals, such as '-99'
- Display the subset of rows from the table, T, that have at least one missing value.
    TF = ismissing(T, {'' '.' 'NA' NaN -99});
    T(any(TF,2),:) % the logical function any() is applied on the rows of TF
    ans =
        A         B      C        D      E
        ______    ___    _____    ___    ___
        'egh3'    NaN    'no'       7      7
        'abk6'    563    ''       563    563
        'wba3'    NaN    'yes'    NaN     14
        'poj2'    -99    'yes'    -99    -99
        'gry5'    NaN    'yes'    NaN     21
- Replace missing value indicators (-99) by NaN
- Create a new table, T2, that contains only the complete rows, i.e. those without missing data.
    T = standardizeMissing(T,-99)
    TF = ismissing(T);
    T2 = T(~any(TF,2),:)

Add and Delete Table Variables
load patients
T = table(Age,Gender,Smoker);
T1 = table(Height,Weight,Systolic,Diastolic);
T = [T T1]; % add variables to the table T by horizontally concatenating it with T1
T(1:2,:)
ans =
    Age    Gender    Smoker    Height    Weight    Systolic    Diastolic
    ___    ______    ______    ______    ______    ________    _________
    38     'Male'    true      71        176       124         93
    43     'Male'    false     69        163       109         77
T.BloodPressure = [T.Systolic T.Diastolic]; % create a new variable for blood pressure as a horizontal concatenation of the two variables Systolic and Diastolic
T(:,{'Systolic','Diastolic'}) = []; % delete variables by name
T(1:2,:)
ans =
    Age    Gender    Smoker    Height    Weight    BloodPressure
    ___    ______    ______    ______    ______    _____________
    38     'Male'    true      71        176       124     93
    43     'Male'    false     69        163       109     77
% Add a new variable, BMI, in the table, T, to contain the body mass index for each patient. BMI is a function of height and weight.
T.BMI = (T.Weight*0.453592)./(T.Height*0.0254).^2;
T(1:2,:)
ans =
    Age    Gender    Smoker    Height    Weight    BloodPressure    BMI
    ___    ______    ______    ______    ______    _____________    ______
    38     'Male'    true      71        176       124     93       24.547
    43     'Male'    false     69        163       109     77       24.071

Grouping Variables To Split Data
- Split-Apply-Combine workflow: split data into groups, apply a function to each group, and combine the results
- To split data variables into groups, use grouping variables
- Grouping variables: variables used to group, or categorize, values in other variables.
  - data variables: the variables that contain observations
  - a grouping variable must have a value corresponding to each value in the data variables
  - data values belong to the same group when the corresponding values in the grouping variable are the same

The Split-Apply-Combine Workflow
- Step 1. Select grouping variables
- Step 2. Split data variables into groups
- Step 3. Apply functions to the groups
- Step 4. Combine the results
- The findgroups function returns a vector of group numbers that define groups based on the unique values in the grouping variables
- The splitapply function uses the group numbers to split the data into groups efficiently before applying a function.

Example
load patients
% Convert Gender and SelfAssessedHealthStatus to categorical arrays
Gender = categorical(Gender);
SelfAssessedHealthStatus = categorical(SelfAssessedHealthStatus);
% Split the patients into nonsmokers and smokers using the Smoker variable. Calculate the mean weight for each group.
[G,smoker] = findgroups(Smoker);
meanWeight = splitapply(@mean,Weight,G)
meanWeight =
    149.9091
    161.9412
% Split the patient weights by both gender and status as a smoker and calculate the mean weights
G = findgroups(Gender,Smoker);
meanWeight = splitapply(@mean,Weight,G)
meanWeight =
    130.3250
    130.9231
    180.0385
    181.1429
% Summarize the four groups and their mean weights in a table
[G,gender,smoker] = findgroups(Gender,Smoker);
T = table(gender,smoker,meanWeight)
T =
    gender    smoker    meanWeight
    ______    ______    __________
    Female    false     130.32
    Female    true      130.92
    Male      false     180.04
    Male      true      181.14

Example
% Calculate body mass index (BMI) for the four groups of patients.
% Define a function that takes Height and Weight as its two input arguments, and that calculates BMI.
meanBMIfcn = @(h,w) mean((w ./ (h.^2)) * 703);
BMI = splitapply(meanBMIfcn, Height, Weight, G)
BMI =
    21.6721
    21.6686
    26.5775
    26.4584
% Calculate the fraction of patients who report their health as either Poor or Fair.
% First, use splitapply to count the number of patients in each group.
% Then, count only those patients who report their health as either Poor or Fair, using logical indexing on S and G.
% From these two sets of counts, calculate the fraction for each group.
[G,gender,smoker] = findgroups(Gender,Smoker);
S = SelfAssessedHealthStatus;
I = ismember(S,{'Poor','Fair'});
numPatients = splitapply(@numel,S,G);
numPF = splitapply(@numel,S(I),G(I));
numPF./numPatients
ans =
    0.2500
    0.3846
    0.3077
    0.1429
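
The four workflow steps can be sketched end to end on a small made-up data set. All values and names below (gender, smoker, weight, poorOrFair, result) are assumptions for this illustration, not the patients data. The sketch also swaps in a different counting technique from the example above: it sums a logical mask per group instead of calling @numel on filtered data, because splitapply expects every group number from 1 to max(G) to occur, and filtering could leave a group empty.

```matlab
% Illustrative data (not the patients data set)
gender     = categorical({'F'; 'M'; 'F'; 'M'; 'F'; 'M'});
smoker     = [false; false; true; false; false; true];
weight     = [130; 180; 128; 176; 134; 190];
poorOrFair = [false; true; false; false; true; false];

% Steps 1-2: select grouping variables and split into numbered groups
[G, gKey, sKey] = findgroups(gender, smoker);

% Step 3: apply one function call per group
meanWeight = splitapply(@mean,  weight,     G);
numInGroup = splitapply(@numel, weight,     G);
numPF      = splitapply(@sum,   poorOrFair, G);  % zero for groups with no match

% Step 4: combine the group keys and results in one table
result = table(gKey, sKey, meanWeight, numPF ./ numInGroup, ...
               'VariableNames', {'gender', 'smoker', 'meanWeight', 'fracPoorFair'});
disp(result)
```

With this approach a group that has no Poor or Fair reports yields a fraction of 0 rather than an error.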
