TextMining + Random Forest + GLMNET
Yining Zhou
2/10/2018
1. Preparing Dataset
train <- read.csv('trainData.csv', header = TRUE)
test <- read.csv('testData.csv', header = TRUE)
Other_FC <- read.csv('OOS_OFC.csv', header = TRUE)
data <- rbind(train, test)
a) Include OOS_Other_FC into the original dataset
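The merging code is not shown in the report; a minimal sketch, assuming OOS_OFC.csv is keyed by the same ID column as the train/test data (the OOS_Other_FC column does appear later in the variable importance table):

# Assumption: Other_FC carries an OOS_Other_FC flag keyed by product ID
data <- merge(data, Other_FC, by = 'ID', all.x = TRUE)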
b) Replace strings with 'Others' if occurrence < 20
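No code is shown for this step either; one way to lump rare levels, sketched here with the category columns as an assumed target:

# Collapse levels that occur fewer than 20 times into 'Others'
lump_rare <- function(x, min_n = 20) {
  x <- as.character(x)
  rare <- names(which(table(x) < min_n))
  x[x %in% rare] <- 'Others'
  factor(x)
}
data$product_category_level3 <- lump_rare(data$product_category_level3)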
c) Just to see the data
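The code behind the output below is omitted; it is presumably a class table plus the majority-class baseline, which reproduces both numbers:

table(data$OOS_within_30)
# Baseline accuracy of always predicting 'No': 2040 / 2276
max(table(data$OOS_within_30)) / nrow(data)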
  No  Yes 
2040  236 
[1] 0.8963093
2. Text Mining
a) Create corpus
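The corpus-building code is not printed; a sketch with the tm package, assuming the text field is the product Name column (the formulas later exclude Name, so it is the free-text field):

library(tm)
corpus <- VCorpus(VectorSource(as.character(data$Name)))
# Print the first two documents, giving the summaries below
corpus[[1]]
corpus[[2]]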
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 104

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 83
b) Create a bag of words matrix
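The preprocessing steps are also not shown, but the stemmed terms in the output ('recip', 'natur', 'grainfre') imply lowercasing, punctuation and stopword removal, and stemming before the matrix is built; a sketch (removeNumbers is an assumption, and stemDocument requires the SnowballC package):

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
dtm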
<<DocumentTermMatrix (documents: 2276, terms: 1918)>>
Non-/sparse entries: 23951/4341417
Sparsity           : 99%
Maximal term length: 15
Weighting          : term frequency (tf)
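The frequency counts below can be reproduced from the column sums of the matrix; the exact call is not shown, so this is a sketch:

freq <- colSums(as.matrix(dtm))
# Five most frequent (stemmed) terms
head(sort(freq, decreasing = TRUE), 5)

The [1] 746 that follows is presumably the size of some frequent-term subset, e.g. length(findFreqTerms(dtm, lowfreq)); the cutoff is not recoverable from the output.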
  dog   bag  food   cat treat 
 1479  1048  1040   753   659 
[1] 746
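The 30-term matrix printed below is consistent with dropping sparse terms from the 1918-term matrix; a sketch, with the 0.99 sparsity threshold an assumption:

# Keep only terms that appear in more than roughly 1% of documents
dtm <- removeSparseTerms(dtm, 0.99)
dtm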
<<DocumentTermMatrix (documents: 2276, terms: 30)>>
Non-/sparse entries: 10870/57410
Sparsity           : 84%
Maximal term length: 8
Weighting          : term frequency (tf)
c) Convert the matrix to a new data frame
new.data <- as.data.frame(as.matrix(dtm))
new.data$ID <- data$ID
# Move ID (column 31) in front of the 30 term columns
new.data <- new.data[, c(31, 1:30)]
#new.data$OOS_within_30 <- as.factor(data$OOS_within_30)
d) Merge the dataset after text mining with the original dataset
# Merge the two datasets
new.data <- merge(data, new.data, by = 'ID')
e) Split data into train and test
library(caTools)
set.seed(1234)
spl <- sample.split(new.data$OOS_within_30, 0.85)
new.train <- subset(new.data, spl == TRUE)
new.test <- subset(new.data, spl == FALSE)
3. Try KKNN
library(kknn)
m.kknn <- train.kknn(OOS_within_30 ~ . - ID - Name, data = new.train)
m.kknn

Call:
train.kknn(formula = OOS_within_30 ~ . - ID - Name, data = new.train)

Type of response variable: nominal
Minimal misclassification: 0.1095607
Best kernel: optimal
Best k: 11

pred.kknn <- predict(m.kknn, newdata = new.test)
table(Truth = new.test$OOS_within_30, Predicted = pred.kknn)
     Predicted
Truth  No Yes
  No  300   6
  Yes  33   2

# Bug: new.test has no pred.kknn column, so the comparison below is empty
# and returns NaN; the intended call is
# mean(pred.kknn == new.test$OOS_within_30), which from the confusion
# matrix above is (300 + 2) / 341, i.e. about 0.886.
mean(new.test$pred.kknn == new.test$OOS_within_30)
[1] NaN
4. Try Random Forest

Call:
 randomForest(formula = OOS_within_30 ~ . - ID - Name, data = new.train, importance = TRUE, na.action = na.roughfix)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of error rate: 10.7%
Confusion matrix:
      No Yes class.error
No  1720  14 0.008073818
Yes  193   8 0.960199005

                                    MeanDecreaseGini
forecast                                  39.3620141
demand                                    36.9304956
fill_rate                                 33.7123432
safety_stock                              14.0029996
lead_time_lag                             24.1964730
OOS_same_fc                               11.3111855
Zero_ships_past_five                      12.8111059
product_category_level1                    5.3433764
product_category_level2                   14.4254023
product_category_level3                   19.6526979
product_autoship_save_eligible_flag        5.1297012
OOS_Other_FC                              18.6719136
bag                                        5.6032600
dog                                        4.8421476
treat                                      6.2941699
chicken                                    4.0455230
recip                                      5.6931636
can                                        2.3836258
case                                       1.9029153
food                                       2.0650081
dri                                        1.9417905
cat                                        6.1309877
natur                                      3.8409330
beef                                       2.4753771
rice                                       1.5448059
canin                                      1.9257915
diet                                       2.5260253
royal                                      1.6897281
chew                                       2.6239081
flavor                                     2.1651216
count                                      2.9336475
small                                      3.2962499
larg                                       1.4913266
toy                                        1.4337833
litter                                     1.6308956
formula                                    3.1717094
grainfre                                   3.3966532
potato                                     2.9155368
hill                                       0.9219165
adult                                      2.7321387
blue                                       3.0810492
buffalo                                    2.2788492
a) See how the random forest model is doing
     Predicted
Truth  No Yes
  No  304   2
  Yes  32   3
[1] 0.9002933
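The report prints the random forest's Call but not the code that produced it or the evaluation above; a sketch consistent with both (the seed and imputing the test set with na.roughfix are assumptions):

library(randomForest)
set.seed(1234)                      # assumed; the actual seed is not shown
m.rf <- randomForest(OOS_within_30 ~ . - ID - Name, data = new.train,
                     importance = TRUE, na.action = na.roughfix)
m.rf                                # OOB error and confusion matrix
importance(m.rf, type = 2)          # MeanDecreaseGini table above
# na.roughfix imputes NAs in the test set so predict() does not fail
pred.rf <- predict(m.rf, newdata = na.roughfix(new.test))
table(Truth = new.test$OOS_within_30, Predicted = pred.rf)
mean(pred.rf == new.test$OOS_within_30)   # (304 + 3) / 341 = 0.9002933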
5. Try GLMNET (Most Useless One)
a) Preparing Data for GLMNET
b) Fit models with ridge regression (RR), lasso (LA), and elastic net (EN)
# Make the results reproducible
set.seed(1)
# Ridge regression
m.ridge <- cv.glmnet(x = x.train, y = y.train, family = 'binomial', alpha = 0)
# Lasso regression
m.lasso <- cv.glmnet(x = x.train, y = y.train, family = 'binomial', alpha = 1)
# Elastic net
m.elnet <- cv.glmnet(x = x.train, y = y.train, family = 'binomial', alpha = .5)
c) Predict with the RR model
d) See the Accuracy
acc
  [1] 0.884058 0.884058 0.884058 0.884058 0.884058 0.884058 0.884058
  [8] 0.884058 0.884058 0.884058 0.115942 0.115942 0.115942 0.115942
 [15] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [22] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [29] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [36] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [43] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [50] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [57] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [64] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [71] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [78] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [85] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [92] 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942 0.115942
 [99] 0.115942 0.115942
Note that the two values sum to exactly 1, so beyond the tenth lambda value every prediction flips class; the model appears to predict a single class for all observations at each lambda and never beats the majority-class rate, hence the "Most Useless One" label.
plot(acc)
[Figure: plot of acc against Index; y-axis acc from 0.2 to 0.8, x-axis Index from 0 to 100]
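Sections a) and c) print no code; sketched below is one plausible way to build the matrices glmnet needs and to score the ridge model across its full lambda path, which yields a vector of 100 accuracies matching the length of acc above (glmnet's default path has 100 lambda values). The complete-case filtering, variable names, and 0.5 cutoff are all assumptions:

library(glmnet)
# a) glmnet wants numeric matrices; keep complete cases so x and y stay aligned
train.cc <- na.omit(new.train)
test.cc  <- na.omit(new.test)
x.train <- model.matrix(OOS_within_30 ~ . - ID - Name, data = train.cc)[, -1]
y.train <- train.cc$OOS_within_30
x.test  <- model.matrix(OOS_within_30 ~ . - ID - Name, data = test.cc)[, -1]
y.test  <- test.cc$OOS_within_30

# c) Predicted probabilities at every lambda on the ridge path
p <- predict(m.ridge$glmnet.fit, newx = x.test, type = 'response')

# d) Accuracy at each lambda with a 0.5 cutoff
acc <- apply(p, 2, function(pr)
  mean(ifelse(pr > 0.5, 'Yes', 'No') == y.test))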