
I am building two different classifiers to predict a binary outcome, and I then want to compare the two models using ROC curves and the area under them (AUC).

I split my dataset into a training set and a test set. On the training set I perform a form of cross-validation, and from the held-out samples of the cross-validation I can build an ROC curve for each model. I then apply the models to the test set and build a second set of ROC curves.

The results are contradictory, which confuses me. I am not sure which result is correct, or whether I am doing something completely wrong. The held-out sample ROC curves show the RF to be the better model, while the test-set ROC curves show the SVM to be the better model.

Analysis

library(ggplot2) 
library(caret) 
library(pROC) 
library(ggthemes) 
library(plyr) 
library(ROCR) 
library(reshape2) 
library(gridExtra) 

my_data <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv") 

str(my_data) 
names(my_data)[1] <- "Class" 
my_data$Class <- ifelse(my_data$Class == 1, "event", "noevent") 

my_data$Class <- factor(my_data$Class, levels = c("noevent", "event"), ordered = TRUE) 

set.seed(1732) 
ind <- createDataPartition(my_data$Class, p = 2/3, list = FALSE) 
train <- my_data[ ind,] 
test <- my_data[-ind,] 

Next I train the two models: a random forest and a support vector machine. Here I also use Max Kuhn's function to get the averaged ROC curve from the held-out samples of each model, and I save those results, together with the AUC of each curve, into another data frame.

#Train RF 
ctrl <- trainControl(method = "repeatedcv", 
       number = 5, 
       repeats = 3, 
       classProbs = TRUE, 
       savePredictions = TRUE, 
       summaryFunction = twoClassSummary) 

grid <- data.frame(mtry = seq(1,3,1)) 

set.seed(1537) 
rf_mod <- train(Class ~ ., 
       data = train, 
       method = "rf", 
       metric = "ROC", 
       tuneGrid = grid, 
       ntree = 1000, 
       trControl = ctrl) 


rfClasses <- predict(rf_mod, test) 

#This is the ROC curve from held-out samples. Source is Max Kuhn's 2016 UseR! code here: https://github.com/topepo/useR2016 
roc_train <- function(object, best_only = TRUE, ...) { 


    lvs <- object$modelInfo$levels(object$finalModel) 

    if(best_only) { 
    object$pred <- merge(object$pred, object$bestTune) 
    } 

    ## find tuning parameter names 
    p_names <- as.character(object$modelInfo$parameters$parameter) 
    p_combos <- object$pred[, p_names, drop = FALSE] 

    ## average probabilities across resamples 
    object$pred <- plyr::ddply(.data = object$pred, 
         .variables = c("obs", "rowIndex", p_names), 
         .fun = function(dat, lvls = lvs) { 
          out <- mean(dat[, lvls[1]]) 
          names(out) <- lvls[1] 
          out 
         }) 

    make_roc <- function(x, lvls = lvs, nms = NULL, ...) { 
    out <- pROC::roc(response = x$obs, 
       predictor = x[, lvls[1]], 
       levels = rev(lvls)) 

    out$model_param <- x[1,nms,drop = FALSE] 
    out 
    } 
    out <- plyr::dlply(.data = object$pred, 
       .variables = p_names, 
       .fun = make_roc, 
       lvls = lvs, 
       nms = p_names) 
    if(length(out) == 1) out <- out[[1]] 
    out 
} 

temp <- roc_train(rf_mod) 

plot_data_ROC <- data.frame(Model='Random Forest', sens =  temp$sensitivities, spec=1-temp$specificities) 

#This is the AUC of the held-out samples roc curve for RF 
auc.1 <- abs(sum(diff(1-temp$specificities) *  (head(temp$sensitivities,-1)+tail(temp$sensitivities,-1)))/2) 
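
As a quick sanity check, assuming roc_train() returned a single pROC roc object here (it does when best_only = TRUE leaves only one tuning-parameter combination), the trapezoidal sum above should match pROC's own AUC:

#Optional check: pROC computes the AUC directly from the roc object 
pROC::auc(temp) 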

#Build SVM 
set.seed(1537) 
svm_mod <- train(Class ~ ., 
       data = train, 
       method = "svmRadial", 
       metric = "ROC", 
       trControl = ctrl) 

svmClasses <- predict(svm_mod, test) 

#ROC curve into df 
temp <- roc_train(svm_mod) 
plot_data_ROC <- rbind(plot_data_ROC, data.frame(Model='Support Vector Machine', sens = temp$sensitivities, spec=1-temp$specificities)) 

#This is the AUC of the held-out samples roc curve for SVM 
auc.2 <- abs(sum(diff(1-temp$specificities) * (head(temp$sensitivities,-1)+tail(temp$sensitivities,-1)))/2) 

Next, I plot the results:

#Plotting Final 

#ROC of held-out samples 
q <- ggplot(data=plot_data_ROC, aes(x=spec, y=sens, group = Model, colour =  Model)) 
q <- q + geom_path() + geom_abline(intercept = 0, slope = 1) + xlab("False Positive Rate (1-Specificity)") + ylab("True Positive Rate (Sensitivity)") 
q + theme(axis.line = element_line(), axis.text=element_text(color='black'), 
     axis.title = element_text(colour = 'black'),  legend.text=element_text(), legend.title=element_text()) 

#ROC of testing set 
rf.probs <- predict(rf_mod, test,type="prob") 
pr <- prediction(rf.probs$event, factor(test$Class, levels = c("noevent", "event"), ordered = TRUE)) 
pe <- performance(pr, "tpr", "fpr") 
roc.data <- data.frame(Model='Random Forest', fpr=unlist(pe@x.values), tpr=unlist(pe@y.values)) 

svm.probs <- predict(svm_mod, test,type="prob") 
pr <- prediction(svm.probs$event, factor(test$Class, levels = c("noevent",  "event"), ordered = TRUE)) 
pe <- performance(pr, "tpr", "fpr") 
roc.data <- rbind(roc.data, data.frame(Model='Support Vector Machine', fpr=unlist(pe@x.values), tpr=unlist(pe@y.values))) 

q <- ggplot(data=roc.data, aes(x=fpr, y=tpr, group = Model, colour = Model)) 
q <- q + geom_line() + geom_abline(intercept = 0, slope = 1) + xlab("False Positive Rate (1-Specificity)") + ylab("True Positive Rate (Sensitivity)") 
q + theme(axis.line = element_line(), axis.text=element_text(color='black'), 
     axis.title = element_text(colour = 'black'),  legend.text=element_text(), legend.title=element_text()) 


#AUC of hold out samples 
data.frame(Rf = auc.1, Svm = auc.2) 

#AUC of testing set. Source is Max Kuhn's 2016 UseR! code here: https://github.com/topepo/useR2016 
test_pred <- data.frame(Class = factor(test$Class, levels = c("noevent",  "event"), ordered = TRUE)) 
test_pred$Rf <- predict(rf_mod, test, type = "prob")[, "event"] 
test_pred$Svm <- predict(svm_mod, test, type = "prob")[, "event"] 

get_auc <- function(pred, ref){ 
    auc(roc(ref, pred, levels = rev(levels(ref)))) 
} 

apply(test_pred[, -1], 2, get_auc, ref = test_pred$Class) 

The results from the held-out samples and from the test set are completely different (I knew they would differ somewhat, but by this much?).

AUC of held-out samples:
       Rf       Svm
 0.656044 0.5983193

AUC of testing set:
        Rf       Svm
0.6326531 0.6453428

The held-out samples would have me pick the RF model, but the test set would have me pick the SVM.

Which is the "correct" or "better" way to choose a model? Am I making a big mistake somewhere, or misunderstanding something?

Answer


If I understand correctly, you have 3 labeled datasets:

  1. Training
  2. Hold-out CV samples from training
  3. The "Testing" sample

While yes, in a hold-out-CV-sample situation you would normally choose your model based on the hold-out samples, you also typically don't have a larger validation sample available. Clearly, if both the hold-out and the Testing data sets are (a) labeled and (b) as close as possible to the same level of orthogonality with the training data, then you would choose based on whichever has the larger sample size.
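
For this example the sizes are easy to check directly; a minimal sketch, assuming the train/test split created in the question:

#Rough size check: train rows feed the repeated-CV hold-outs, test rows are untouched 
nrow(train) 
nrow(test) 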

In your case, it looks like what you are calling hold-out samples are just the repeated CV resamples from training. That being the case, you have even more reason to prefer the results of validation on the Testing data set. On repeated CV, see Steffen's related note.
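
That is easy to verify from the caret object itself; a minimal sketch, assuming the rf_mod fitted above with savePredictions = TRUE:

#The stored predictions are tagged by CV fold/repeat, and every training row 
#is held out once per repeat, i.e. they all come from the training data 
head(unique(rf_mod$pred$Resample)) 
length(unique(rf_mod$pred$rowIndex)) == nrow(train) 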

In theory, Random Forest's bagging has an inherent form of cross-validation through the OOB statistics, and the CV conducted during the training phase should give you some measure of validation. In practice, however, a lack of orthogonality is commonly observed and the likelihood of overfitting increases, since the samples come from the training data itself and may reinforce the mistake of overfitting for accuracy.
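
For what it is worth, the OOB estimate is easy to pull out of the fitted forest; a minimal sketch, assuming rf_mod$finalModel is the randomForest object trained above:

#OOB error rate after all 1000 trees; it is computed from the training data itself, 
#so it carries the same caveat as the repeated-CV hold-outs 
tail(rf_mod$finalModel$err.rate[, "OOB"], 1) 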

I can explain the above theoretically only up to a point; beyond that I can just tell you that, empirically, I have found the performance results from the so-called CV and OOB errors computed on the training data can be highly misleading, and that hold-out (Testing) data never touched during training is the far better validation.

Your true hold-out sample is the Testing data set, since none of its data is used during training. Use those results.
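
If you want more than a visual comparison of the two test-set curves, pROC can also test whether the two AUCs differ; a sketch, assuming the test_pred data frame built in the question:

#Paired DeLong test comparing the two correlated test-set ROC curves 
roc_rf <- roc(test_pred$Class, test_pred$Rf, levels = rev(levels(test_pred$Class))) 
roc_svm <- roc(test_pred$Class, test_pred$Svm, levels = rev(levels(test_pred$Class))) 
roc.test(roc_rf, roc_svm, paired = TRUE) 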


Yes, I have 3 datasets: "training", "hold-out samples from training", and "testing" (I edited the second paragraph because I made a mistake explaining this). I will use the true test set and get rid of the ROC curves built from the held-out samples of the training set. Thanks for the reply! – Aerocell