Confusion matrix for random forest in R Caret

I have data with a binary YES/NO Class response. I used the following code to run an RF model, and I am having trouble getting the confusion matrix results.

library(readxl)  # read_excel()
library(caret)   # createDataPartition(), train(), confusionMatrix()

dataR <- read_excel("*:/*.xlsx")
Train <- createDataPartition(dataR$Class, p = 0.7, list = FALSE)
training <- dataR[Train, ]
testing <- dataR[-Train, ]

model_rf <- train(Class ~ ., tuneLength = 3, data = training, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv", number = 5))

Result:

Random Forest 

3006 samples 
82 predictor 
2 classes: 'NO', 'YES' 

No pre-processing 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 2405, 2406, 2405, 2404, 2404 
Addtional sampling using SMOTE 

Resampling results across tuning parameters: 

  mtry  Accuracy   Kappa
   2    0.7870921  0.2750655
  44    0.7787721  0.2419762
  87    0.7767760  0.2524898

Accuracy was used to select the optimal model using the largest value. 
The final value used for the model was mtry = 2.

So far so good, but when I run this code:

# Apply threshold of 0.50: p_class 
class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") 

# Create confusion matrix 
p <-confusionMatrix(class_log, testing[["Class"]]) 

##gives the accuracy 
p$overall[1]

I get this error:

Error in model_rf[, 1] : incorrect number of dimensions

I would appreciate it if you could help me get the confusion matrix results.

Source

2017-10-18 Mike

Print `model_rf[, 1]` to the console and look at it. – jsb

If you include a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question, it will be easier to help you. – jsb

As I understand it, you want to obtain the confusion matrix for the cross-validation in caret.

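To do this, you need to specify savePredictions in trainControl. If it is set to "final", the predictions for the best model are saved. By specifying classProbs = T, the probabilities for each class are saved as well.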

data(iris) 
iris_2 <- iris[iris$Species != "setosa",] #make a two class problem 
iris_2$Species <- factor(iris_2$Species) #drop levels 

library(caret) 
model_rf <- train(Species~., tuneLength = 3, data = iris_2, method = 
         "rf", importance = TRUE, 
        trControl = trainControl(method = "cv", 
              number = 5, 
              savePredictions = "final", 
              classProbs = T))

The predictions are in:

model_rf$pred

They are stored fold by fold; to get the predicted classes in the same order as the original data frame:

model_rf$pred[order(model_rf$pred$rowIndex), "pred"]

To get the confusion matrix:

confusionMatrix(model_rf$pred[order(model_rf$pred$rowIndex), "pred"], iris_2$Species)
#output 
Confusion Matrix and Statistics

            Reference
Prediction   versicolor virginica
  versicolor         46         6
  virginica           4        44

               Accuracy : 0.9
                 95% CI : (0.8238, 0.951)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.8
 Mcnemar's Test P-Value : 0.7518

            Sensitivity : 0.9200
            Specificity : 0.8800
         Pos Pred Value : 0.8846
         Neg Pred Value : 0.9167
             Prevalence : 0.5000
         Detection Rate : 0.4600
   Detection Prevalence : 0.5200
      Balanced Accuracy : 0.9000

       'Positive' Class : versicolor

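In a two-class setting, using 0.5 as the threshold probability is often suboptimal. The optimal threshold can be found after training by optimizing Kappa or Youden's J statistic (or any other preferred metric) as a function of the probability. Here is an example: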

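sapply(1:40/40, function(x){
    # out-of-fold probability of "versicolor", in original row order
    versicolor <- model_rf$pred[order(model_rf$pred$rowIndex), "versicolor"]
    class <- ifelse(versicolor >= x, "versicolor", "virginica")
    # confusionMatrix() expects factors with the same levels as the reference
    class <- factor(class, levels = levels(iris_2$Species))
    mat <- confusionMatrix(class, iris_2$Species)
    kappa <- mat$overall[2]
    res <- data.frame(prob = x, kappa = kappa)
    return(res)
})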

Here the highest Kappa is obtained not at threshold == 0.5 but at 0.1. This should be used with care, because it can lead to overfitting.
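The same scan can be done for Youden's J (sensitivity + specificity - 1). The sketch below uses only base R on made-up values, so it runs without a fitted model; `truth`, `probs`, and `youden_j` are hypothetical names, and in practice `probs` would be the out-of-fold "versicolor" probabilities taken from model_rf$pred.

```r
# Toy data (hypothetical): true classes and predicted P(versicolor)
truth <- factor(c("versicolor", "versicolor", "versicolor", "versicolor",
                  "virginica", "virginica", "virginica", "virginica"),
                levels = c("versicolor", "virginica"))
probs <- c(0.90, 0.80, 0.60, 0.35, 0.45, 0.30, 0.20, 0.10)

# Youden's J = sensitivity + specificity - 1, first level taken as positive
youden_j <- function(threshold, probs, truth, positive = levels(truth)[1]) {
  pred_pos <- probs >= threshold
  is_pos   <- truth == positive
  sensitivity <- sum(pred_pos & is_pos) / sum(is_pos)
  specificity <- sum(!pred_pos & !is_pos) / sum(!is_pos)
  sensitivity + specificity - 1
}

# Evaluate J on the same grid of thresholds as the Kappa scan
thresholds <- 1:40 / 40
j_values <- vapply(thresholds, youden_j, numeric(1),
                   probs = probs, truth = truth)

# Threshold with the largest J (the first one, if several tie)
best <- data.frame(prob = thresholds, J = j_values)[which.max(j_values), ]
best
```

As with Kappa, a threshold chosen this way is tuned to the data it was computed on, so it is safer to confirm it on held-out data.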

Source

2017-10-18 20:48:33 missuse

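Thanks. Just one question: in this code, the confusion matrix built from pred is only useful when the training data is the dataset. I think for pred I need to use the testing dataset. When I tried the code with testing$Class, it gave this error: Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length – Mike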

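This code produces the confusion matrix for the cross-validation folds. Since cross-validation is performed on the training set, it applies only to the training set. To get the confusion matrix on the test set, you must first predict the classes of the test set samples and then compare them with the true classes using the `confusionMatrix` function. – missuse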

You can try this to produce the confusion matrix and check the accuracy:

m <- table(class_log, testing[["Class"]]) 
m #confusion table 

#Accuracy 
(sum(diag(m)))/nrow(testing)
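One caveat with the `table` approach: `diag(m)` only counts correct predictions when rows and columns list the classes in the same order, and that is not guaranteed when one class is never predicted. A minimal sketch on made-up vectors, where `predicted` and `actual` are hypothetical stand-ins for `class_log` and `testing[["Class"]]`:

```r
# Hypothetical stand-ins for class_log and testing[["Class"]]
actual    <- c("YES", "NO", "NO", "YES", "NO", "NO")
predicted <- c("NO",  "NO", "NO", "NO",  "NO", "NO")  # "YES" is never predicted

# Use the same levels on both sides so the table is always 2 x 2
lv <- c("NO", "YES")
m <- table(Predicted = factor(predicted, levels = lv),
           Actual    = factor(actual, levels = lv))
m

# Accuracy: correct predictions (the diagonal) over all cases
accuracy <- sum(diag(m)) / sum(m)
accuracy
```

Without the explicit levels, the table here would have a single row, and `diag(m)` would no longer line up with the correct predictions.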

Source

2017-10-18 17:48:39

Thanks, but I get an error when running the class_log part. I edited my question. – Mike

The code block class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") is an if-else statement that performs the following test:

In the first column of model_rf, if the number is greater than 0.50, return "YES", else return "NO", and save the results in object class_log.

So the code essentially creates a character vector of class labels, "YES" and "NO", based on a numeric vector.
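The error in the question comes from the object being indexed, not from `ifelse` itself. A fitted `train` object is a list, not a matrix or data frame, so it has no "first column". The sketch below mimics that with an ordinary list (`fake_model` and `probs` are hypothetical), then shows `ifelse` applied to a numeric vector as intended:

```r
# A plain list standing in for a fitted model object (hypothetical)
fake_model <- list(method = "rf", results = data.frame(mtry = 2))

# Two-dimensional indexing on a list fails; capture the message
msg <- tryCatch(fake_model[, 1], error = function(e) conditionMessage(e))
msg

# ifelse() wants a numeric vector, e.g. predicted probabilities of "YES"
probs <- c(0.10, 0.70, 0.50, 0.95)
class_log <- ifelse(probs > 0.50, "YES", "NO")
class_log
```

Note that a probability of exactly 0.50 maps to "NO" here, because the comparison is strict.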

Source

2017-10-18 18:01:11 jsb

You need to apply your model to the test set.

prediction.rf <- predict(model_rf, testing, type = "prob")

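Then do class_log <- ifelse(prediction.rf[, "YES"] > 0.50, "YES", "NO"). With type = "prob", prediction.rf is a data frame with one probability column per class, so the "YES" column has to be selected before thresholding.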

Source

2017-10-18 18:30:30

Thanks. Does the class_log code work for a binary Y/N response class? – Mike

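`prediction.rf` will contain the actual probabilities (note `type = "prob"`). You can also pass `type = "raw"` to get the binary classes right away, but that will not let you control the threshold. See `?predict.train` –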
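Putting the thread together, here is a sketch of the test-set workflow on the two-class iris data used in the answer above. It assumes the caret and randomForest packages are installed. The seed, the 70/30 split, and thresholding the "versicolor" column are illustrative assumptions; with the data from the question, the column and labels would be "YES" and "NO".

```r
library(caret)

set.seed(42)  # arbitrary seed, only for reproducibility
data(iris)
iris_2 <- iris[iris$Species != "setosa", ]
iris_2$Species <- factor(iris_2$Species)  # drop the unused level

# Hold out 30% of the rows as a test set
in_train <- createDataPartition(iris_2$Species, p = 0.7, list = FALSE)
training <- iris_2[in_train, ]
testing  <- iris_2[-in_train, ]

model_rf <- train(Species ~ ., data = training, method = "rf",
                  tuneLength = 3,
                  trControl = trainControl(method = "cv", number = 5))

# Class probabilities for the test rows: a data frame, one column per class
prediction.rf <- predict(model_rf, testing, type = "prob")

# Threshold one probability column, then convert to a factor whose
# levels match the reference
class_log <- ifelse(prediction.rf[, "versicolor"] > 0.50,
                    "versicolor", "virginica")
class_log <- factor(class_log, levels = levels(testing$Species))

# Confusion matrix on the held-out test set
p <- confusionMatrix(class_log, testing$Species)
p$overall["Accuracy"]
```

Because the predictions and the reference both come from `testing`, they have the same length, which avoids the "all arguments must have the same length" error mentioned in the comments.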
