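I am working through the examples in the book Applied Predictive Modeling (Max Kuhn). This one creates a calibration curve.
I think I understand the point of the curve: it checks whether the observed proportion of events matches the predicted probability. But I am struggling to understand how the Percent column of the output is calculated.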
Here is the code:
library(AppliedPredictiveModeling)
set.seed(975)
simulatedTrain <- quadBoundaryFunc(500)
simulatedTest <- quadBoundaryFunc(1000)
# Random forest
library(randomForest)
rfModel <- randomForest(class ~ X1 + X2,
                        data = simulatedTrain,
                        ntree = 2000)
rfTestPred <- predict(rfModel, simulatedTest, type = "prob")
simulatedTest$RFprob <- rfTestPred[,"Class1"]
simulatedTest$RFclass <- predict(rfModel, simulatedTest)
library(caret)
# Calibrating probabilities
calCurve <- calibration(x = class ~ RFprob, data = simulatedTest)
calCurve$data
   calibModelVar             bin  Percent     Lower     Upper Count  midpoint
1         RFprob      [0,0.0909]  4.00000  2.203804  6.620306    14  4.545455
2         RFprob  (0.0909,0.182] 20.00000 11.648215 30.832609    15 13.636364
3         RFprob   (0.182,0.273] 33.33333 20.395974 48.410832    16 22.727273
4         RFprob   (0.273,0.364] 37.20930 22.975170 53.274905    16 31.818182
5         RFprob   (0.364,0.455] 35.71429 18.640666 55.934969    10 40.909091
6         RFprob   (0.455,0.545] 53.19149 38.077789 67.888473    25 50.000000
7         RFprob   (0.545,0.636] 65.71429 47.789002 80.867590    23 59.090909
8         RFprob   (0.636,0.727] 72.50000 56.111709 85.399101    29 68.181818
9         RFprob   (0.727,0.818] 83.33333 67.188407 93.627987    30 77.272727
10        RFprob   (0.818,0.909] 95.83333 85.745903 99.491353    46 86.363636
11        RFprob       (0.909,1] 94.00000 90.296922 96.603304   235 95.454545
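So, taking the first row as an example, what does Count = 14 represent? From what I can see, there are 14 rows where the probability computed by the random forest is between 0 and 10% (rounded) and the actual class is Class1: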
nrow(simulatedTest[simulatedTest$RFprob >=0 & simulatedTest$RFprob <=0.0909 & simulatedTest$class == "Class1",])
When I plot the curve:
xyplot(calCurve, auto.key = list(columns =2))
I understand that the x-axis is midpoint, the midpoint of each bin, and that the y-axis is the Percent column. But how is the Percent column calculated?
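
One way to test a guess about Percent: the printed numbers look consistent with Percent = 100 * Count / (all test rows whose RFprob falls in the bin). For the first row, 14 / 0.04 would imply 350 rows in that bin; that total is an inference from the table, not something the output states. Below is a minimal base-R sketch of that calculation on toy data. It assumes 11 equal-width bins on [0, 1], as the bin labels suggest, and `bin_percent` is a hypothetical helper written for illustration, not a caret function and not necessarily how `calibration()` is implemented internally.

```r
# Hypothetical helper (not part of caret): per-bin event counts and percentages.
bin_percent <- function(prob, obs, event = "Class1", cuts = 11) {
  # Equal-width bins on [0, 1]; include.lowest closes the first bin at 0
  bins <- cut(prob, breaks = (0:cuts) / cuts, include.lowest = TRUE)
  total <- as.vector(table(bins))                # all rows in each bin
  count <- as.vector(table(bins[obs == event]))  # rows in the bin whose true class is the event
  data.frame(bin     = levels(bins),
             Count   = count,
             Total   = total,
             Percent = 100 * count / total)      # NaN for empty bins
}

# Toy data, made up purely for illustration
prob <- c(0.02, 0.05, 0.08, 0.50, 0.52, 0.95, 0.97, 0.99)
obs  <- c("Class2", "Class2", "Class1", "Class1", "Class2",
          "Class1", "Class1", "Class1")

res <- bin_percent(prob, obs)
print(res[res$Total > 0, ])  # show only the non-empty bins
```

Applied to the data in the question, this would be called as `bin_percent(simulatedTest$RFprob, simulatedTest$class)`, and the Count and Percent columns could then be compared against `calCurve$data`.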