You get slightly different results each time because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame and randomly selects mtry columns from the resampled data frame to construct the tree. If you want a model built with the same parameters (e.g., mtry, ntree) to give the same results every time, you need to set a random seed.
For example, here we run randomForest 10 times and check the mean of the mean squared error for each run. Note that the mean MSE is different each time:
library(randomForest)
replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021
If you run the code above, you will get another 10 values that differ from the ones above.
If you want to be able to reproduce a given model run with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:
set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737
You will get the same result if you use the same seed value, but different results otherwise. Using a larger value of ntree will reduce, but not eliminate, the variation between model runs.
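To illustrate, here is a minimal sketch showing that setting the same seed immediately before each run reproduces the MSE values exactly (using the built-in mtcars data, as above; the seed value 5 is just an example):

```r
library(randomForest)

# Run the replicate twice with the same seed set before each run
set.seed(5)
run1 <- replicate(10, mean(randomForest(mpg ~ ., data = mtcars)$mse))

set.seed(5)
run2 <- replicate(10, mean(randomForest(mpg ~ ., data = mtcars)$mse))

identical(run1, run2)  # TRUE: same seed, same sequence of results
```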
Update: When I ran the code with the data sample you provided, I did not always get the same results on each run. Even with replace=FALSE, which causes the data frame to be sampled without replacement, the columns selected to build each tree can be different each time:
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
No Yes class.error
No 3 2 0.4
Yes 1 4 0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 20%
Confusion matrix:
No Yes class.error
No 4 1 0.2
Yes 1 4 0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 30%
Confusion matrix:
No Yes class.error
No 3 2 0.4
Yes 1 4 0.2
Here is a similar set of runs, with results, using the built-in iris data frame:
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 3.33%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 2 48 0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 4.67%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 4 46 0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)
Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
OOB estimate of error rate: 6%
Confusion matrix:
setosa versicolor virginica class.error
setosa 50 0 0 0.00
versicolor 0 47 3 0.06
virginica 0 6 44 0.12
You can also look at the trees generated by each model run; they will usually differ. For example, suppose I run the following code three times, storing the results in objects m1, m2, and m3.
randomForest(Churn ~ .,
data = ggg,
ntree = 30,
mtry = 2,
importance = TRUE,
replace = FALSE)
Now let's look at the first four trees from each model object, which I've pasted below. The output is a list. You can see that the first tree differs across the three model runs. The second tree is the same for the first two model runs, but different for the third, and so on.
check.trees = lapply(1:4, function(i) {
lapply(list(m1=m1,m2=m2,m3=m3), function(model) getTree(model, i, labelVar=TRUE))
})
check.trees
[[1]]
[[1]]$m1
left daughter right daughter split var split point status prediction
1 2 3 Partner 1.000000 1 <NA>
2 4 5 gender 1.000000 1 <NA>
3 0 0 <NA> 0.000000 -1 No
4 0 0 <NA> 0.000000 -1 Yes
5 6 7 tenure 2.634489 1 <NA>
6 0 0 <NA> 0.000000 -1 Yes
7 0 0 <NA> 0.000000 -1 No
[[1]]$m2
left daughter right daughter split var split point status prediction
1 2 3 gender 1.000000 1 <NA>
2 0 0 <NA> 0.000000 -1 Yes
3 4 5 tenure 1.850182 1 <NA>
4 0 0 <NA> 0.000000 -1 Yes
5 0 0 <NA> 0.000000 -1 No
[[1]]$m3
left daughter right daughter split var split point status prediction
1 2 3 tenure 2.249904 1 <NA>
2 0 0 <NA> 0.000000 -1 Yes
3 0 0 <NA> 0.000000 -1 No
[[2]]
[[2]]$m1
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 0 0 <NA> 0 -1 Yes
3 0 0 <NA> 0 -1 No
[[2]]$m2
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 0 0 <NA> 0 -1 Yes
3 0 0 <NA> 0 -1 No
[[2]]$m3
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 4 5 gender 1 1 <NA>
3 0 0 <NA> 0 -1 No
4 0 0 <NA> 0 -1 Yes
5 0 0 <NA> 0 -1 No
[[3]]
[[3]]$m1
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 4 5 gender 1 1 <NA>
3 0 0 <NA> 0 -1 No
4 0 0 <NA> 0 -1 Yes
5 0 0 <NA> 0 -1 Yes
[[3]]$m2
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 0 0 <NA> 0 -1 Yes
3 0 0 <NA> 0 -1 No
[[3]]$m3
left daughter right daughter split var split point status prediction
1 2 3 tenure 2.129427 1 <NA>
2 0 0 <NA> 0.000000 -1 Yes
3 0 0 <NA> 0.000000 -1 No
[[4]]
[[4]]$m1
left daughter right daughter split var split point status prediction
1 2 3 tenure 1.535877 1 <NA>
2 0 0 <NA> 0.000000 -1 Yes
3 4 5 tenure 4.015384 1 <NA>
4 0 0 <NA> 0.000000 -1 No
5 6 7 tenure 4.239396 1 <NA>
6 0 0 <NA> 0.000000 -1 Yes
7 0 0 <NA> 0.000000 -1 No
[[4]]$m2
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 0 0 <NA> 0 -1 Yes
3 0 0 <NA> 0 -1 No
[[4]]$m3
left daughter right daughter split var split point status prediction
1 2 3 Partner 1 1 <NA>
2 0 0 <NA> 0 -1 Yes
3 0 0 <NA> 0 -1 No
Isn't a 'randomForest' supposed to give random results? –