2017-03-04

Hi everyone, I'm trying to search for the best random forest parameters with a for loop in R, but the results confuse me. The two code blocks below should produce the same result, since the parameter "mtry" is the same in both.

 gender Partner tenure Churn 
3521  Male  No 0.992313 Yes 
2525.1 Male  No 4.276666 No 
567  Male  Yes 2.708050 No 
8381 Female  No 4.202127 Yes 
6258 Female  No 0.000000 Yes 
6569  Male  Yes 2.079442 No 
27410 Female  No 1.550804 Yes 
6429 Female  No 1.791759 Yes 
412 Female  Yes 3.828641 No 
4655 Female  Yes 3.737670 No 

RFModel = randomForest(Churn ~ ., 
        data = ggg, 
        ntree = 30, 
        mtry = 2, 
        importance = TRUE, 
        replace = FALSE) 
print(RFModel$confusion) 

    No Yes class.error 
No 4 1   0.2 
Yes 1 4   0.2 

for(i in c(2)){ 
    RFModel = randomForest(Churn ~ ., 
        data = Trainingds, 
        ntree = 30, 
        mtry = i, 
        importance = TRUE, 
        replace = FALSE) 
    print(RFModel$confusion) 
} 

    No Yes class.error 
No 3 2   0.4 
Yes 2 3   0.4 

 1. Code 1 and Code 2 should produce the same output.

Isn't `randomForest` random by design? –

Answer


You will get slightly different results each time because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame and randomly selects mtry columns from the resampled data frame to build the tree. If you want models built with the same parameters (e.g., mtry, ntree) to give identical results every time, you need to set a random seed.

For example, let's run randomForest 10 times and check the mean of the mean squared error for each run. Note that the mean MSE is different every time:

library(randomForest) 

replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse)) 
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021 

If you run the code above, you will get another 10 values that differ from the ones shown.

If you want to be able to reproduce a given model run with the same parameters (e.g., mtry, ntree), you can set a random seed. For example:

set.seed(5) 
mean(randomForest(mpg ~ ., data=mtcars)$mse) 
[1] 6.017737 

You will get the same result whenever you use the same seed value, but different results with different seeds. Using a larger value of ntree will reduce, but not eliminate, the variation between model runs.
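Applied to the question's own code, re-seeding immediately before each fit should make Code 1 and the loop in Code 2 match. This is a sketch rather than tested code: it assumes a data frame `ggg` with a factor column `Churn`, as shown in the question, and the seed value 42 is arbitrary:

```r
library(randomForest)

# Code 1, seeded
set.seed(42)
m.single <- randomForest(Churn ~ ., data = ggg, ntree = 30, mtry = 2,
                         importance = TRUE, replace = FALSE)

# Code 2, with the seed reset inside the loop body so each fit
# starts from the same random state
for (i in c(2)) {
  set.seed(42)
  m.loop <- randomForest(Churn ~ ., data = ggg, ntree = 30, mtry = i,
                         importance = TRUE, replace = FALSE)
}

identical(m.single$confusion, m.loop$confusion)  # should be TRUE
```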

Update: When I run the code with the data sample you provided, I do not always get the same result each time. Even with replace=FALSE, which causes the data frame to be sampled without replacement, the columns selected for building the trees can be different each time:

> randomForest(Churn ~ ., 
+    data = ggg, 
+    ntree = 30, 
+    mtry = 2, 
+    importance = TRUE, 
+    replace = FALSE) 

Call: 
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,  importance = TRUE, replace = FALSE) 
       Type of random forest: classification 
        Number of trees: 30 
No. of variables tried at each split: 2 

     OOB estimate of error rate: 30% 
Confusion matrix: 
    No Yes class.error 
No 3 2   0.4 
Yes 1 4   0.2 
> randomForest(Churn ~ ., 
+    data = ggg, 
+    ntree = 30, 
+    mtry = 2, 
+    importance = TRUE, 
+    replace = FALSE) 

Call: 
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,  importance = TRUE, replace = FALSE) 
       Type of random forest: classification 
        Number of trees: 30 
No. of variables tried at each split: 2 

     OOB estimate of error rate: 20% 
Confusion matrix: 
    No Yes class.error 
No 4 1   0.2 
Yes 1 4   0.2 
> randomForest(Churn ~ ., 
+    data = ggg, 
+    ntree = 30, 
+    mtry = 2, 
+    importance = TRUE, 
+    replace = FALSE) 

Call: 
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2,  importance = TRUE, replace = FALSE) 
       Type of random forest: classification 
        Number of trees: 30 
No. of variables tried at each split: 2 

     OOB estimate of error rate: 30% 
Confusion matrix: 
    No Yes class.error 
No 3 2   0.4 
Yes 1 4   0.2 

Here is a similar set of results with the built-in iris data frame:

> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE, 
+    replace = FALSE) 

Call: 
randomForest(formula = Species ~ ., data = iris, ntree = 30,  mtry = 2, importance = TRUE, replace = FALSE) 
       Type of random forest: classification 
        Number of trees: 30 
No. of variables tried at each split: 2 

     OOB estimate of error rate: 3.33% 
Confusion matrix: 
      setosa versicolor virginica class.error 
setosa   50   0   0  0.00 
versicolor  0   47   3  0.06 
virginica  0   2  48  0.04 
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE, 
+    replace = FALSE) 

Call: 
randomForest(formula = Species ~ ., data = iris, ntree = 30,  mtry = 2, importance = TRUE, replace = FALSE) 
       Type of random forest: classification 
        Number of trees: 30 
No. of variables tried at each split: 2 

     OOB estimate of error rate: 4.67% 
Confusion matrix: 
      setosa versicolor virginica class.error 
setosa   50   0   0  0.00 
versicolor  0   47   3  0.06 
virginica  0   4  46  0.08 
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE, 
+    replace = FALSE) 

Call: 
randomForest(formula = Species ~ ., data = iris, ntree = 30,  mtry = 2, importance = TRUE, replace = FALSE) 
       Type of random forest: classification 
        Number of trees: 30 
No. of variables tried at each split: 2 

     OOB estimate of error rate: 6% 
Confusion matrix: 
      setosa versicolor virginica class.error 
setosa   50   0   0  0.00 
versicolor  0   47   3  0.06 
virginica  0   6  44  0.12 

You can also look at the trees generated by each model run; they will usually differ. For example, say I run the following code three times, storing the results in the objects m1, m2, and m3.

randomForest(Churn ~ ., 
      data = ggg, 
      ntree = 30, 
      mtry = 2, 
      importance = TRUE, 
      replace = FALSE) 

Now let's look at the first four trees of each model object, which I've pasted below. The output is a list. You can see that the first tree is different for each model run. The second tree is the same for the first two model runs but different for the third, and so on.

check.trees = lapply(1:4, function(i) { 
    lapply(list(m1=m1,m2=m2,m3=m3), function(model) getTree(model, i, labelVar=TRUE)) 
    }) 

check.trees 
[[1]] 
[[1]]$m1 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner 1.000000  1  <NA> 
2    4    5 gender 1.000000  1  <NA> 
3    0    0  <NA> 0.000000  -1   No 
4    0    0  <NA> 0.000000  -1  Yes 
5    6    7 tenure 2.634489  1  <NA> 
6    0    0  <NA> 0.000000  -1  Yes 
7    0    0  <NA> 0.000000  -1   No 

[[1]]$m2 
    left daughter right daughter split var split point status prediction 
1    2    3 gender 1.000000  1  <NA> 
2    0    0  <NA> 0.000000  -1  Yes 
3    4    5 tenure 1.850182  1  <NA> 
4    0    0  <NA> 0.000000  -1  Yes 
5    0    0  <NA> 0.000000  -1   No 

[[1]]$m3 
    left daughter right daughter split var split point status prediction 
1    2    3 tenure 2.249904  1  <NA> 
2    0    0  <NA> 0.000000  -1  Yes 
3    0    0  <NA> 0.000000  -1   No 


[[2]] 
[[2]]$m1 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    0    0  <NA>   0  -1  Yes 
3    0    0  <NA>   0  -1   No 

[[2]]$m2 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    0    0  <NA>   0  -1  Yes 
3    0    0  <NA>   0  -1   No 

[[2]]$m3 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    4    5 gender   1  1  <NA> 
3    0    0  <NA>   0  -1   No 
4    0    0  <NA>   0  -1  Yes 
5    0    0  <NA>   0  -1   No 


[[3]] 
[[3]]$m1 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    4    5 gender   1  1  <NA> 
3    0    0  <NA>   0  -1   No 
4    0    0  <NA>   0  -1  Yes 
5    0    0  <NA>   0  -1  Yes 

[[3]]$m2 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    0    0  <NA>   0  -1  Yes 
3    0    0  <NA>   0  -1   No 

[[3]]$m3 
    left daughter right daughter split var split point status prediction 
1    2    3 tenure 2.129427  1  <NA> 
2    0    0  <NA> 0.000000  -1  Yes 
3    0    0  <NA> 0.000000  -1   No 


[[4]] 
[[4]]$m1 
    left daughter right daughter split var split point status prediction 
1    2    3 tenure 1.535877  1  <NA> 
2    0    0  <NA> 0.000000  -1  Yes 
3    4    5 tenure 4.015384  1  <NA> 
4    0    0  <NA> 0.000000  -1   No 
5    6    7 tenure 4.239396  1  <NA> 
6    0    0  <NA> 0.000000  -1  Yes 
7    0    0  <NA> 0.000000  -1   No 

[[4]]$m2 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    0    0  <NA>   0  -1  Yes 
3    0    0  <NA>   0  -1   No 

[[4]]$m3 
    left daughter right daughter split var split point status prediction 
1    2    3 Partner   1  1  <NA> 
2    0    0  <NA>   0  -1  Yes 
3    0    0  <NA>   0  -1   No 
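To quantify how much two unseeded runs differ, one could compare corresponding trees directly rather than eyeballing the output. A small sketch, assuming `m1` and `m2` were fitted as described above:

```r
library(randomForest)

# TRUE where tree i has identical structure in both model runs
same.tree <- sapply(1:30, function(i)
  identical(getTree(m1, i, labelVar = TRUE),
            getTree(m2, i, labelVar = TRUE)))
sum(same.tree)  # typically less than 30 when no seed is set
```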

But if I run the first code block 10 times, I get the same confusion matrix. – Frasher


Please provide some sample data that works with your code and reproduces the problem you are seeing. Use `dput` to supply the data sample. – eipi10


You are completely right, and I really appreciate your response. Once I added set.seed() before the first line of Code 1, and inside the loop in Code 2, I got the same results. Thank you very much. – Frasher