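Recently, when I tried to run predictions with the CLI version of xgboost, I found that its results differ greatly from those of the Python version. Why do I get different predictions from the Python and CLI versions of XGBoost?
In Python, I predict like this: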
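import xgboost as xgb
# X holds the dense feature rows; modelfile is the path to the saved model
data = xgb.DMatrix(X)
bst = xgb.Booster()
bst.load_model(modelfile)
# pred_leaf=False (the default) returns predicted values, not leaf indices
leafindex = bst.predict(data, pred_leaf=False)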
And with the CLI, like this:
./xgboost xgboost.conf task=pred model_in=../models/gb.model_depth4_150trees_2016-07-02
Here is my configuration file:
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 4
# Task Parameters
# the number of round to do boosting
num_round = 150
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "data"
Input data format for Python:
8 201 1 2 26 10000.0 8589934592 32 0 0 1000000.0 0
2 3 1 1 50 10000.0 8589934592 32 524288 8 1000000.0 0
2 3 2 2 19 10000.0 8589934592 512 512 8 1000000.0 0
4 24 1 1 23 10000.0 8589934592 8192 0 0 1000000.0 0
1 2 2 3 50 10000.0 8589934592 32 512 8 1000000.0 0
21 1 2 3 48 10000.0 8589934592 32 512 8 1000000.0 0
5 12 1 2 42 10000.0 137438953472 32 512 8 1000000.0 0
2 11 2 2 86 10000.0 0 0 0 0 1000000.0 0
1 10 2 8 99 10000.0 8589934592 32 65536 8 1000000.0 0
2 11 2 8 97 10000.0 8589934592 32 65536 8 1000000.0 0
4 5 1 1 4 10000.0 1073741824 32 0 0 1000000.0 0
...
Input format for the CLI:
0 1:8 2:201 3:1 4:2 5:26 6:10000.0 7:8589934592 8:32 9:0 10:0 11:1000000.0 12:0
0 1:2 2:3 3:1 4:1 5:50 6:10000.0 7:8589934592 8:32 9:524288 10:8 11:1000000.0 12:0
0 1:2 2:3 3:2 4:2 5:19 6:10000.0 7:8589934592 8:512 9:512 10:8 11:1000000.0 12:0
0 1:4 2:24 3:1 4:1 5:23 6:10000.0 7:8589934592 8:8192 9:0 10:0 11:1000000.0 12:0
0 1:1 2:2 3:2 4:3 5:50 6:10000.0 7:8589934592 8:32 9:512 10:8 11:1000000.0 12:0
0 1:21 2:1 3:2 4:3 5:48 6:10000.0 7:8589934592 8:32 9:512 10:8 11:1000000.0 12:0
0 1:5 2:12 3:1 4:2 5:42 6:10000.0 7:137438953472 8:32 9:512 10:8 11:1000000.0 12:0
...
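To make the relationship between the two input formats explicit, here is a minimal sketch of how one dense row maps to one CLI line. The function name `dense_to_libsvm` and its `label` and `start_index` parameters are hypothetical and only for illustration; the defaults (label 0, indices starting at 1) simply mirror the sample lines above. `start_index` is exposed so the index base can be varied when comparing against the Python path.

```python
def dense_to_libsvm(line, label="0", start_index=1):
    """Convert one whitespace-separated dense row into a LIBSVM-style line.

    Assumption: every column is written out explicitly (zeros included),
    as in the CLI sample shown above.
    """
    values = line.split()
    # Pair each value with its feature index, counting from start_index.
    pairs = [f"{i}:{v}" for i, v in enumerate(values, start=start_index)]
    return " ".join([label] + pairs)


dense_row = "8 201 1 2 26 10000.0 8589934592 32 0 0 1000000.0 0"
print(dense_to_libsvm(dense_row))
# 0 1:8 2:201 3:1 4:2 5:26 6:10000.0 7:8589934592 8:32 9:0 10:0 11:1000000.0 12:0
```

The printed line matches the first CLI input row, so the two files carry the same values; only the feature numbering and the leading label column differ.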
Results from the Python version:
0.138298
0.00288907
0.0114002
0.0477143
0.00185653
0.00455882
0.000503023
0.000817317
0.00332584
0.00178041
0.0666806
0.03003
...
Results from the CLI version:
0.000100178
0.201246
0.449562
0.0506984
0.451953
0.389587
0.034748
0.992795
0.00348666
0.00661674
0.0186095
0.0260032
0.996163
0.259104
0.552341
0.972762
...
I used the same model file for both, yet with the CLI version 40% of the values are above 0.5, which is not what we expect.
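For reference, a small sketch of how the share of predictions above 0.5 can be computed. It uses only the truncated excerpts listed above, so the fractions it reports describe those excerpts, not the full prediction files. The helper name `fraction_above` is hypothetical.

```python
def fraction_above(values, threshold=0.5):
    """Return the fraction of values strictly greater than threshold."""
    return sum(v > threshold for v in values) / len(values)


# Excerpt of the Python predictions listed above.
python_preds = [0.138298, 0.00288907, 0.0114002, 0.0477143, 0.00185653,
                0.00455882, 0.000503023, 0.000817317, 0.00332584,
                0.00178041, 0.0666806, 0.03003]

# Excerpt of the CLI predictions listed above.
cli_preds = [0.000100178, 0.201246, 0.449562, 0.0506984, 0.451953,
             0.389587, 0.034748, 0.992795, 0.00348666, 0.00661674,
             0.0186095, 0.0260032, 0.996163, 0.259104, 0.552341, 0.972762]

print(fraction_above(python_preds))  # 0.0
print(fraction_above(cli_preds))     # 0.25
```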