2016-08-25

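Recently, when I tried to run predictions on my input with the CLI version of xgboost, I found that its results differ greatly from those of the Python version. Why do I get different prediction results between the Python and CLI versions of XGBoost?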

With Python, I predict like this:

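import xgboost as xgb

data = xgb.DMatrix(X)  # X: dense feature matrix (rows shown below)
bst = xgb.Booster()
bst.load_model(modelfile)  # same model file as passed to the CLI
# pred_leaf=False (the default) returns probabilities, not leaf indices
leafindex = bst.predict(data, pred_leaf=False)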

And with the CLI as follows:

./xgboost xgboost.conf task=pred model_in=../models/gb.model_depth4_150trees_2016-07-02 

Here is my configuration file:

# General Parameters, see comment for each definition 
# can be gbtree or gblinear 
booster = gbtree 
# choose logistic regression loss function for binary classification 
objective = binary:logistic 

# Tree Booster Parameters 
# step size shrinkage 
eta = 1.0 
# minimum loss reduction required to make a further partition 
gamma = 1.0 
# minimum sum of instance weight(hessian) needed in a child 
min_child_weight = 1 
# maximum depth of a tree 
max_depth = 4 

# Task Parameters 
# the number of round to do boosting 
num_round = 150 
# 0 means do not save any model except the final round model 
save_period = 0 
# The path of training data 
data = "agaricus.txt.train" 
# The path of validation data, used to monitor training process, here [test] sets name of the validation set 
eval[test] = "agaricus.txt.test" 
# The path of test data 
test:data = "data" 

Input data format for Python:

8  201  1  2  26  10000.0 8589934592  32  0  0  1000000.0  0 
2  3  1  1  50  10000.0 8589934592  32  524288 8  1000000.0  0 
2  3  2  2  19  10000.0 8589934592  512  512  8  1000000.0  0 
4  24  1  1  23  10000.0 8589934592  8192 0  0  1000000.0  0 
1  2  2  3  50  10000.0 8589934592  32  512  8  1000000.0  0 
21  1  2  3  48  10000.0 8589934592  32  512  8  1000000.0  0 
5  12  1  2  42  10000.0 137438953472 32  512  8  1000000.0  0 
2  11  2  2  86  10000.0 0  0  0  0  1000000.0  0 
1  10  2  8  99  10000.0 8589934592  32  65536 8  1000000.0  0 
2  11  2  8  97  10000.0 8589934592  32  65536 8  1000000.0  0 
4  5  1  1  4  10000.0 1073741824  32  0  0  1000000.0  0 
... 

Input format for the CLI:

0 1:8 2:201 3:1 4:2 5:26 6:10000.0 7:8589934592 8:32 9:0 10:0 11:1000000.0 12:0 
0 1:2 2:3 3:1 4:1 5:50 6:10000.0 7:8589934592 8:32 9:524288 10:8 11:1000000.0 12:0 
0 1:2 2:3 3:2 4:2 5:19 6:10000.0 7:8589934592 8:512 9:512 10:8 11:1000000.0 12:0 
0 1:4 2:24 3:1 4:1 5:23 6:10000.0 7:8589934592 8:8192 9:0 10:0 11:1000000.0 12:0 
0 1:1 2:2 3:2 4:3 5:50 6:10000.0 7:8589934592 8:32 9:512 10:8 11:1000000.0 12:0 
0 1:21 2:1 3:2 4:3 5:48 6:10000.0 7:8589934592 8:32 9:512 10:8 11:1000000.0 12:0 
0 1:5 2:12 3:1 4:2 5:42 6:10000.0 7:137438953472 8:32 9:512 10:8 11:1000000.0 12:0 
... 
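
The relationship between the two input formats above can be sketched as follows. This is only an illustration: `dense_row_to_libsvm` and its `label` and `base` parameters are hypothetical names, not part of xgboost, and the sketch merely reproduces the 1-based `index:value` layout visible in the CLI sample. Note that a dense array passed to `xgb.DMatrix` is indexed by column position starting from 0, so it may be worth checking whether the 1-based numbering in the CLI file lines up with the columns the model was trained on; that is an assumption to verify, not a confirmed cause.

```python
def dense_row_to_libsvm(values, label=0, base=1):
    """Format one dense feature row as a LibSVM-style line.

    base is the index assigned to the first column; the CLI sample uses 1.
    (Hypothetical helper for illustration, not an xgboost function.)
    """
    pairs = (f"{i}:{v}" for i, v in enumerate(values, start=base))
    return f"{label} " + " ".join(pairs)


# First row of the Python input from the question, kept as strings so the
# values are written out exactly as they appear there.
row = "8 201 1 2 26 10000.0 8589934592 32 0 0 1000000.0 0".split()
print(dense_row_to_libsvm(row))
```

Calling the helper with `base=0` instead would write the same values with 0-based indices, which makes it easy to compare both numberings against the Python predictions.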

Results from the Python version:

0.138298 
0.00288907 
0.0114002 
0.0477143 
0.00185653 
0.00455882 
0.000503023 
0.000817317 
0.00332584 
0.00178041 
0.0666806 
0.03003 
... 

Results from the CLI version:

0.000100178 
0.201246 
0.449562 
0.0506984 
0.451953 
0.389587 
0.034748 
0.992795 
0.00348666 
0.00661674 
0.0186095 
0.0260032 
0.996163 
0.259104 
0.552341 
0.972762 
... 

I used the same model file for both, yet with the CLI version 40% of the values are above 0.5, which is not what we expect.

Answer


Solved!

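It seems that model files trained by Python and by the CLI are not interchangeable. Even when each version uses a model it trained itself, the results still differ slightly, as shown here: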

by python  by cli 
0.169874  0.222063 
0.999997  0.999554 
0.00454239  0.000879413 
0.0140518  0.00824018 
0.0148116  0.00859811 
0.000353913  0.000880754 
0.0207635  0.019058 
0.000916939  0.000579058 
0.00109237  0.000286653 
0.00247333  0.00272115 
0.0650928  0.0319875 
0.946068  0.965301 
0.997704  0.999615 
0.987644  0.991665 
0.997242  0.984403 
0.948666  0.909703 
0.000781899  0.00079996 
0.000319449  0.000138011 
0.0400793  0.164134 
0.00216081  0.000781626 
0.023867  0.0323994 
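
To put a number on how far apart the two columns are, a small comparison like the one below may help. It is a minimal sketch: `compare_predictions` and the 0.5 `threshold` default are illustrative choices, not part of xgboost, and only the first five rows of the table above are copied in.

```python
def compare_predictions(a, b, threshold=0.5):
    """Return (max absolute difference, number of rows whose class differs)."""
    if len(a) != len(b):
        raise ValueError("prediction lists must have the same length")
    max_diff = max(abs(x - y) for x, y in zip(a, b))
    # A row "flips" when the two predictions fall on opposite sides of threshold.
    flips = sum((x > threshold) != (y > threshold) for x, y in zip(a, b))
    return max_diff, flips


# First five rows of the table above (by python, by cli).
by_python = [0.169874, 0.999997, 0.00454239, 0.0140518, 0.0148116]
by_cli = [0.222063, 0.999554, 0.000879413, 0.00824018, 0.00859811]

max_diff, flips = compare_predictions(by_python, by_cli)
print(f"max abs diff = {max_diff:.6f}, class flips at 0.5 = {flips}")
```

Counting class flips alongside the raw difference separates small numerical drift from disagreements that would change the predicted label.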