2015-10-19 134 views
1

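I have more than 10 features and ten thousand cases to train a logistic regression that classifies people's ethnicity. The first example is French vs. non-French, and the second is English vs. non-English. The results are below. How should I interpret this triangular ROC AUC curve?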

////////////////////////////////////////////////////// 

1= fr 
0= non-fr 
Class count: 
0 69109 
1 30891 
dtype: int64 
Accuracy: 0.95126 
Classification report: 
      precision recall f1-score support 

      0  0.97  0.96  0.96  34547 
      1  0.92  0.93  0.92  15453 

avg/total  0.95  0.95  0.95  50000 

Confusion matrix: 
[[33229 1318] 
[ 1119 14334]] 
AUC= 0.944717975754 

////////////////////////////////////////////////////// 

1= en 
0= non-en 
Class count: 
0 76125 
1 23875 
dtype: int64 
Accuracy: 0.7675 
Classification report: 
      precision recall f1-score support 

      0  0.91  0.78  0.84  38245 
      1  0.50  0.74  0.60  11755 

avg/total  0.81  0.77  0.78  50000 

Confusion matrix: 
[[29677 8568] 
[ 3057 8698]] 
AUC= 0.757955582999 

////////////////////////////////////////////////////// 

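However, I am getting some very strange-looking AUC curves with a triangular shape instead of a jagged, rounded curve. Is there any explanation for why I get this shape? Have I made a mistake somewhere?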


Code:

import matplotlib.pyplot as plt
from sklearn.metrics import auc, classification_report, roc_curve
# sk_confusion_matrix is assumed to be an alias for sklearn's confusion_matrix
from sklearn.metrics import confusion_matrix as sk_confusion_matrix

# dv, lr, df, y and the my_dict* lists are defined earlier (not shown)

# Merge the 16 per-sample feature dicts into one dict per sample
feature_dicts = (my_dict, my_dict2, my_dict3, my_dict4,
                 my_dict5, my_dict6, my_dict7, my_dict8,
                 my_dict9, my_dict10, my_dict11, my_dict12,
                 my_dict13, my_dict14, my_dict15, my_dict16)
all_dict = []
for i in range(len(my_dict)):
    temp_dict = {}
    for d in feature_dicts:
        temp_dict.update(d[i])
    all_dict.append(temp_dict)

newX = dv.fit_transform(all_dict)

# Separate the training and testing data sets (first half / second half)
half_cut = -(len(df) // 2)
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]

# Fit the model on the training data
lr.fit(X_train, y_train)

# Predict class labels for the training and testing data
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)

#print((y_train_predictions == y_train).sum().astype(float) / y_train.shape[0])
print('Accuracy:', (y_test_predictions == y_test).sum().astype(float) / y_test.shape[0])

print('Classification report:')
print(classification_report(y_test, y_test_predictions))
#print(sk_confusion_matrix(y_train, y_train_predictions))
print('Confusion matrix:')
print(sk_confusion_matrix(y_test, y_test_predictions))

#print(y_test[1:20])
#print(y_test_predictions[1:20])
#print(np.bincount(y_test))
#print(np.bincount(y_test_predictions))

# Find and plot AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('AUC=', roc_auc)

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
+0

So where is the code that plots the ROC curve? – cel

+0

Added to the original post. – KubiK888

Answers

4

You are calling it incorrectly. According to the documentation:

y_score : array, shape = [n_samples] 

    Target scores, can either be probability estimates of the positive class or confidence values. 

So, in this line:

roc_curve(y_test, y_test_predictions) 

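you should pass roc_curve the result of decision_function (or the positive-class column of the predict_proba result, i.e. predict_proba(X_test)[:, 1]) instead of the actual predictions. With hard 0/1 predictions there is only a single operating point between (0, 0) and (1, 1), which is why the curve comes out as a triangle.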
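As a minimal sketch of that change, the snippet below fits a logistic regression on synthetic data and calls roc_curve both ways. The data from make_classification, the 50/50 split, and the max_iter setting are assumptions made only so the example runs on its own; they are not the asker's features or settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (assumption, not the real features)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard 0/1 labels: a single operating point, so the curve has three points
fpr_hard, tpr_hard, _ = roc_curve(y_test, lr.predict(X_test))

# Continuous scores: many candidate thresholds, so many points on the curve
y_score = lr.predict_proba(X_test)[:, 1]  # estimated probability of class 1
fpr, tpr, _ = roc_curve(y_test, y_score)
# lr.decision_function(X_test) can be passed as y_score in the same way

print('points on curve:', len(fpr_hard), 'vs', len(fpr))
print('AUC:', auc(fpr_hard, tpr_hard), 'vs', auc(fpr, tpr))
```

In the question's code this would mean computing the scores from lr on X_test and passing those to roc_curve, while still using lr.predict for the accuracy, classification report, and confusion matrix.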

Take a look at these examples: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
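
One way to sanity-check this diagnosis uses only the numbers posted in the question. If roc_curve only ever saw hard 0/1 labels, the curve is the polyline (0, 0), (FPR, TPR), (1, 1), and its trapezoidal area should reproduce the printed AUC. The helper name three_point_auc is made up for this illustration, and the confusion matrices are read in scikit-learn's [[TN, FP], [FN, TP]] layout.

```python
def three_point_auc(confusion):
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1)."""
    (tn, fp), (fn, tp) = confusion
    fpr = fp / (fp + tn)  # false positive rate
    tpr = tp / (tp + fn)  # true positive rate (recall of class 1)
    xs = [0.0, fpr, 1.0]
    ys = [0.0, tpr, 1.0]
    # Trapezoidal rule over the two line segments
    area = sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2 for i in range(2))
    return fpr, tpr, area

# Confusion matrices copied from the question's output
fr = [[33229, 1318], [1119, 14334]]
en = [[29677, 8568], [3057, 8698]]

for name, cm in (('fr', fr), ('en', en)):
    fpr, tpr, area = three_point_auc(cm)
    print(name, 'FPR=%.4f TPR=%.4f area=%.6f' % (fpr, tpr, area))
```

Both areas agree with the AUC values printed in the question (about 0.9447 and 0.7580). The area simplifies to (1 + TPR - FPR) / 2, so the reported "AUC" carries no more information than the confusion matrix, which is consistent with the curve having been built from a single operating point.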