2016-06-01 117 views
3

Spark random forest binary classifier metrics

How can we obtain model metrics (F-score, AUROC, AUPRC, etc.) when training a random forest binary classifier in Spark MLlib?

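The problem is that BinaryClassificationMetrics expects probabilities, whereas the random forest classifier's predict method returns discrete values, 0 or 1.

See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification

The model returned by RandomForest.trainClassifier has no clearThreshold method that would make it return probabilities instead of discrete 0 or 1 labels.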
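
For context, the F-score on its own can be computed from hard 0/1 predictions at a single operating point; it is the curve-based metrics (AUROC, AUPRC) that need a continuous score. A minimal plain-Scala sketch of that calculation follows. The helper name `f1Score` is hypothetical, and it assumes 1.0 marks the positive class.

```scala
// Hypothetical helper: F1 from hard (prediction, label) pairs,
// treating 1.0 as the positive class and 0.0 as the negative class.
def f1Score(predictionAndLabels: Seq[(Double, Double)]): Double = {
  val tp = predictionAndLabels.count { case (p, l) => p == 1.0 && l == 1.0 }
  val fp = predictionAndLabels.count { case (p, l) => p == 1.0 && l == 0.0 }
  val fn = predictionAndLabels.count { case (p, l) => p == 0.0 && l == 1.0 }

  // Guard against division by zero when a class is never predicted or never present.
  val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
  val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)

  if (precision + recall == 0.0) 0.0
  else 2 * precision * recall / (precision + recall)
}
```

With an RDD of (prediction, label) pairs, the same counts could be gathered after a `collect()` on small data, or with `filter(...).count()` on larger data.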

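+2

Possible duplicate of [Spark 1.5.1, MLLib random forest probability](http://stackoverflow.com/questions/33401437/spark-1-5-1-mllib-random-forest-probability) – eliasah

+0

@eliasah It is not actually a duplicate question, but the answer there does provide a solution to this problem. I had already added that to my answer before you commented.

+0

No worries, no problem! Hence the word "possible". – eliasah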

Answer

2

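To get probabilities, we need to use the new DataFrame-based API, ml, rather than the RDD-based mllib API.

Update

Below is an example, adapted from the Spark documentation, that uses BinaryClassificationEvaluator and prints two metrics: Area Under Receiver Operating Characteristic (AUROC) and Area Under Precision Recall Curve (AUPRC).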

import org.apache.spark.ml.Pipeline 
import org.apache.spark.ml.classification.RandomForestClassifier 
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator 
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer} 

// Load and parse the data file, converting it to a DataFrame. 
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt") 

// Index labels, adding metadata to the label column. 
// Fit on whole dataset to include all labels in index. 
val labelIndexer = new StringIndexer() 
    .setInputCol("label") 
    .setOutputCol("indexedLabel") 
    .fit(data) 

// Automatically identify categorical features, and index them. 
// Set maxCategories so features with > 4 distinct values are treated as continuous. 
val featureIndexer = new VectorIndexer() 
    .setInputCol("features") 
    .setOutputCol("indexedFeatures") 
    .setMaxCategories(4) 
    .fit(data) 

// Split the data into training and test sets (30% held out for testing) 
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3)) 

// Train a RandomForest model. 
val rf = new RandomForestClassifier() 
    .setLabelCol("indexedLabel") 
    .setFeaturesCol("indexedFeatures") 
    .setNumTrees(10) 

// Convert indexed labels back to original labels. 
val labelConverter = new IndexToString() 
    .setInputCol("prediction") 
    .setOutputCol("predictedLabel") 
    .setLabels(labelIndexer.labels) 

// Chain indexers and forest in a Pipeline 
val pipeline = new Pipeline() 
    .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter)) 

// Train model. This also runs the indexers. 
val model = pipeline.fit(trainingData) 

// Make predictions. 
val predictions = model.transform(testData) 

// Select example rows to display. 
predictions 
    .select("indexedLabel", "rawPrediction", "prediction") 
    .show() 

val binaryClassificationEvaluator = new BinaryClassificationEvaluator() 
    .setLabelCol("indexedLabel") 
    .setRawPredictionCol("rawPrediction") 

def printlnMetric(metricName: String): Unit = { 
    println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions)) 
} 

printlnMetric("areaUnderROC") 
printlnMetric("areaUnderPR") 
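
BinaryClassificationEvaluator only supports areaUnderROC and areaUnderPR, so the F-score mentioned in the question needs another route. One possible approach, sketched here under assumptions, is to pass the positive-class probability from the ml pipeline to mllib's BinaryClassificationMetrics. The helper names `scoreAndLabels` and `printThresholdMetrics` are hypothetical. The sketch also assumes Spark 1.x, where the "probability" column holds an `org.apache.spark.mllib.linalg.Vector`; on Spark 2.x and later the column type is `org.apache.spark.ml.linalg.Vector`, so the import would need to change.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper: pair the positive-class probability with the indexed label.
// Assumes Spark 1.x column types (see the note above).
def scoreAndLabels(predictions: DataFrame): RDD[(Double, Double)] =
  predictions
    .select("probability", "indexedLabel")
    .rdd
    .map { row =>
      // Element 1 is the probability of the class whose indexed label is 1.0.
      // Which original label that is depends on StringIndexer's frequency ordering.
      (row.getAs[Vector](0)(1), row.getDouble(1))
    }

// Hypothetical helper: print the F1 score per threshold plus both curve areas.
def printThresholdMetrics(pairs: RDD[(Double, Double)]): Unit = {
  val metrics = new BinaryClassificationMetrics(pairs)
  metrics.fMeasureByThreshold().collect().foreach { case (threshold, f1) =>
    println(s"threshold = $threshold, F1 = $f1")
  }
  println("areaUnderROC = " + metrics.areaUnderROC())
  println("areaUnderPR = " + metrics.areaUnderPR())
}
```

With the `predictions` DataFrame from the example above, this would be called as `printThresholdMetrics(scoreAndLabels(predictions))`.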

+2

It would be very helpful if the downvoters explained their reasons, so that the answer can be improved.
