2016-11-27 130 views

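Apache Spark decision tree predictions

I have the following code, which uses a decision tree for classification. I need to convert the predictions for the test data set into a Java array and print them. Can someone help me extend this code? I need a two-dimensional array of predicted labels and actual labels, and I need to print the predicted labels.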

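import java.util.HashMap; 
import java.util.Map; 

import scala.Tuple2; 

import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.api.java.function.PairFunction; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.tree.DecisionTree; 
import org.apache.spark.mllib.tree.model.DecisionTreeModel; 
import org.apache.spark.mllib.util.MLUtils; 

public class DecisionTreeClass { 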
    public static void main(String args[]){ 
     SparkConf sparkConf = new SparkConf().setAppName("DecisionTreeClass").setMaster("local[2]"); 
     JavaSparkContext jsc = new JavaSparkContext(sparkConf); 


     // Load and parse the data file. 
     String datapath = "/home/thamali/Desktop/tlib.txt"; 
     // A training example used in supervised learning is called a "labeled point" in MLlib. 
     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD(); 
     // Split the data into training and test sets (30% held out for testing) 
     JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}); 
     JavaRDD<LabeledPoint> trainingData = splits[0]; 
     JavaRDD<LabeledPoint> testData = splits[1]; 

     // Set parameters. 
     // Empty categoricalFeaturesInfo indicates all features are continuous. 
     Integer numClasses = 12; 
     Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>(); 
     String impurity = "gini"; 
     Integer maxDepth = 5; 
     Integer maxBins = 32; 

     // Train a DecisionTree model for classification. 
     final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses, 
       categoricalFeaturesInfo, impurity, maxDepth, maxBins); 

     // Evaluate model on test instances and compute test error 
     JavaPairRDD<Double, Double> predictionAndLabel = 
       testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { 
        @Override 
        public Tuple2<Double, Double> call(LabeledPoint p) { 
         return new Tuple2<>(model.predict(p.features()), p.label()); 
        } 
       }); 

     Double testErr = 
       1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { 
        @Override 
        public Boolean call(Tuple2<Double, Double> pl) { 
         return !pl._1().equals(pl._2()); 
        } 
       }).count()/testData.count(); 

     System.out.println("Test Error: " + testErr); 
     System.out.println("Learned classification tree model:\n" + model.toDebugString()); 


    } 

} 

Answer

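You basically already have exactly that in the predictionAndLabel variable. If you really need a list of double arrays, with one {prediction, label} pair per element, you can change the method you use to: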

JavaRDD<double[]> valuesAndPreds = testData.map(point -> new double[]{model.predict(point.features()), point.label()}); 

and, to get the list of double arrays, run collect on that reference:

List<double[]> values = valuesAndPreds.collect(); 
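
If a true two-dimensional array is wanted rather than a list, the collected list can be copied into a double[][] with plain Java. The following is only a minimal sketch that runs without Spark: the class name PredictionTable, the helper toTable, and the hard-coded sample rows are hypothetical stand-ins for whatever valuesAndPreds.collect() returns.

```java
import java.util.ArrayList;
import java.util.List;

class PredictionTable {

    // Copies a list of {prediction, label} rows into a double[][].
    // The row arrays themselves are shared with the list, not deep-copied.
    static double[][] toTable(List<double[]> rows) {
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for valuesAndPreds.collect().
        List<double[]> values = new ArrayList<>();
        values.add(new double[]{1.0, 1.0});
        values.add(new double[]{3.0, 2.0});

        double[][] table = toTable(values);
        // Column 0 holds the prediction, column 1 the actual label.
        System.out.println(table.length + " rows, " + table[0].length + " columns");
    }
}
```

Note that collect brings every row to the driver, so this is only reasonable when the test set fits in driver memory.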

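I would take a look at the documentation here: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html. You can also reshape your data so that it works with classes such as MulticlassMetrics, which give you other statistical performance measures for the model. This requires changing the mapToPair function to a map function and changing the generics to Object. So, something like: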

JavaRDD<Tuple2<Object, Object>> valuesAndPreds = testData.map(point -> new Tuple2<>(model.predict(point.features()), point.label())); 

Then run:

MulticlassMetrics multiclassMetrics = new MulticlassMetrics(JavaRDD.toRDD(valuesAndPreds)); 
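
To make concrete what kind of numbers such metrics report, here is a plain-Java sketch that computes an accuracy and a confusion matrix locally from collected {prediction, label} rows. It is not Spark's implementation and does not call MulticlassMetrics: the class LocalMetrics, its methods, and the sample rows are hypothetical, and it assumes labels are whole numbers from 0 to numClasses - 1.

```java
import java.util.Arrays;
import java.util.List;

class LocalMetrics {

    // Fraction of rows whose prediction (index 0) equals the label (index 1).
    static double accuracy(List<double[]> rows) {
        if (rows.isEmpty()) {
            return 0.0; // arbitrary choice for an empty test set
        }
        long correct = 0;
        for (double[] row : rows) {
            if (row[0] == row[1]) {
                correct++;
            }
        }
        return (double) correct / rows.size();
    }

    // counts[actual][predicted]; assumes labels are integers in [0, numClasses).
    static int[][] confusionMatrix(List<double[]> rows, int numClasses) {
        int[][] counts = new int[numClasses][numClasses];
        for (double[] row : rows) {
            counts[(int) row[1]][(int) row[0]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical {prediction, label} rows, not real model output.
        List<double[]> rows = Arrays.asList(
            new double[]{0.0, 0.0},
            new double[]{1.0, 1.0},
            new double[]{2.0, 1.0},
            new double[]{1.0, 1.0});

        System.out.println("Accuracy: " + accuracy(rows));
        System.out.println(Arrays.deepToString(confusionMatrix(rows, 3)));
    }
}
```

For anything larger than a small test set, the distributed metrics classes are the better choice, since this sketch needs all rows in local memory.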

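All of this is well documented in Spark's MLlib documentation. Also, you mentioned needing to print the results. If this is homework, I will let you figure that part out, since learning how to do it from a list is a good exercise.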

Edit:

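I also noticed that you are using Java 7, while what I wrote above is Java 8. To answer your main question of how to turn this into a 2D double array, you would do: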

JavaRDD<double[]> valuesAndPreds = testData.map(new org.apache.spark.api.java.function.Function<LabeledPoint, double[]>() { 
       @Override 
       public double[] call(LabeledPoint point) { 
        return new double[]{model.predict(point.features()), point.label()}; 
       } 
      }); 

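Then run collect to get a list of double arrays with two elements each. Also, as a hint for the printing part, take a look at the toString implementation in java.util.Arrays.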
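
As one possible reading of that hint, the sketch below prints each collected row with Arrays.toString and pulls out just the predicted labels. It is plain Java with no Spark dependency: the class PrintPredictions, its helper methods, and the sample rows are hypothetical and only stand in for the list that collect returns.

```java
import java.util.Arrays;
import java.util.List;

class PrintPredictions {

    // Formats each {prediction, label} row on its own line via Arrays.toString.
    static String format(List<double[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : rows) {
            sb.append(Arrays.toString(row)).append('\n');
        }
        return sb.toString();
    }

    // Extracts only the predicted labels (index 0 of each row).
    static double[] predictedLabels(List<double[]> rows) {
        double[] predictions = new double[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            predictions[i] = rows.get(i)[0];
        }
        return predictions;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not real model output.
        List<double[]> values = Arrays.asList(
            new double[]{1.0, 1.0},
            new double[]{3.0, 2.0});

        System.out.print(format(values));
        System.out.println("Predicted labels: " + Arrays.toString(predictedLabels(values)));
    }
}
```

Printing a double[] directly with System.out.println shows only its type and hash code, which is why Arrays.toString is used here.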