We need to compute distance measures such as Jaccard similarity across a large dataset in Apache Spark, applying a map function over the whole dataset. We are facing a few problems and would appreciate some guidance.
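For context, Jaccard similarity between two sentences can be computed over their token sets as |A ∩ B| / |A ∪ B|. Below is a minimal plain-Java sketch of that definition; note that the info.debatty library may instead use character k-shingles, so its scores can differ from this token-based version:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Jaccard similarity over whitespace-separated token sets: |A ∩ B| / |A ∪ B|
    static double similarity(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // 4 shared tokens out of 6 distinct tokens
        System.out.println(similarity("Hi I heard about Spark", "Hi I Know about Spark"));
    }
}
```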
Issue 1
import info.debatty.java.stringsimilarity.Jaccard;

// Sample dataset creation
List<Row> data = Arrays.asList(
    RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
    RowFactory.create("I wish Java could use case classes", "I wish C# could use case classes"),
    RowFactory.create("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat"));

StructType schema = new StructType(new StructField[] {
    new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });

Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

// Distance measure object creation
Jaccard jaccard = new Jaccard();

// Apply the distance measure to each element of the dataset
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row ->
        "Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
    Encoders.STRING());

sentenceDataFrame1.show();
There are no compile-time errors, but at runtime we get an exception:

org.apache.spark.SparkException: Task not serializable
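This exception happens because the lambda captures jaccard from the enclosing scope, so Spark must serialize the whole closure to ship it to executors, and the captured object (or the enclosing class holding it) does not implement java.io.Serializable. Typical fixes are to create the Jaccard instance inside the lambda, hold it in a static field, or wrap it in a serializable class whose helper field is transient and re-created lazily. The plain-Java sketch below simulates Spark's serialization round trip; JaccardFn and its placeholder logic are illustrative, not the library's API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableClosureDemo {
    // Serializable function wrapper: the non-serializable helper is transient
    // and rebuilt after deserialization (i.e., once per executor JVM in Spark).
    static class JaccardFn implements Serializable {
        private transient Object helper; // stand-in for a non-serializable Jaccard instance

        double apply(String a, String b) {
            if (helper == null) {
                helper = new Object(); // lazily re-created on each JVM
            }
            // Placeholder logic; a real version would delegate to the helper.
            return a.equals(b) ? 1.0 : 0.0;
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate Spark shipping the closure: serialize, then deserialize.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(new JaccardFn());
        JaccardFn fn = (JaccardFn) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(fn.apply("spark", "spark")); // works after the round trip
    }
}
```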
Issue 2
We also need to find which pair has the highest score, which requires declaring some variables, and we need to perform other computations as well; we are facing many difficulties with this. Even when I try to declare a simple variable such as a counter inside the map block, we cannot capture the incremented value. If we declare it outside the map block, we get compile-time errors.
int counter = 0;

Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> {
        System.out.println("Name: " + row.getString(1));
        //int counter = 0;
        counter++;
        System.out.println("Counter: " + counter);
        return counter + "";
    },
    Encoders.STRING());
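The counter++ line cannot compile because local variables captured by a Java lambda must be final or effectively final. And even with a mutable container, each Spark executor would increment its own copy, so the driver would never see the total; Spark's mechanism for a cluster-wide count is an accumulator (spark.sparkContext().longAccumulator()). Within a single JVM, the effectively-final restriction itself can be sidestepped by mutating a container such as AtomicLong, as this sketch shows:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class CounterDemo {
    // The lambda may not reassign a captured local, but it may mutate a
    // container such as AtomicLong. NOTE: inside a Spark map this would only
    // update a per-executor copy; use a LongAccumulator for a cluster-wide count.
    public static long countRows(List<String> rows) {
        AtomicLong counter = new AtomicLong();
        rows.forEach(row -> {
            long n = counter.incrementAndGet();
            System.out.println("Counter: " + n);
        });
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countRows(Arrays.asList("a", "b", "c"))); // 3
    }
}
```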
Please give us some pointers. Thank you.
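For finding the pair with the highest score, mutable driver-side state is not needed: emit the score together with each pair, then reduce to the maximum. In Spark the analogous shape is a map that produces (pair, score) followed by a reduce (or an orderBy on a score column). The stream-based sketch below shows the same shape in plain Java, using a toy token-set Jaccard as the score:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MaxPairDemo {
    // Toy score: Jaccard over whitespace-separated token sets.
    static double score(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        sa.retainAll(sb);
        return (double) sa.size() / union.size();
    }

    // Score every pair, then reduce to the maximum -- no shared counter needed.
    // In Spark: map each Row to (pair, score), then reduce keeping the larger score.
    public static Map.Entry<String[], Double> bestPair(List<String[]> pairs) {
        return pairs.stream()
                .map(p -> new SimpleEntry<>(p, score(p[0], p[1])))
                .max((x, y) -> Double.compare(x.getValue(), y.getValue()))
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"Hi I heard about Spark", "Hi I Know about Spark"},
                new String[] {"Logistic regression models", "Logistic regression models"});
        Map.Entry<String[], Double> best = bestPair(pairs);
        System.out.println(best.getKey()[0] + " | " + best.getValue());
    }
}
```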