If each URL is on its own line, you can use foreach:
SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> textFile = context.textFile("urlFile");
textFile.foreach(new VoidFunction<String>() {
    public void call(String line) {
        // this code is executed in parallel for each line (URL) in the file
        ExtractTrainingData ed = new ExtractTrainingData();
        List<Elements> list = ed.getElementList(line);
        ed.processElementList(line, list);
    }
});
If the resulting lists should also be processed in parallel:
SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> textFile = context.textFile("urlFile");
textFile.map(new Function<String, List<Elements>>() {
    public List<Elements> call(String line) {
        // this code is executed in parallel for each line (URL) in the file
        ExtractTrainingData ed = new ExtractTrainingData();
        return ed.getElementList(line);
    }
}).flatMap(list -> list.iterator())
  .foreach((Elements element) -> {
      // here put the code that is in processElementList
  });
I used lambda syntax above; you can of course use anonymous classes instead.
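For reference, the two styles are interchangeable. A minimal sketch with java.util.function.Function (Spark's own org.apache.spark.api.java.function.Function is used the same way, just with `call` instead of `apply`):

```java
import java.util.function.Function;

public class LambdaExample {
    // lambda form
    static final Function<String, Integer> LEN_LAMBDA = s -> s.length();

    // equivalent anonymous-class form (pre-Java-8 style)
    static final Function<String, Integer> LEN_ANON = new Function<String, Integer>() {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    };

    public static void main(String[] args) {
        System.out.println(LEN_LAMBDA.apply("spark")); // prints 5
        System.out.println(LEN_ANON.apply("spark"));   // prints 5
    }
}
```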
Edit: make sure Elements is serializable.
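One way to check this locally is a round trip through Java serialization. `ElementData` below is a hypothetical stand-in for your element type; if the real Elements class is not Serializable, convert the extracted data into a type like this before returning it from map:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// hypothetical serializable carrier for extracted element data
public class ElementData implements Serializable {
    private static final long serialVersionUID = 1L;
    final String html;

    ElementData(String html) {
        this.html = html;
    }

    // serialize any object to bytes, as Spark would when shipping it between nodes
    static byte[] toBytes(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ElementData d = (ElementData) fromBytes(toBytes(new ElementData("<p>hi</p>")));
        System.out.println(d.html); // prints <p>hi</p>
    }
}
```

If the round trip throws NotSerializableException, the same failure will show up in Spark as a task serialization error.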
What kind of processing do you want to do on the text file? I am guessing it sends an HTTP request for each URL? Do you want the request results in a single RDD? –