2017-08-04

I am using the Java Spark library to build a series of distribution analyses. This is the actual code I use to read the data from a JSON file and save the output. How can I run a query over a field that is merged from two columns?

Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");

List<GenericAnalyticsEntry> menu_distribution = spark
        .sql(" ****REQUESTED QUERY ****")
        .toJavaRDD()
        .map(row -> Triple.of(row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
        .map(GenericAnalyticsEntry::of)
        .collect();

writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);

What I am looking for is a query based on data with this structure:

+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE       | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza      | Spaghetti   | 11/02/2017 | TRUE       |
+------------+-------------+------------+------------+
| Lasagna    | Pizza       | 12/02/2017 | TRUE       |
+------------+-------------+------------+------------+
| Spaghetti  | Spaghetti   | 13/02/2017 | FALSE      |
+------------+-------------+------------+------------+
| Pizza      | Spaghetti   | 14/02/2017 | TRUE       |
+------------+-------------+------------+------------+
| Spaghetti  | Lasagna     | 15/02/2017 | FALSE      |
+------------+-------------+------------+------------+
| Pork       | Mozzarella  | 16/02/2017 | FALSE      |
+------------+-------------+------------+------------+
| Lasagna    | Mozzarella  | 17/02/2017 | FALSE      |
+------------+-------------+------------+------------+

How can I get the output below from the code written above?

+------------+--------------------+----------------------+
| FOODS      | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza      | 2                  | 1                    |
+------------+--------------------+----------------------+
| Lasagna    | 2                  | 1                    |
+------------+--------------------+----------------------+
| Spaghetti  | 2                  | 3                    |
+------------+--------------------+----------------------+
| Mozzarella | 0                  | 2                    |
+------------+--------------------+----------------------+
| Pork       | 1                  | 0                    |
+------------+--------------------+----------------------+

I have of course tried to work out a solution myself, but had no luck with my attempts. I may be wrong, but I think I need something like this:

"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu" 

Answers


From the example data, this looks like it produces the output you want:

select 
    foods, 
    first_count, 
    second_count 
from 
    (select first_food as food from menus 
    union select second_food from menus) as f 
    left join (
     select first_food, count(*) as first_count from menus 
     group by first_food 
     ) as ff on ff.first_food=f.food 
    left join (
     select second_food, count(*) as second_count from menus 
     group by second_food 
     ) as sf on sf.second_food=f.food 
; 
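For reference, a minimal sketch of how this query could be dropped into the pipeline from the question. It assumes the temp view is the cs_food registered above (so "menus" becomes "cs_food"), it refers to f.food in the outer select because that is the alias introduced by the union subquery (see the comment exchange below), and it reads the two counts defensively because they can be NULL for foods that never appear in a given column:

// sketch: the answer's query wired into the question's pipeline (assumptions noted above)
String countQuery =
    "select f.food as foods, ff.first_count, sf.second_count "
        + "from (select first_food as food from cs_food "
        + "      union select second_food from cs_food) as f "
        + "left join (select first_food, count(*) as first_count from cs_food "
        + "           group by first_food) as ff on ff.first_food = f.food "
        + "left join (select second_food, count(*) as second_count from cs_food "
        + "           group by second_food) as sf on sf.second_food = f.food";

// same pipeline as in the question, with NULL-safe reads for the two counts
List<GenericAnalyticsEntry> menu_distribution = spark
        .sql(countQuery)
        .toJavaRDD()
        .map(row -> Triple.of(
                row.getString(0),
                BigDecimal.valueOf(row.isNullAt(1) ? 0L : row.getLong(1)),
                BigDecimal.valueOf(row.isNullAt(2) ? 0L : row.getLong(2))))
        .map(GenericAnalyticsEntry::of)
        .collect();

writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);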

This code gives me this result: "Exception in thread 'main' org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'menus.`first_food`' is not an aggregate function. Wrap '(count(1) AS `first_count`)' in windowing function(s) or wrap 'menus.`first_food`' in first() (or first_value) if you don't care which value you get."


My bad: I left out the GROUP BY clause. Fixed now.


Edit: I had to replace "foods" with "food" in the first line of the select and now it is working... but for the foods that never appear in one of the columns the result is NULL instead of 0. Is there a way to fix that?
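As for getting 0 instead of NULL for those missing foods: a common approach is to wrap the joined counts in COALESCE in the outer select, roughly like this (a sketch of the select clause only; the from / left join part stays exactly as in the answer above):

select
    f.food as foods,
    coalesce(ff.first_count, 0) as first_count,
    coalesce(sf.second_count, 0) as second_count
-- from / left join clauses unchanged from the answer above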


A simple combination of flatMap and groupBy should do the job (sorry, I cannot check right now whether it is 100% correct):

import spark.sqlContext.implicits._
import org.apache.spark.sql.Row

val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
// expand each row into (food, firstFlag, secondFlag), one record per column the food appears in
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
    .groupBy("_1")