I am building a set of distribution analyses with the Java Spark library. This is the actual code I use to read the data from a JSON file and save the output. How can I run a query whose result combines the values of two columns?
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution = spark
.sql(" ****REQUESTED QUERY ****")
.toJavaRDD()
.map(row -> Triple.of(row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
.map(GenericAnalyticsEntry::of)
.collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
What I'm looking for is a query based on this structure:
+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE       | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza      | Spaghetti   | 11/02/2017 | TRUE       |
| Lasagna    | Pizza       | 12/02/2017 | TRUE       |
| Spaghetti  | Spaghetti   | 13/02/2017 | FALSE      |
| Pizza      | Spaghetti   | 14/02/2017 | TRUE       |
| Spaghetti  | Lasagna     | 15/02/2017 | FALSE      |
| Pork       | Mozzarella  | 16/02/2017 | FALSE      |
| Lasagna    | Mozzarella  | 17/02/2017 | FALSE      |
+------------+-------------+------------+------------+
How can I get the output below (written underneath) from the code above?
+------------+--------------------+----------------------+
| FOODS      | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza      | 2                  | 1                    |
| Lasagna    | 2                  | 1                    |
| Spaghetti  | 2                  | 3                    |
| Mozzarella | 0                  | 2                    |
| Pork       | 1                  | 0                    |
+------------+--------------------+----------------------+
I have, of course, tried to work out a solution myself, but I had no luck with my attempts. I may be wrong, but I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
This code gives me the following result: "Exception in thread 'main' org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'menus.`first_food`' is not an aggregate function. Wrap '(count(1) AS `first_count`)' in windowing function(s) or wrap 'menus.`first_food`' in first() (or first_value) if you don't care which value you get."
My bad: I had left out the GROUP BY clause. Fixed now.
Edit: I needed to fix the name in the first line of the SELECT, and now it is working... but for foods that never appear in one of the two columns the result is NULL instead of 0. Is there a way to fix this?
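One way to sidestep the NULL problem entirely is conditional aggregation: fold both columns into a single food column with 0/1 flags before counting. This is a sketch against the `cs_food` view registered above, not tested on your data:

```sql
SELECT food,
       SUM(is_first)  AS first_count,   -- times the food appears as FIRST_FOOD
       SUM(is_second) AS second_count   -- times the food appears as SECOND_FOOD
FROM (
    SELECT first_food  AS food, 1 AS is_first, 0 AS is_second FROM cs_food
    UNION ALL
    SELECT second_food AS food, 0 AS is_first, 1 AS is_second FROM cs_food
) AS all_foods
GROUP BY food
```

Because SUM runs over explicit 0/1 flags, a food missing from one column gets 0 there rather than NULL, and the three output columns line up with the `row.getString(0)` / `row.getLong(1)` / `row.getLong(2)` mapping in the code above. If your working query instead joins two grouped subqueries, wrapping each count in COALESCE(..., 0) should turn the NULLs into zeros the same way.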