Check whether a nested JSON column exists in a DataFrame

I have some JSON records like the following:

{"Location": 
    {"filter": 
     {"name": "houston", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "florida", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "seattle"}, 
    } 
}

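After reading the file with spark.read.json("myfile.json"), I want to filter out the rows that do not contain Disaster. In my example, the seattle row should be filtered out.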

I tried

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

but it gives me an error saying that the struct field Disaster does not exist.

So how can I do this?

Thanks

Source

2017-09-14 Chen Fan

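Your JSON data appears to be malformed, i.e. it cannot be read into a valid DataFrame with spark.read.json("myfile.json").

A similar problem can be solved by reading the file with the wholeTextFiles API and cleaning the text up before parsing it: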

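// Read the whole file as a single (path, content) pair
val rdd = sc.wholeTextFiles("myfile.json")
// Collapse each multi-line record onto one line, drop the trailing commas, then split into one JSON string per record
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))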

This should give you an RDD of valid JSON strings:

{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"seattle"}}}

Now you can read the JSON RDD into a DataFrame:

val df = sqlContext.read.json(json)

This should give you

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+

with the schema

root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)

Now that you have a valid DataFrame, you can apply the filter:

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

newTable will be

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
+---------------------+
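
Spark infers the schema from the data, so if no record in a file carries Disaster, the field is missing from the schema and the filter fails during analysis instead of returning an empty result. One way to guard against that is to check whether the nested path resolves before filtering. The sketch below is not part of the original answer: the helper hasColumn and the object name NestedColumnCheck are hypothetical, and it assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

object NestedColumnCheck {
  // Hypothetical helper: true when `path` resolves against the DataFrame's schema.
  // df(path) throws an AnalysisException for an unknown column or struct field.
  def hasColumn(df: DataFrame, path: String): Boolean =
    Try(df(path)).isSuccess

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-column-check")
      .getOrCreate()
    import spark.implicits._

    // Already-cleaned records, one JSON document per string.
    val json = Seq(
      """{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}""",
      """{"Location":{"filter":{"name":"seattle"}}}"""
    ).toDS()
    val df = spark.read.json(json)

    // Filter only when the nested field exists; otherwise return no rows.
    val newTable =
      if (hasColumn(df, "Location.filter.Disaster"))
        df.filter($"Location.filter.Disaster".isNotNull)
      else
        df.limit(0)

    newTable.show(false)
    spark.stop()
  }
}
```

Returning df.limit(0) when the field is absent is a design choice for this sketch: it keeps the schema and treats "no such field" the same as "no row has a value". Raising an error may be more appropriate in other pipelines.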

Source

2017-09-15 02:18:14
