2017-09-14 64 views

Check whether a nested JSON column exists in a DataFrame

Say I have some JSON records like the following:

{"Location": 
    {"filter": 
     {"name": "houston", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "florida", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "seattle"}, 
    } 
} 

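After reading the file with spark.read.json("myfile.json"), I want to filter out the rows that do not contain Disaster. In my example, the seattle row should be filtered out.

I tried: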

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

But this gives me an error saying that the field Disaster does not exist in the struct.

So how can I do this?

Thanks

Answer


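The JSON data appears to be malformed (each record spans several lines and has a trailing comma), so it cannot be read into a valid DataFrame with spark.read.json("myfile.json").

A similar problem can be solved by reading the file with the wholeTextFiles API and cleaning it up before building the DataFrame: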

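// wholeTextFiles yields (path, content) pairs; _._2 is the full text of the file
val rdd = sc.wholeTextFiles("myfile.json")
// Join each record onto one line, drop trailing commas and spaces, then split back into one JSON object per element
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))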

This should give you an RDD of valid JSON strings, one per record:

{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"seattle"}}} 
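
The replace chain above can be tried on a plain string without a Spark cluster. The following is a minimal sketch: `NormalizeJson`, `normalize`, and `sample` are hypothetical names, and the extra first step that strips whitespace before each newline is an added assumption, so that the ":\n" and ",\n" patterns still match if the lines in the file end with spaces.

```scala
object NormalizeJson {
  // Hypothetical helper: the same replace chain as above, applied to a plain String.
  def normalize(raw: String): Array[String] =
    raw
      .replaceAll("[ \\t\\r]*\\n", "\n") // assumption: drop trailing whitespace before each newline
      .replace(":\n", ":")               // join a key with the value on the next line
      .replace(",\n", "")                // drop the trailing comma and its line break
      .replace("}\n", "}")               // join the closing braces
      .replace(" ", "")                  // remove all spaces (also those inside string values)
      .replace("}{", "}\n{")             // put each top-level object on its own line
      .split("\n")

  // Two records in the same shape as the question's data.
  val sample: String =
    """{"Location":
      |    {"filter":
      |     {"name": "houston", "Disaster": "hurricane"},
      |    }
      |}
      |{"Location":
      |    {"filter":
      |     {"name": "seattle"},
      |    }
      |}""".stripMargin

  val records: Array[String] = normalize(sample)
}
```

Because the chain removes every space, it would also alter string values that contain spaces; it suits data like this example, where the values are single words.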

Now you can read the JSON RDD into a DataFrame:

val df = sqlContext.read.json(json) 

This should give you:

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+

with the schema:

root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)
Now that you have a valid DataFrame, you can apply the filter you wanted:

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

newTable then contains:

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
+---------------------+
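
If no record in a file carries Disaster at all, the field never enters the inferred schema and the filter fails with the "no such struct field" error from the question. One possible guard, sketched below, checks whether the nested path resolves before filtering. `NestedColumnGuard`, `hasColumn`, and `withDisaster` are hypothetical names, not part of the Spark API, and returning an empty result when the field is absent is an assumption about the desired behaviour.

```scala
import scala.util.Try
import org.apache.spark.sql.DataFrame

object NestedColumnGuard {
  // Hypothetical helper: true when `path` resolves against the DataFrame's schema.
  // df(path) throws an AnalysisException for an unresolvable column, which Try captures.
  def hasColumn(df: DataFrame, path: String): Boolean =
    Try(df(path)).isSuccess

  // Keep the rows whose nested Disaster field is present and non-null.
  // Assumption: if the field is missing from the schema, no row has it, so return no rows.
  def withDisaster(df: DataFrame): DataFrame =
    if (hasColumn(df, "Location.filter.Disaster"))
      df.filter(df("Location.filter.Disaster").isNotNull)
    else
      df.limit(0)
}
```

With the DataFrame from this answer, NestedColumnGuard.withDisaster(df) keeps the houston and florida rows, the same as the direct filter.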