
1) I need to write code that reads a JSON file in Spark. I am using spark.read.json("sample.json"), but the read fails even for a simple JSON file like the one below:

{ 
    {"id" : "1201", "name" : "satish", "age" : "25"} 
    {"id" : "1202", "name" : "krishna", "age" : "28"} 
    {"id" : "1203", "name" : "amith", "age" : "39"} 
    {"id" : "1204", "name" : "javed", "age" : "23"} 
    {"id" : "1205", "name" : "prudvi", "age" : "23"} 
} 

and I get this incorrect result:

+---------------+----+----+-------+
|_corrupt_record| age|  id|   name|
+---------------+----+----+-------+
|              {|null|null|   null|
|           null|  25|1201| satish|
|           null|  28|1202|krishna|
|           null|  39|1203|  amith|
|           null|  23|1204|  javed|
|           null|  23|1205| prudvi|
|              }|null|null|   null|
+---------------+----+----+-------+

I found the example above here.
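For completeness, a minimal sketch of the read that produces the table above, assuming a SparkSession named spark; Spark cannot parse the stray { and } lines, so by default it routes them into the _corrupt_record column:

# Minimal repro sketch: reads the file shown above; lines that fail
# to parse land in the default _corrupt_record column.
df = spark.read.json("sample.json")
df.show()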

2) Also, I have no idea how to handle an improperly formatted JSON file like the one below:

{
    "title": "Person",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

I find working with files like these difficult. Is there any coherent way to deal with such JSON files from Spark in Java/Scala?

Please help, and thanks!

Answer


Spark's JSON data source expects one complete JSON object per line (the JSON Lines format), so your file should look like this:

{"id" : "1201", "name" : "satish", "age" : "25"} 
{"id" : "1202", "name" : "krishna", "age" : "28"} 
{"id" : "1203", "name" : "amith", "age" : "39"} 
{"id" : "1204", "name" : "javed", "age" : "23"} 
{"id" : "1205", "name" : "prudvi", "age" : "23"} 

and the code is:

%spark.pyspark

# Zeppelin's PySpark interpreter predefines sqlContext
sqlc = sqlContext

# input file on HDFS
file_json = "hdfs://mycluster/user/test/test.json"

df = sqlc.read.json(file_json)
df.registerTempTable("myfile")  # use createOrReplaceTempView on Spark 2.0+

df2 = sqlc.sql("SELECT * FROM myfile")
df2.show()

Output:

+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
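For the second file in the question (a single pretty-printed JSON document spread over many lines), a minimal sketch, assuming Spark 2.2 or later, where the JSON reader accepts a multiLine option that parses each file as one JSON document instead of one document per line; the HDFS path below is hypothetical:

%spark.pyspark

# Sketch for question 2, assuming Spark 2.2+ (multiLine option);
# the path is a hypothetical location for the pretty-printed file.
df3 = sqlc.read.option("multiLine", "true").json("hdfs://mycluster/user/test/person.json")
df3.printSchema()
df3.show(truncate=False)

On older Spark versions, the usual workaround is sc.wholeTextFiles plus an explicit JSON parse of each file's contents.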