我嵌套了JSON,并希望以表格结构输出。我能够单独解析JSON值,但在制表中存在一些问题。我可以通过数据框轻松完成。但我想用“RDD ONLY”功能来做。任何帮助非常感谢。使用Spark-Scala将表格结构压扁JSON RDD only fucntion
输入JSON:
{ "level":{"productReference":{
"prodID":"1234",
"unitOfMeasure":"EA"
},
"states":[
{
"state":"SELL",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":1400.0,
"stockKeepingLevel":"A"
}
},
{
"state":"HELD",
"effectiveDateTime":"2015-10-09T00:55:23.6345Z",
"stockQuantity":{
"quantity":800.0,
"stockKeepingLevel":"B"
}
}
] }}
预期输出:
我试过下面星火代码。但获取像这样的输出和Row()对象不能解析这个。
079562193 EA,List(SELLABLE,HELD),List(2015-10-09T00:55:23.6345Z,2015-10-09T00:55:23.6345Z),List(1400.0,800.0),List(SINGLE ,单)
def main(Args : Array[String]): Unit = {
val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val salesSchema = StructType(Array(
StructField("prodID", StringType, true),
StructField("unitOfMeasure", StringType, true),
StructField("state", StringType, true),
StructField("effectiveDateTime", StringType, true),
StructField("quantity", StringType, true),
StructField("stockKeepingLevel", StringType, true)
))
val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")
val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {
parse(eachJsonMessages)
}).map(insideEachJson=>{
implicit val formats = org.json4s.DefaultFormats
val prodID = (insideEachJson\ "level" \"productReference" \"TPNB").extract[String].toString
val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString
val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"state").extract[String]).toString()
val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"effectiveDateTime").extract[String]).toString
val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
toString
val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
toString
//Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
println(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)
}).collect()
// sqlContext.createDataFrame(x,salesSchema).show(truncate = false)
}
什么问题阻止了它的工作?看到适当的异常,编译器错误,无论如何都很难诊断问题。 – Phasmid
感谢您查看问题。行对象不能表示它。这就是我刚才把打印语句弄清楚的原因。我定义的Schema和我传递的Row()对象不匹配,所以我希望有任何帮助来解决这个问题。 –