2017-04-05 41 views
0

我是新的斯卡拉和火花,不知道如何爆炸“路径”字段,发现一个通过最大和最小“event_dttm”字段。 我有一个数据:如何爆炸阵列[字符串]字段和组数据一次传递

val weblog=sc.parallelize(Seq(
    ("39f0412b4c91","staticnavi.com", Seq("panel", "cm.html"), 1424954530, "SO.01"), 
    ("39f0412b4c91","staticnavi.com", Seq("panel", "cm.html"), 1424964830, "SO.01"), 
    ("39f0412b4c91","staticnavi.com", Seq("panel", "cm.html"), 1424978445, "SO.01"), 
    )).toDF("id","domain","path","event_dttm","load_src") 

我一定要得到一个结果:

"id"  | "domain" |"newPath" | "max_time" | min_time | "load_src" 
39f0412b4c91|staticnavi.com| panel | 1424978445 | 1424954530 | SO.01 
39f0412b4c91|staticnavi.com| cm.html | 1424978445 | 1424954530 | SO.01 

我认为这是通过行功能能够实现,但不知道怎么办。

回答

1

您正在寻找explode(),随后groupBy聚集:

import org.apache.spark.sql.functions.{explode, min, max} 

var result = weblog.withColumn("path", explode($"path")) 
    .groupBy("id","domain","path","load_src") 
    .agg(min($"event_dttm").as("min_time"), 
     max($"event_dttm").as("max_time")) 

result.show() 
+------------+--------------+-------+--------+----------+----------+ 
|   id|  domain| path|load_src| min_time| max_time| 
+------------+--------------+-------+--------+----------+----------+ 
|39f0412b4c91|staticnavi.com| panel| SO.01|1424954530|1424978445| 
|39f0412b4c91|staticnavi.com|cm.html| SO.01|1424954530|1424978445| 
+------------+--------------+-------+--------+----------+----------+ 
+0

谢谢!工作正常。 有没有使用爆炸的另一种方法? – Fred

+0

使用'rdd' api,但这会更加精细,可能会更慢。 – mtoto

+0

我找到了使用flatMap的解决方案:val result = weblog.flatMap {row:(id:String,domain:String,path:String,event_dttm:Long,load_src:String,ymd:String)=> { } split(“/”)。map(x =>(id,domain.concat(“#”)。concat(x),BigInt(event_dttm),load_src,ymd)) }} – Fred