2017-04-26 125 views

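How do I preserve types when converting a Scala Spark DataFrame to an RDD?

I am trying to convert a DataFrame to an RDD. My DataFrame has typed columns, like this: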

df.printSchema 
root 
|-- _c0: integer (nullable = true) 
|-- num_hits: integer (nullable = true) 
|-- session_name: string (nullable = true) 
|-- user_id: string (nullable = true) 

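When I convert it to an RDD using df.rdd, I get an RDD of type Array[org.apache.spark.sql.Row], but when I access the entries using rdd(0)(0), rdd(0)(1), etc., they all have type Any. How do I keep the DataFrame's typing when converting it to an RDD? In other words: how can I get the columns in my RDD to have the types Int, Int, String, String, so that they match those of the DataFrame?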

Answers

You just need to convert your DataFrame to a Dataset[(Int, Int, String, String)], like this:

scala> val df = Seq((1, 2, "a", "b")).toDF("_c0", "num_hits", "session_name", "user_id") 
df: org.apache.spark.sql.DataFrame = [_c0: int, num_hits: int ... 2 more fields] 

scala> df.printSchema 
root 
|-- _c0: integer (nullable = false) 
|-- num_hits: integer (nullable = false) 
|-- session_name: string (nullable = true) 
|-- user_id: string (nullable = true) 


scala> val rdd = df.as[(Int, Int, String, String)].rdd 
rdd: org.apache.spark.rdd.RDD[(Int, Int, String, String)] = MapPartitionsRDD[3] at rdd at <console>:25 

If _c0 or num_hits can be null, just change Int to java.lang.Integer.
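As a minimal sketch of the nullable case, the program below builds a small DataFrame with a null in _c0 and converts it both ways. The sample data, the object name NullableColumnsSketch, the helper typedRows, and the local SparkSession are assumptions made for illustration; Option[Int] is shown as an alternative to java.lang.Integer that Spark's built-in encoders also accept.

```scala
import org.apache.spark.sql.SparkSession

object NullableColumnsSketch {
  // Hypothetical helper: returns the rows as typed tuples, with Option[Int]
  // standing in for the nullable integer columns.
  def typedRows(spark: SparkSession): Array[(Option[Int], Option[Int], String, String)] = {
    import spark.implicits._

    // Illustrative sample data (not from the question) with a null in _c0.
    val df = Seq[(Option[Int], Option[Int], String, String)](
      (Some(1), Some(2), "a", "b"),
      (None, Some(5), "c", "d")
    ).toDF("_c0", "num_hits", "session_name", "user_id")

    // java.lang.Integer variant, as suggested above: a null column value
    // arrives as a null reference in the tuple.
    val boxed = df.as[(java.lang.Integer, java.lang.Integer, String, String)].rdd
    boxed.collect().foreach(println)

    // Option[Int] variant: a null column value arrives as None.
    df.as[(Option[Int], Option[Int], String, String)].rdd.collect()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullable-columns-sketch")
      .getOrCreate()
    try typedRows(spark).foreach(println)
    finally spark.stop()
  }
}
```

Either variant keeps the RDD element type explicit; which one reads better depends on whether the downstream code prefers null checks or pattern matching on Option.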

That did it. Thanks! Is there a reason df.rdd doesn't pick up the types? – tSchema

Because a DataFrame doesn't know what type you want. `as[(Int, Int, String, String)]` basically just tells Spark that you want to convert each Row to `(Int, Int, String, String)`. – zsxwing
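To illustrate the point in this comment: df.rdd yields an RDD[Row], and indexing a Row returns Any because the column types are known only from the schema at runtime. The sketch below shows a way to recover static types by hand with Row's typed getters, as an alternative to as[...]. The object name RowAccessSketch and the helper toTuples are hypothetical, and the local SparkSession is assumed for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowAccessSketch {
  // Hypothetical helper: maps each Row to a typed tuple using Row's getters.
  // The getters are checked at runtime, so a wrong column type fails there,
  // not at compile time.
  def toTuples(df: DataFrame): RDD[(Int, Int, String, String)] =
    df.rdd.map { row =>
      (
        row.getInt(0),                 // _c0, by position
        row.getInt(1),                 // num_hits, by position
        row.getString(2),              // session_name, by position
        row.getAs[String]("user_id")   // user_id, by field name
      )
    }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-access-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val df = Seq((1, 2, "a", "b")).toDF("_c0", "num_hits", "session_name", "user_id")

      // Indexing a Row gives a value statically typed as Any.
      val untyped: Any = df.rdd.first()(0)
      println(untyped)

      toTuples(df).collect().foreach(println)
    } finally spark.stop()
  }
}
```

Note that getInt throws on a null value, so for nullable columns the Dataset conversion with java.lang.Integer described in the answer is usually the simpler route.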
