
Spark 1.6: drop a column in a DataFrame with escaped column names

I'm trying to drop a column in a DataFrame, but I have column names with dots in them, which I escaped.

Before I escaped them, my schema looked like this:

root 
|-- user_id: long (nullable = true) 
|-- hourOfWeek: string (nullable = true) 
|-- observed: string (nullable = true) 
|-- raw.hourOfDay: long (nullable = true) 
|-- raw.minOfDay: long (nullable = true) 
|-- raw.dayOfWeek: long (nullable = true) 
|-- raw.sensor2: long (nullable = true) 

If I try to drop a column, I get:

df = df.drop("hourOfWeek") 
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id; 
     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) 
     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) 
     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) 
     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) 
     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) 
     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53) 

Note that I'm not even trying to drop a column whose name contains a dot. Since I couldn't seem to do much without escaping the column names, I converted the schema to:

root 
|-- user_id: long (nullable = true) 
|-- hourOfWeek: string (nullable = true) 
|-- observed: string (nullable = true) 
|-- `raw.hourOfDay`: long (nullable = true) 
|-- `raw.minOfDay`: long (nullable = true) 
|-- `raw.dayOfWeek`: long (nullable = true) 
|-- `raw.sensor2`: long (nullable = true) 

That doesn't seem to help, though; I still get the same error.
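(The question doesn't show how the names were escaped; a minimal sketch, assuming toDF is used to rename the columns and only the dotted names get wrapped, as in the schema above:

// Hypothetical: wrap each name containing a dot in literal backticks 
val escaped = df.toDF(df.columns.map(c => if (c.contains(".")) s"`$c`" else c): _*) 
// Caveat: toDF makes the backticks part of the name itself, which is 
// why later lookups that strip them fail, as in the error further down.)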

I also tried escaping every column name and dropping by the escaped name, but that doesn't work either:

root 
|-- `user_id`: long (nullable = true) 
|-- `hourOfWeek`: string (nullable = true) 
|-- `observed`: string (nullable = true) 
|-- `raw.hourOfDay`: long (nullable = true) 
|-- `raw.minOfDay`: long (nullable = true) 
|-- `raw.dayOfWeek`: long (nullable = true) 
|-- `raw.sensor2`: long (nullable = true) 

df.drop("`hourOfWeek`") 
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`; 
     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) 
     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) 

Is there another way to drop a column that doesn't fail on this kind of data?

Answers


OK, I seem to have found a solution after all:

df.drop(df.col("raw.hourOfWeek")) seems to work.
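A minimal sketch of why this works, assuming Spark 1.6 and the original (unescaped) schema; raw.hourOfDay stands in for any dotted name:

// In 1.6, drop(String) rebuilds the projection by re-parsing every 
// remaining column name, and the parser reads a dot as struct-field 
// access, so sibling columns like raw.hourOfDay break resolution. 
// drop(Column) resolves the name against this DataFrame directly, 
// treating it as a single quoted identifier. 
val cleaned = df.drop(df.col("raw.hourOfDay")) 
cleaned.printSchema() // raw.hourOfDay is gone, the other columns remain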


Useful answer. But I have a similar question: say I have around 100 columns in a Spark DataFrame. Is there a way to select just a few of those columns and create another DataFrame with only the selected columns? Something like df2 = df1.select(df.col("col1","col2")) – JKC


I think this https://stackoverflow.com/questions/36131716/scala-spark-dataframe-dataframe-select-multiple-columns-given-a-sequence-of-co answers your question – MrE
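For completeness, a sketch of the approach from that link, assuming the names live in a plain Seq (col1 and col2 are the hypothetical names from the comment):

val wanted = Seq("col1", "col2") 
// df.col resolves each name individually, so this also survives 
// dotted column names, per the accepted answer above. 
val df2 = df.select(wanted.map(df.col): _*)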

val data = df.drop("Customers"); 

works fine for normal column names. For a column with a dot in its name, use:

val newDf = df.drop(df.col("old.column")); // "new" is a reserved word in Scala, hence the rename

The point is columns with dots in the name. – MrE


Thanks for pointing that out @MrE –
