Spark: case-sensitive partitionBy column

I am trying to write out a DataFrame from a hiveContext, partitioned by a key column, in ORC format:

df.write().partitionBy("event_type").mode(SaveMode.Overwrite).orc("/path");

However, the column I am partitioning on contains values that differ only in case, and the write throws this error:

Caused by: java.io.IOException: File already exists: file:/path/_temporary/0/_temporary/attempt_201607262359_0001_m_000000_0/event_type=searchFired/part-r-00000-57167cfc-a9db-41c6-91d8-708c4f7c572c.orc

The event_type column contains both searchFired and SearchFired as values. If I remove either one from the DataFrame, the write succeeds. How can I solve this?


2016-07-26 nish

Relying on case distinctions in the file system is generally not a good idea.
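To illustrate why, here is a minimal plain-Java sketch (no Spark involved), assuming the target file system compares names case-insensitively, as the default macOS and Windows file systems do. The class and method names are hypothetical, and lower-casing is only an approximation of how such a file system matches names. It builds the column=value directory names that partitioned output uses and groups the ones that would land on the same path:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Spark.
class PartitionCaseCollision {

    // Partitioned output is laid out as <column>=<value> directories.
    static String partitionDir(String column, String value) {
        return column + "=" + value;
    }

    // Groups directory names by their lower-cased form, which approximates
    // how a case-insensitive file system would treat them as the same path.
    static Map<String, List<String>> groupByCaseInsensitiveName(String column, List<String> values) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String value : values) {
            String dir = partitionDir(column, value);
            groups.computeIfAbsent(dir.toLowerCase(Locale.ROOT), k -> new ArrayList<>()).add(dir);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("searchFired", "SearchFired", "pageView");
        // Report every group in which more than one directory name maps to the same path.
        groupByCaseInsensitiveName("event_type", values).forEach((key, dirs) -> {
            if (dirs.size() > 1) {
                System.out.println("collision: " + dirs);
            }
        });
    }
}
```

With the values from the question, event_type=searchFired and event_type=SearchFired fall into the same group, which would be consistent with the "File already exists" error if the file system is indeed case-insensitive.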

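The solution is to fold the values that differ only in case into the same partition, using something like the following (written with the Scala DSL):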

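import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

// Partition on a lower-cased copy so searchFired and SearchFired share one directory.
df
    .withColumn("par_event_type", expr("lower(event_type)"))
    .write
    .partitionBy("par_event_type")
    .mode(SaveMode.Overwrite)
    .orc("/path")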

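This adds an extra column that is used only for partitioning. If that causes problems, you can use drop to remove it when reading the data back.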


2016-07-26 23:14:46 Sim
