2017-10-21 136 views

I am writing a Scala/Spark program that finds the employee with the highest salary. The employee data is provided as a CSV file, and the salary column uses a comma as the thousands separator and also has a $ prefix, e.g. $ 74,628.00. Spark error: Exception in thread "main" java.lang.UnsupportedOperationException

To deal with the commas and the dollar sign, I have written a parser function in Scala that splits each line on "," and then maps each column to an individual variable, which is assigned to a case class.

My parser function looks like the following. To eliminate the commas and the dollar sign, I use the replace function to replace them with the empty string, and then finally convert the string to an Int.

def ParseEmployee(line: String): Classes.Employee = { 
    val fields = line.split(",") 
    val Name = fields(0) 
    val JOBTITLE = fields(2) 
    val DEPARTMENT = fields(3) 
    val temp = fields(4) 

    // String.replace returns a new string, so the results must be kept 
    val cleaned = temp.replace(",", "") // to eliminate the , 
                      .replace("$", "") // to remove the $ 
    val EMPLOYEEANNUALSALARY = cleaned.toInt // convert the string to Int 

    Classes.Employee(Name, JOBTITLE, DEPARTMENT, EMPLOYEEANNUALSALARY) 
} 

My case class looks like the following:

case class Employee ( 
    Name: String, 
    JOBTITLE: String, 
    DEPARTMENT: String, 
    EMPLOYEEANNUALSALARY: Number 
) 

My Spark DataFrame SQL query looks like the following:

val empMaxSalaryValue = sc.sqlContext.sql("Select Max(EMPLOYEEANNUALSALARY) From EMP") 
empMaxSalaryValue.show 

When I run this program, I get the following exception:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Number 
- field (class: "java.lang.Number", name: "EMPLOYEEANNUALSALARY") 
- root class: "Classes.Employee" 
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625) 
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619) 
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
    at scala.collection.immutable.List.foreach(List.scala:381) 
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
    at scala.collection.immutable.List.flatMap(List.scala:344) 
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607) 
    at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438) 
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71) 
    at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) 
    at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:282) 
    at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:272) 
    at CalculateMaximumSalary$.main(CalculateMaximumSalary.scala:27) 
    at CalculateMaximumSalary.main(CalculateMaximumSalary.scala) 
  1. Any idea why I am getting this error? What am I doing wrong here, and why can't it convert the number?

  2. Is there a better way to handle this problem of getting the maximum salary of an employee?


Where do you call the ParseEmployee function? –

Answer


Spark SQL provides only a limited set of Encoders, which target concrete classes. Abstract classes like Number are not supported (they can be handled with the generic binary Encoders, with limitations).

Since you convert to Int anyway, just redefine the class:

case class Employee (
    Name: String, 
    JOBTITLE: String, 
    DEPARTMENT: String, 
    EMPLOYEEANNUALSALARY: Int 
) 
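One more detail worth noting: even after stripping "$" and ",", a salary such as "74628.00" still has a decimal part, which toInt cannot parse. A minimal sketch of a cleaning step that accounts for this (the helper name parseSalary is illustrative, not from the original code):

```scala
// Hypothetical helper: strips "$", ",", and surrounding whitespace,
// then goes through toDouble so a trailing ".00" does not break
// the conversion to Int.
def parseSalary(raw: String): Int = {
  val cleaned = raw.replace("$", "").replace(",", "").trim
  cleaned.toDouble.toInt
}

println(parseSalary("$ 74,628.00")) // 74628
```

With this, the Int field in the redefined case class can be populated directly from the raw CSV value.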

Given the code, 'BigDecimal' is a better fit for money, since it handles both integer and fractional amounts with full accuracy –
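Following that suggestion, a hedged sketch of how the salary field could be modeled with BigDecimal instead (the EmployeeBD name is illustrative only; Spark SQL encodes BigDecimal fields as a DecimalType):

```scala
// Illustrative variant of the case class using BigDecimal for money.
case class EmployeeBD(
    Name: String,
    JOBTITLE: String,
    DEPARTMENT: String,
    EMPLOYEEANNUALSALARY: BigDecimal
)

// The cleaning step keeps the decimal part intact instead of
// truncating it to an Int.
val salary = BigDecimal("$ 74,628.00".replaceAll("[$,\\s]", ""))
// salary == BigDecimal("74628.00")
```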
