0

Converting a Spark Dataset to a custom object in Scala: I want to call an API that expects an Employee object, as shown below:

public class EmployeeElements {
    private Set<Long> eIds;
    private Map<Long, List<EmployeeDetails>> employeeDetails;
    private Map<Long, List<Address>> address;
    // getters and setters omitted
}

I want to create objects of the form `Map[Long, List[CustomObject]]` from Datasets.

As input we have the following DataFrames.

EmployeeDF 

+---------+-------------------+
|emp_Id   |hired_Date         |
+---------+-------------------+
|1387779  |2016-09-27 19:47:28|
|1387781  |2016-09-27 19:47:28|
|1387780  |2016-09-27 19:47:28|
|1387780  |2016-09-27 19:47:28|
|1387777  |2016-09-27 19:47:28|
|1387778  |2016-09-27 19:47:28|
+---------+-------------------+

EmployeeDetailsDF: 

+---------+---------+--------+----------+-------------------+
|emp_Id   |firstname|lastname|gendercode|dateofbirth        |
+---------+---------+--------+----------+-------------------+
|1387777  |Jon      |Snow    |F         |1985-01-01 00:00:00|
|1387778  |Jon      |Snow    |M         |1985-01-01 00:00:00|
|1387779  |Jon      |Snow    |F         |1985-01-01 00:00:00|
|1387780  |Jon      |Snow    |F         |1985-01-01 00:00:00|
|1387781  |Jon      |Snow    |F         |1985-01-01 00:00:00|
+---------+---------+--------+----------+-------------------+

AddressDf: 

+---------+------------------+-----------------+-------------------+
|emp_Id   |patient_address_id|Country          |joined_Date        |
+---------+------------------+-----------------+-------------------+
|1387779  |2435146           |USA              |2016-09-27 19:47:28|
|1387781  |2435148           |AUS              |2016-09-27 19:47:28|
|1387780  |2435147           |USA              |2016-09-27 19:47:28|
|1387780  |2435149           |UK               |2016-09-27 19:47:28|
|1387777  |2435144           |AUS              |2016-09-27 19:47:28|
|1387778  |2435145           |USA              |2016-09-27 19:47:28|
+---------+------------------+-----------------+-------------------+

EmployeeDetailsDF will be joined with EmployeeDF. Similarly, AddressDf will also be joined with EmployeeDF.
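In collection terms, both joins key on emp_Id. A minimal sketch with tuples standing in for DataFrame rows (the real code would be something like `EmployeeDF.join(AddressDf, "emp_Id")`; the values below are taken from the sample tables):

```scala
// Tuples stand in for DataFrame rows: (emp_Id, hired_Date) and
// (emp_Id, patient_address_id, Country).
val employees = List(
  (1387777L, "2016-09-27 19:47:28"),
  (1387780L, "2016-09-27 19:47:28")
)
val addresses = List(
  (1387777L, 2435144L, "AUS"),
  (1387780L, 2435147L, "USA"),
  (1387780L, 2435149L, "UK")
)

// Inner join on emp_Id, mirroring EmployeeDF.join(AddressDf, "emp_Id") in Spark.
val joined = for {
  (eId, hired)           <- employees
  (aId, addrId, country) <- addresses
  if aId == eId
} yield (eId, hired, addrId, country)
```

Note that emp_Id 1387780 appears twice in the joined result, one row per matching address.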

Now, out of those joined DataFrames, I want to create a `Map[Long,List[CustomObject]]`. For example, here are a couple of rows of the joined DataFrame of AddressDf and EmployeeDF:

(1387777,List([1387777,2435144,AUS])) 
(1387780,List([1387780,2435147,USA], [1387780,2435149,UK])) 
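The grouping shown above can be sketched with plain Scala collections; `Address` below is a hypothetical case class mirroring the AddressDf columns, and a grouped Spark Dataset produces the same per-key shape:

```scala
// Hypothetical case class mirroring one AddressDf row.
case class Address(emp_Id: Long, patient_address_id: Long, country: String)

val rows = List(
  Address(1387777L, 2435144L, "AUS"),
  Address(1387780L, 2435147L, "USA"),
  Address(1387780L, 2435149L, "UK")
)

// Each emp_Id mapped to the list of its rows: the target Map[Long, List[Address]].
val byEmp: Map[Long, List[Address]] = rows.groupBy(_.emp_Id)
// byEmp(1387780L).map(_.country) == List("USA", "UK")
```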

When I try to cast groupAddressDs to a Map:

// empAddressDs is the joined Dataset of AddressDf and EmployeeDF
val groupAddressDs: GroupedDataset[Long, Address] = empAddressDs.groupBy { x => x.emp_Id }
val addressMap = groupAddressDs.asInstanceOf[Map[String, List[Procedure]]]

I get this exception:

Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.GroupedDataset cannot be cast to scala.collection.immutable.Map

I have not found any solution for this. I tried working with the GroupedDataset, but in the end I could not cast it to a Map object; it throws the ClassCastException. Then I tried with a pairRDD, but I was unable to convert the RDD rows back into my custom objects.

I have to process millions of records and hundreds of custom objects inside the EmployeeElements object.

+0

Are you trying to marshal millions of records into a single EmployeeElements object? What do you ultimately need to do with EmployeeElements? – josephpconley

+0

@josephpconley: I want to pass the EmployeeElements object to an external API. – Kalpesh

Answers

0
Instead of calling asInstanceOf on the grouped Dataset, execute the mapGroups function on it.
For example (this assumes EmployeeElements exposes setEIds/setAddress setters, and uses JavaConversions helpers to turn the Scala collections into the Java types the class declares):

groupAddressDs.mapGroups { case (k, xs) =>
  val addresses = xs.toList  // materialize the iterator once; it cannot be traversed twice
  val employeeElements = new EmployeeElements()
  employeeElements.setAddress(mapAsJavaMap(Map(k -> seqAsJavaList(addresses))))
  employeeElements.setEIds(setAsJavaSet(Set(k)))
  (k, addresses)
}
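If the final `Map[Long, List[...]]` has to live on the driver (for example, to pass the result to the external API mentioned in the comments), the key/value pairs produced by mapGroups can be collected and converted; that last step is plain Scala. The sample pairs below are illustrative stand-ins for the result of collecting the mapped Dataset:

```scala
// Stand-in for the Array of pairs returned by Dataset.collect().
val pairs = Array(
  (1387777L, List("AUS")),
  (1387780L, List("USA", "UK"))
)

// toMap turns the collected pairs into the desired driver-side Map.
val addressMap: Map[Long, List[String]] = pairs.toMap
```

Collecting is only safe when the grouped result fits in driver memory; with millions of records it is usually better to push the per-group work into mapGroups itself.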