Spark数据集agg方法

我正在使用Spark和dataSet API创建一些分析数据集。我得分手在那里我calcuating一些变量，它看起来是这样的：Spark数据集agg方法

CntDstCdrs1.groupByKey(x => (x.bs_recordid, x.bs_utcdate)).agg(
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_1" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_1day"), 
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_3" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_3day_cust"), 
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_5" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_5day_cust"), 
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_7" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_7day_cust") 
).show()

此代码工作正常，但是当我尝试添加一个计数变量“count_phone_30day”我得到一个错误.. “方法重载...” 这意味着dataSet上的agg方法签名最多需要4个Column表达式？无论如何，如果这种方法不是计算大量变量的最佳实践，那么哪一个会是？我有数，统计不同的总和等变量。

KR，斯特凡

来源

2017-09-23 StStojanovic

的'方法overloaded'错误可能是别的东西造成的，如'agg'上'Dataset'可以采取比4路以上在'when'条件下聚合函数。 –

@LeoC它可以，但在关系'groupBy'中，键值'groupByKey'具有其他实现 –

Dataset.groupByKey回报KeyValueGroupedDataset。

这个类没有agg可变参数 - 你可以只提供4列作为参数

来源

2017-09-24 17:31:02

Spark数据集agg方法

回答

相关问题