Different aggregation operations on different columns in PySpark

I want to apply different aggregation functions to different columns of a PySpark DataFrame. Following some suggestions, I tried this:

the_columns1 = ["product1", "product2"]
the_columns2 = ["customer1", "customer2"]

exprs = [mean(col(d)) for d in the_columns1, count(col(c)) for c in the_columns2]

followed by

df.groupby(*group).agg(*exprs)

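where "group" is a column that is not present in either the_columns1 or the_columns2. This does not work. How can I apply different aggregation functions to different columns?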


2017-11-04 user3490622

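You are very close already. Instead of putting the two comprehensions inside a single list, add them together so that you end up with one flat list of expressions: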

exprs = [mean(col(d)) for d in the_columns1] + [count(col(c)) for c in the_columns2]
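To see why the concatenation matters, here is a minimal sketch that uses plain strings as stand-ins for the Column expressions, so it runs without Spark. In Python 3, writing two comprehensions separated by a comma inside one pair of brackets (as in the question) is a syntax error, and wrapping them as a list of two lists would not help either, because `agg(*exprs)` expects each unpacked argument to be a single expression. The names `means`, `counts`, `nested` and `flat` are made up for illustration.

```python
# Stand-in strings instead of real Column objects (an assumption made so the
# sketch runs without a Spark session).
the_columns1 = ["product1", "product2"]
the_columns2 = ["customer1", "customer2"]

means = [f"mean({d})" for d in the_columns1]
counts = [f"count({c})" for c in the_columns2]

nested = [means, counts]  # a list of two lists: *nested unpacks to 2 list arguments
flat = means + counts     # one flat list: *flat unpacks to 4 separate expressions

print(len(nested), len(flat))  # 2 4
print(flat)
```

With real Column objects the same `+` concatenation yields the flat list that `agg(*exprs)` can unpack.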

Here is a demo:

import pyspark.sql.functions as F

df.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1|  2|  1|
|  1|  2|  2|  2|
|  2|  3|  3|  3|
|  2|  4|  3|  4|
+---+---+---+---+

cols = ['b']
cols2 = ['c', 'd']

exprs = [F.mean(F.col(x)) for x in cols] + [F.count(F.col(x)) for x in cols2]

df.groupBy('a').agg(*exprs).show()
+---+------+--------+--------+
|  a|avg(b)|count(c)|count(d)|
+---+------+--------+--------+
|  1|   1.5|       2|       2|
|  2|   3.5|       2|       2|
+---+------+--------+--------+
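As a possible alternative to the flat list of expressions, `agg` on grouped data also accepts a dict that maps a column name to the name of an aggregate function. The sketch below only builds such a dict from the two column lists used in the demo. It does not start Spark, and the helper name `build_agg_spec` is hypothetical.

```python
def build_agg_spec(mean_cols, count_cols):
    """Map each column name to the name of the aggregate to apply to it."""
    spec = {c: "mean" for c in mean_cols}
    # A dict holds one entry per key, so a column listed in both groups
    # would keep only its "count" entry.
    spec.update({c: "count" for c in count_cols})
    return spec


cols = ["b"]
cols2 = ["c", "d"]

spec = build_agg_spec(cols, cols2)
print(spec)  # {'b': 'mean', 'c': 'count', 'd': 'count'}
```

With a DataFrame like the one in the demo, this would be passed as `df.groupBy('a').agg(spec)`. The list-of-expressions form is the more flexible of the two, since it allows several aggregates on the same column and lets each result be renamed with `.alias(...)`.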


2017-11-04 01:42:03 Psidom
