使用Python的reduce（）加入多个PySpark数据框

有谁知道为什么在加入多个PySpark数据框时使用Python3的functools.reduce()会导致性能下降，而不是使用for循环迭代地加入相同的数据框？而这一个没有使用Python的reduce（）加入多个PySpark数据框

def join_dataframes(list_of_join_columns, left_df, right_df): 
    return left_df.join(right_df, on=list_of_join_columns) 

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of_dataframes, 
)

：

joined_df = list_of_dataframes[0] 
joined_df.cache() 
for right_df in list_of_dataframes[1:]: 
    joined_df = joined_df.join(right_df, on=list_of_join_columns)

任何想法将不胜感激具体而言，这给出了一个巨大的减速，接着存储器外的错误。谢谢！

来源

2017-07-07 Eric Smith

一个原因是减少或折叠通常在功能上是纯粹的：每个累积操作的结果不会写入内存的同一部分，而是写入新的内存块。

原则上垃圾收集器可以在每次累加后释放前一个块，但如果没有，则会为累加器的每个更新版本分配内存。

来源

2017-07-07 18:45:43 Alex

使用Python的reduce（）加入多个PySpark数据框

回答

相关问题