[Py]Spark SQL: multi-column sessionization

Given a positive long i and the DataFrame

+-----+--+--+               
|group|n1|n2|                
+-----+--+--+                
| 1| 0| 0|                
| 1| 1| 1|                
| 1| 1| 5|                
| 1| 2| 2|                
| 1| 2| 6|                
| 1| 3| 3|                
| 1| 3| 7|                
| 1| 4| 4|                
| 1| 5| 1|                
| 1| 5| 5|                
+-----+--+--+

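How would you sessionize the rows within the same group so that, for each pair of consecutive rows r1, r2 in a session, r2.n1 > r1.n1, r2.n2 > r1.n2, and max(r2.n1 - r1.n1, r2.n2 - r1.n2) < i? Note that the n1 and n2 values may not be unique, which means that the rows making up a session may not be contiguous in the DataFrame.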
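To make the condition concrete, here is a minimal plain-Python sketch of the test for a single pair of rows. The name `can_follow` and the `(n1, n2)` tuple layout are assumptions made for illustration; they do not appear in the question.

```python
from typing import Tuple

Row = Tuple[int, int]  # (n1, n2); hypothetical alias used only in this sketch


def can_follow(r1: Row, r2: Row, i: int) -> bool:
    """Return True if r2 may directly follow r1 in a session for the limit i."""
    d1 = r2[0] - r1[0]
    d2 = r2[1] - r1[1]
    # Both columns must strictly increase, and the larger step must stay below i.
    return d1 > 0 and d2 > 0 and max(d1, d2) < i
```

With i = 3, `can_follow((0, 0), (2, 2), 3)` holds, while `can_follow((1, 1), (1, 5), 3)` does not, because n1 does not increase.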

As an example, for the given DataFrame and i = 3 the result would be

+-----+--+--+-------+ 
|group|n1|n2|session| 
+-----+--+--+-------+ 
| 1| 0| 0|  1| 
| 1| 1| 1|  1| 
| 1| 1| 5|  2| 
| 1| 2| 2|  1| 
| 1| 2| 6|  2| 
| 1| 3| 3|  1| 
| 1| 3| 7|  2| 
| 1| 4| 4|  1| 
| 1| 5| 1|  3| 
| 1| 5| 5|  1| 
+-----+--+--+-------+

Any help or hints would be appreciated. Thanks!

Source

2017-09-24 alan

It looks like you want to label every connected component of a graph with the same number. A good solution is to use graphframes: https://graphframes.github.io/quick-start.html
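Before going to Spark, the connected-components idea can be checked on the ten example rows in plain Python. This is only a small-scale sketch using a union-find structure, not the graphframes algorithm; `linked`, `find`, and the variable names are made up for illustration.

```python
from itertools import combinations

rows = [(0, 0), (1, 1), (1, 5), (2, 2), (2, 6),
        (3, 3), (3, 7), (4, 4), (5, 1), (5, 5)]
i = 3


def linked(a, b, i):
    """True if one of the two rows can directly follow the other."""
    (a1, a2), (b1, b2) = sorted([a, b])  # put the row with the smaller n1 first
    return b1 > a1 and b2 > a2 and max(b1 - a1, b2 - a2) < i


# Union-find: every row starts as the root of its own component.
parent = {r: r for r in rows}


def find(r):
    # Follow parent pointers to the root, shortening the path on the way.
    while parent[r] != r:
        parent[r] = parent[parent[r]]
        r = parent[r]
    return r


# Merge the components of every pair of rows joined by an edge.
for a, b in combinations(rows, 2):
    if linked(a, b, i):
        parent[find(a)] = find(b)

# Number the components in order of first appearance, as in the question's table.
session_of_root = {}
sessions = {}
for r in rows:
    root = find(r)
    session_of_root.setdefault(root, len(session_of_root) + 1)
    sessions[r] = session_of_root[root]

print([sessions[r] for r in rows])  # [1, 1, 2, 1, 2, 1, 2, 1, 3, 1]
```

The pairwise loop is quadratic in the number of rows, so this is only meant as a sanity check on small groups.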

Starting from your DataFrame:

df = sc.parallelize([[1, 0, 0],[1, 1, 1],[1, 1, 5],[1, 2, 2],[1, 2, 6], 
        [1, 3, 3],[1, 3, 7],[1, 4, 4],[1, 5, 1],[1, 5, 5]]).toDF(["group","n1","n2"])

We'll create a vertex DataFrame containing the list of unique ids:

import pyspark.sql.functions as psf 
v = df.select(psf.struct("n1", "n2").alias("id"), "group") 

    +-----+-----+ 
    | id|group| 
    +-----+-----+ 
    |[0,0]| 1| 
    |[1,1]| 1| 
    |[1,5]| 1| 
    |[2,2]| 1| 
    |[2,6]| 1| 
    |[3,3]| 1| 
    |[3,7]| 1| 
    |[4,4]| 1| 
    |[5,1]| 1| 
    |[5,5]| 1| 
    +-----+-----+

And an edge DataFrame defined from the boolean conditions you stated:

i = 3 
e = df.alias("r1").join(
    df.alias("r2"), 
    (psf.col("r1.group") == psf.col("r2.group")) 
    & (psf.col("r1.n1") < psf.col("r2.n1")) 
    & (psf.col("r1.n2") < psf.col("r2.n2")) 
    & (psf.greatest(
     psf.col("r2.n1") - psf.col("r1.n1"), 
     psf.col("r2.n2") - psf.col("r1.n2")) < i) 
).select(psf.struct("r1.n1", "r1.n2").alias("src"), psf.struct("r2.n1", "r2.n2").alias("dst")) 

    +-----+-----+ 
    | src| dst| 
    +-----+-----+ 
    |[0,0]|[1,1]| 
    |[0,0]|[2,2]| 
    |[1,1]|[2,2]| 
    |[1,1]|[3,3]| 
    |[1,5]|[2,6]| 
    |[1,5]|[3,7]| 
    |[2,2]|[3,3]| 
    |[2,2]|[4,4]| 
    |[2,6]|[3,7]| 
    |[3,3]|[4,4]| 
    |[3,3]|[5,5]| 
    |[4,4]|[5,5]| 
    +-----+-----+

Now find all the connected components:

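from graphframes import GraphFrame
# Recent GraphFrames versions need a checkpoint directory for the default
# connectedComponents algorithm; the path below is only a placeholder.
sc.setCheckpointDir("/tmp/graphframes-checkpoints")
g = GraphFrame(v, e)
res = g.connectedComponents()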

    +-----+-----+------------+ 
    | id|group| component| 
    +-----+-----+------------+ 
    |[0,0]| 1|309237645312| 
    |[1,1]| 1|309237645312| 
    |[1,5]| 1| 85899345920| 
    |[2,2]| 1|309237645312| 
    |[2,6]| 1| 85899345920| 
    |[3,3]| 1|309237645312| 
    |[3,7]| 1| 85899345920| 
    |[4,4]| 1|309237645312| 
    |[5,1]| 1|292057776128| 
    |[5,5]| 1|309237645312| 
    +-----+-----+------------+
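The component ids are arbitrary longs, whereas the question asks for sessions numbered 1, 2, 3. One way to get there is to renumber the components in order of first appearance. The sketch below does this in plain Python on the table above, as if it had been collected to the driver; `first_appearance_labels` is a hypothetical helper, and on a large dataset the same idea would more likely be expressed with a Spark window function such as `dense_rank`.

```python
# (id, component) pairs copied from the table above.
components = [
    ((0, 0), 309237645312), ((1, 1), 309237645312), ((1, 5), 85899345920),
    ((2, 2), 309237645312), ((2, 6), 85899345920), ((3, 3), 309237645312),
    ((3, 7), 85899345920), ((4, 4), 309237645312), ((5, 1), 292057776128),
    ((5, 5), 309237645312),
]


def first_appearance_labels(pairs):
    """Map each component id to 1, 2, 3, ... in the order it first appears."""
    labels = {}
    out = []
    for row_id, comp in pairs:
        labels.setdefault(comp, len(labels) + 1)
        out.append((row_id, labels[comp]))
    return out


sessions = first_appearance_labels(components)
print([s for _, s in sessions])  # [1, 1, 2, 1, 2, 1, 2, 1, 3, 1]
```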

Source

2017-09-25 19:15:39 MaFF

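Thank you, @Marie. Finding the connected components did the trick! Unfortunately, the performance is terrible - it took more than a minute to find the solution for the example I provided! Do you know why that might be? – alan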

It runs as many iterations as the length of the longest branch. You can try tweaking your graph so that it runs faster. – MaFF

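Thanks for the suggestion. Unfortunately, even with the graph tweaked, the poor runtime performance is still a problem for my production dataset. – alan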
