2017-04-10 90 views
0

给定一个间隔表,我可以有效地查询每个间隔开始时的当前打开间隔数(包括当前间隔本身)吗?BigQuery中重叠间隔的数量

例如,下表给出:

 
start_time end_time 
     1  10 
     2  5 
     3  4 
     5  6 
     7  11 
     19  20 

我想下面的输出:

 
start_time count 
     1  1 
     2  2 
     3  3 
     5  3 
     7  2 
     19  1 

在小数据集,我可以对自己加入该数据集解决这个问题:

WITH intervals AS (
    SELECT 1 AS start, 10 AS end UNION ALL 
    SELECT 2, 5 UNION ALL 
    SELECT 3, 4 UNION ALL 
    SELECT 5, 6 UNION ALL 
    SELECT 7, 11 UNION ALL 
    SELECT 19, 20 
) 
SELECT 
    a.start_time, 
    count(*) 
FROM 
    intervals a CROSS JOIN intervals b 
WHERE 
    a.start_time >= b.start_time AND 
    a.start_time <= b.end_time 
GROUP BY a.start_time 
ORDER BY a.start_time 

对于大型数据集,CROSS JOIN既不切实际也不必要,因为ny给出答案只取决于少数前面的区间(当按start_time排序时)。事实上,在我拥有的数据集中,它超时。有没有更好的方法来实现这一目标?

+0

u能解释输出虚拟数据? – Teja

+0

输出是从输入开始的每个时间间隔的开始时间以及在该时间间隔的开始时间的开放时间间隔(开始时间<=那个时间和结束时间> =那个时间)的计数。 –

回答

1

... CROSS JOIN既不切实际也不必要...
有没有更好的方法来实现这个目标?

请尝试下面的BigQuery标准SQL。没有加入参与

#standardSQL 
SELECT 
    start_time, 
    (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt 
FROM (
    SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends 
    FROM intervals 
) 
-- ORDER BY start_time 

您可以测试/使用下面的例子发挥它从你的问题

#standardSQL 
WITH intervals AS (
    SELECT 1 AS start_time, 10 AS end_time UNION ALL 
    SELECT 2, 5 UNION ALL 
    SELECT 3, 4 UNION ALL 
    SELECT 5, 6 UNION ALL 
    SELECT 7, 11 UNION ALL 
    SELECT 19, 20 
) 
SELECT 
    start_time, 
    (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt 
FROM (
    SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends 
    FROM intervals 
) 
-- ORDER BY start_time 
+0

@BrandonDuRette - 你有机会尝试吗? –