2016-04-03 92 views
2

Get the envelope, i.e. the overlapping time spans

I have a table of online sessions like this (the blank lines are only there for readability):

ip_address  | start_time       | stop_time
------------|------------------|------------------ 
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:12 
10.10.10.10 | 2016-04-02 08:11 | 2016-04-02 08:20 

10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:10 
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:08 
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:11 
10.10.10.10 | 2016-04-02 09:02 | 2016-04-02 09:15 
10.10.10.10 | 2016-04-02 09:10 | 2016-04-02 09:12 

10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07 
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11 

What I need are the "enveloping" online time spans:

ip_address  | full_start_time  | full_stop_time
------------|------------------|------------------ 
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:20 
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:15 
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11 

I have this query, which returns the desired result:

WITH t AS 
    -- Determine full time-range of each IP 
    (SELECT ip_address, MIN(start_time) AS min_start_time, MAX(stop_time) AS max_stop_time FROM IP_SESSIONS GROUP BY ip_address), 
t2 AS 
    -- compose ticks 
    (SELECT DISTINCT ip_address, min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE AS ts 
    FROM t 
    CONNECT BY min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE <= max_stop_time), 
t3 AS 
    -- get all "online" ticks 
    (SELECT DISTINCT ip_address, ts 
    FROM t2 
     JOIN IP_SESSIONS USING (ip_address) 
    WHERE ts BETWEEN start_time AND stop_time), 
t4 AS 
    (SELECT ip_address, ts, 
     LAG(ts) OVER (PARTITION BY ip_address ORDER BY ts) AS previous_ts 
    FROM t3), 
t5 AS 
    (SELECT ip_address, ts, 
     SUM(DECODE(previous_ts,NULL,1,0 + (CASE WHEN previous_ts + INTERVAL '1' MINUTE <> ts THEN 1 ELSE 0 END))) 
      OVER (PARTITION BY ip_address ORDER BY ts ROWS UNBOUNDED PRECEDING) session_no 
    FROM t4) 
SELECT ip_address, MIN(ts) AS full_start_time, MAX(ts) AS full_stop_time 
FROM t5 
GROUP BY ip_address, session_no 
ORDER BY 1,2; 

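However, I am worried about performance. The table has several million rows, and the time resolution is milliseconds (not one minute as in the example), so CTE t3 will be huge. Does anyone have a solution that avoids the self-join and the CONNECT BY?

A single smart analytic function would be great.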

Answers

3

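Try this one too. I tested it as best I could, and I believe it covers all the possibilities, including merging adjacent intervals (10:15 to 10:30 and 10:30 to 10:40 are merged into a single interval, 10:15 to 10:40). It should also be fairly fast, since it does not do much work.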

with m as 
     (
     select ip_address, start_time, 
        max(stop_time) over (partition by ip_address order by start_time 
          rows between unbounded preceding and 1 preceding) as m_time 
     from ip_sessions 
     union all 
     select ip_address, NULL, max(stop_time) from ip_sessions group by ip_address 
     ), 
    n as 
     (
     select ip_address, start_time, m_time 
     from m 
     where start_time > m_time or start_time is null or m_time is null 
     ), 
    f as 
     (
     select ip_address, start_time, 
      lead(m_time) over (partition by ip_address order by start_time) as stop_time 
     from n 
     ) 
select * from f where start_time is not null 
/
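
For readers who want to check the logic outside the database: the query above comes down to comparing each start_time with the largest stop_time seen on earlier rows of the same IP. Below is a minimal Python sketch of that idea, not a translation of the SQL. The names `merge_intervals` and `at` and the use of plain tuples are assumptions made for illustration.

```python
from datetime import datetime

def merge_intervals(intervals):
    """Merge overlapping or touching (start, stop) pairs into envelopes."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Starts inside (or exactly at the end of) the current envelope:
            # extend it to the larger stop time.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            # There is a gap before this start: open a new envelope.
            merged.append([start, stop])
    return [tuple(pair) for pair in merged]

def at(hhmm):
    """Demo helper (hypothetical): a time of day on 2016-04-02."""
    return datetime.strptime("2016-04-02 " + hhmm, "%Y-%m-%d %H:%M")

# Sessions of 10.10.10.10 from the question.
sessions = [
    (at("08:00"), at("08:12")), (at("08:11"), at("08:20")),
    (at("09:00"), at("09:10")), (at("09:05"), at("09:08")),
    (at("09:05"), at("09:11")), (at("09:02"), at("09:15")),
    (at("09:10"), at("09:12")),
]
for start, stop in merge_intervals(sessions):
    print(start, stop)
```

For this data the loop prints two envelopes, 08:00 to 08:20 and 09:00 to 09:15, which is what the question asks for. Because the comparison is `<=`, touching intervals such as 10:15 to 10:30 and 10:30 to 10:40 also end up in one envelope.
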
+0

Nice solution, I don't see any problems with it either.

+1

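@WernfriedDomscheit - In case you still care about this kind of problem, I found that Stew Ashton has a better solution on his blog. It should be about twice as fast as mine. https://stewashton.wordpress.com/2015/06/08/merging-overlapping-date-ranges/ – mathguy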

+0

Great approach. Yes, it should be faster because it does not contain the "UNION ALL". I will test it.

0

I think lag() with a cumulative sum will perform better:

select ip_address, min(start_time) as full_start_time, 
     max(stop_time) as full_stop_time 
from (select t.*, 
      sum(case when prev_et >= start_time then 0 else 1 end) over 
       (partition by ip_address order by start_time) as grp 
     from (select s.*, 
        lag(stop_time) over (partition by ip_address order by stop_time) as prev_et 
      from ip_sessions s) t 
      ) 
group by grp, ip_address 
order by 1, 2; 

It gives this result:

ip_address  | full_start_time  | full_stop_time
------------|------------------|------------------ 
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 09:15 
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:12 
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11 
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07 
+0

Doesn't work. IP 10.10.10.10 was offline from 08:20:01 to 08:59:59. IP 10.66.44.22 was online from 08:03 to 08:11. (I edited your answer to include the result of your query.)
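
One possible contributing factor, sketched in Python rather than SQL: comparing a row with the stop time of a single neighbouring row is not enough when a long session contains shorter ones, because the start has to be checked against the largest stop time seen so far. The sketch below counts groups both ways, with rows ordered by start time. It does not model the fact that the query orders lag() by the end time and the sum by the start time. `count_groups` and its `running_max` flag are hypothetical names, and the integer minutes are made-up data.

```python
def count_groups(intervals, running_max):
    """Count envelopes among (start, stop) pairs ordered by start.

    running_max=False compares each start with the previous row's stop only;
    running_max=True compares it with the largest stop seen so far.
    """
    groups = 0
    reference = None  # stop time that the next start is compared against
    for start, stop in sorted(intervals):
        if reference is None or start > reference:
            groups += 1  # nothing seen so far reaches this start
        if running_max and reference is not None:
            reference = max(reference, stop)
        else:
            reference = stop
    return groups

# Made-up data in minutes after 09:00: one long session with two short
# sessions nested inside it, so there is really a single envelope.
nested = [(0, 15), (2, 5), (10, 12)]
print(count_groups(nested, running_max=False))  # previous-row comparison
print(count_groups(nested, running_max=True))   # running-maximum comparison
```

With the previous-row comparison the session (10, 12) is counted as a new group because the row before it ended at minute 5, even though the first session is still running until minute 15. The running maximum keeps all three in one group.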

1

Please test this solution. It works for your example, but there may be cases I have not noticed. No CONNECT BY, no self-join.

with io as (
    select * from (
    select ip_address, t1, io, sum(io) over (partition by ip_address order by t1) sio 
     from (
     select ip_address, start_time t1, 1 io from ip_sessions 
     union all 
     select ip_address, stop_time, -1 io from ip_sessions)) 
    where (io = 1 and sio = 1) or (io = -1 and sio = 0)) 
select ip_address, t1, t2 
    from (
    select io.*, lead(t1) over (partition by ip_address order by t1) as t2 from io) 
    where io = 1 

Test data:

create table ip_sessions (ip_address varchar2(15), start_time date, stop_time date); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:00:00', timestamp '2016-04-02 08:12:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:11:00', timestamp '2016-04-02 08:20:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:00:00', timestamp '2016-04-02 09:10:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:05:00', timestamp '2016-04-02 09:08:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:02:00', timestamp '2016-04-02 09:15:00'); 
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:10:00', timestamp '2016-04-02 09:12:00'); 
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:05:00', timestamp '2016-04-02 08:07:00'); 
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:03:00', timestamp '2016-04-02 08:11:00'); 

Output:

IP_ADDRESS  T1                  T2
----------- ------------------- ------------------- 
10.10.10.10 2016-04-02 08:00:00 2016-04-02 08:20:00 
10.10.10.10 2016-04-02 09:00:00 2016-04-02 09:15:00 
10.66.44.22 2016-04-02 08:03:00 2016-04-02 08:11:00 
+0

It fails if you also insert this row: INSERT INTO IP_SESSIONS VALUES ('10.10.10.10', TIMESTAMP '2016-04-02 09:00:00', TIMESTAMP '2016-04-02 09:16:00');

+0

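...because in this case we have two sessions starting at 9:00. Change 'union all' to 'union' on the third line (this may hurt performance), or add "rows between unbounded preceding and current row" to the window clause.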

+0

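UNION instead of UNION ALL will not work: if there are two intervals, from 9:00 to 9:12 and from 9:00 to 9:15, you will pick the shorter one and miss the part from 9:12 to 9:15. A suggestion: try not to use the same name (io) for a table and a column. One more point: this solution may get the case of 9:00 to 9:12 followed by 9:12 to 9:18 wrong; presumably the result should be 9:00 to 9:18. A possible fix: in the definition of sio, in the over clause, change the ordering to "order by t1, io desc". – mathguy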
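
The two tie-related problems raised in these comments (two sessions starting at the same instant, and one session ending exactly when the next one begins) can be reproduced outside the database. Below is a small Python sketch of the same +1/-1 sweep, assuming every pair has start < stop. `sweep` and `starts_first` are names invented here; the row-by-row running total plays the role of a "rows between unbounded preceding and current row" window, and `starts_first=True` mimics "order by t1, io desc".

```python
def sweep(intervals, starts_first=True):
    """Merge (start, stop) pairs by sweeping over +1/-1 events.

    starts_first=True sorts a start before a stop at the same timestamp,
    so touching intervals are merged; False sorts the stop first.
    Assumes start < stop for every pair.
    """
    events = [(start, 1) for start, _ in intervals]
    events += [(stop, -1) for _, stop in intervals]
    # Tie-break on the second key: -io puts +1 first, io puts -1 first.
    events.sort(key=lambda e: (e[0], -e[1] if starts_first else e[1]))

    envelopes = []
    envelope_start = None
    online = 0  # number of sessions currently open
    for ts, io in events:
        if online == 0 and io == 1:
            envelope_start = ts  # first session of an envelope opens
        online += io             # running sum, one event at a time
        if online == 0:
            envelopes.append((envelope_start, ts))  # last session closed
    return envelopes

# Made-up data in minutes after 09:00.
touching = [(0, 12), (12, 18)]
same_start = [(0, 10), (0, 16)]
print(sweep(touching, starts_first=True))
print(sweep(touching, starts_first=False))
print(sweep(same_start))
```

With starts first, the touching pair comes out as one envelope from 0 to 18; with stops first it is split at 12. The two sessions that both start at 0 produce a single envelope from 0 to 16, because the running total is advanced one event at a time instead of once per timestamp.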

0

In the end I wrote a function that meets my requirements. I think it works along the same lines as Ponder Stibbons' answer.

CREATE OR REPLACE TYPE SESSION_REC AS OBJECT (START_TIME TIMESTAMP_UNCONSTRAINED, STOP_TIME TIMESTAMP_UNCONSTRAINED); 
CREATE OR REPLACE TYPE SESSION_TYPE AS TABLE OF SESSION_REC; 
CREATE OR REPLACE TYPE TIMESTAMP_TAB AS TABLE OF TIMESTAMP_UNCONSTRAINED; 

CREATE OR REPLACE FUNCTION ENVELOP_SESSIONS(v_ipaddress IN VARCHAR2) 
    RETURN SESSION_TYPE PIPELINED IS 

    rec SESSION_REC; 
    startTimes TIMESTAMP_TAB; 
    stopTimes TIMESTAMP_TAB; 

    TYPE ActionRecType IS RECORD (TS TIMESTAMP_UNCONSTRAINED, ACTION INTEGER); 
    TYPE ActionTableType IS TABLE OF ActionRecType; 
    actions ActionTableType; 
    onlineCount INTEGER := 0; 

BEGIN 

    SELECT START_TIME, STOP_TIME 
    BULK COLLECT INTO startTimes, stopTimes 
    FROM IP_SESSIONS 
    WHERE IP_ADDRESS = v_ipaddress; 

    WITH t AS 
     (SELECT COLUMN_VALUE AS ts, 1 AS action 
     FROM TABLE(startTimes) 
     UNION ALL 
     SELECT COLUMN_VALUE AS ts, -1 AS action 
     FROM TABLE(stopTimes)) 
    SELECT ts, action 
    BULK COLLECT INTO actions 
    FROM t 
    ORDER BY ts, action; 

    IF actions.COUNT > 0 THEN 
     FOR i IN actions.FIRST..actions.LAST LOOP  
      IF onlineCount = 0 AND actions(i).ACTION = 1 THEN 
       -- session starts 
       rec := SESSION_REC(actions(i).TS, NULL); 
      ELSIF onlineCount = 1 AND actions(i).ACTION = -1 THEN 
       -- session ends 
       rec := SESSION_REC(rec.START_TIME, actions(i).TS); 
       PIPE ROW(rec); 
      END IF; 
      onlineCount := onlineCount + actions(i).ACTION; 
     END LOOP;  
    END IF; 
    RETURN;  

END ENVELOP_SESSIONS;