
Problem statement

I have a table "event_statistics", defined as follows:

CREATE TABLE public.event_statistics (
    id int4 NOT NULL DEFAULT nextval('event_statistics_id_seq'::regclass), 
    client_id int4 NULL, 
    session_id int4 NULL, 
    action_name text NULL, 
    value text NULL, 
    product_id int8 NULL, 
    product_options jsonb NOT NULL DEFAULT '{}'::jsonb, 
    url text NULL, 
    url_options jsonb NOT NULL DEFAULT '{}'::jsonb, 
    visit int4 NULL DEFAULT 0, 
    date_update timestamptz NULL, 
CONSTRAINT event_statistics_pkey PRIMARY KEY (id), 
CONSTRAINT event_statistics_client_id_session_id_sessions_client_id_id_for 
FOREIGN KEY (client_id, session_id) 
-- the referenced table was garbled in the post; reconstructed from the constraint name: 
REFERENCES sessions (client_id, id) ON DELETE CASCADE ON UPDATE CASCADE 
) 
WITH (
    OIDS=FALSE 
) ; 
CREATE INDEX regdate ON public.event_statistics (date_update timestamptz_ops) ; 

And the table "clients":

CREATE TABLE public.clients (
    id int4 NOT NULL DEFAULT nextval('clients_id_seq'::regclass), 
    client_name text NULL, 
    client_hash text NULL, 
CONSTRAINT clients_pkey PRIMARY KEY (id) 
) 
WITH (
    OIDS=FALSE 
) ; 
CREATE INDEX clients_client_name_idx ON public.clients (client_name text_ops) ; 

What I need is a count of the events in the "event_statistics" table for each "action_name" type, over a specific "date_update" range, grouped by "action_name" and by a given time step, with every time step present in the output, for a specific client.

The goal is to show statistics for all relevant events on each client's dashboard on our website. There is an option to choose the reporting date range, and the time step should depend on the chart's time span, like this (a sketch of the selection rule in SQL follows the list):

  • current day: counts per hour;
  • more than 1 day and <= 1 month: counts per day;
  • more than 1 month and <= 6 months: counts per week;
  • more than 6 months: counts per month.
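
A minimal sketch of that bucket-selection rule in SQL (date_from/date_to stand for the user-chosen report range; the sample values are illustrative):

SELECT CASE 
         WHEN date_to - date_from <= interval '1 day'    THEN 'hour' 
         WHEN date_to - date_from <= interval '1 month'  THEN 'day' 
         WHEN date_to - date_from <= interval '6 months' THEN 'week' 
         ELSE 'month' 
       END AS step 
FROM (VALUES (now() - interval '3 months', now())) AS r(date_from, date_to); 
-- The chosen unit plugs into date_trunc(step, date_update) and into 
-- generate_series(date_from, date_to, ('1 ' || step)::interval). 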

What I did:

SELECT t.date, A.actionName, count(E.id) 
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as t(date) 
cross join 
(values 
('page_open'), 
('product_add'), 
('product_buy'), 
('product_event'), 
('product_favourite'), 
('product_open'), 
('product_share'), 
('session_start')) as A(actionName) 
left join 
(select action_name, date_trunc('day', e.date_update) as dateTime, e.id 
from event_statistics as e 
where e.client_id = (select id from clients as c 
                     where c.client_name = 'client name') and 
(date_update between (current_date - interval '1 week') and now())) E 
on t.date = E.dateTime and A.actionName = E.action_name 
group by A.actionName, t.date 
order by A.actionName, t.date; 

It takes too long, more than 10 seconds, just to count the last week's events by type and day. I need it to perform the same operation faster, for the different grouping intervals (hours over the current day, days over a month, weeks, months) and over longer time periods.

The query plan:

GroupAggregate (cost=171937.16..188106.84 rows=1600 width=44) 
    Group Key: "*VALUES*".column1, t.date 
    InitPlan 1 (returns $0) 
    -> Seq Scan on clients c (cost=0.00..1.07 rows=1 width=4) 
      Filter: (client_name = 'client name'::text) 
    -> Merge Left Join (cost=171936.08..183784.31 rows=574060 width=44) 
     Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date =(date_trunc('day'::text, e.date_update)))) 
     -> Sort (cost=628.77..648.77 rows=8000 width=40) 
       Sort Key: "*VALUES*".column1, t.date 
       -> Nested Loop (cost=0.02..110.14 rows=8000 width=40) 
        -> Function Scan on generate_series t (cost=0.02..10.02 rows=1000 width=8) 
        -> Materialize (cost=0.00..0.14 rows=8 width=32) 
          -> Values Scan on "*VALUES*" (cost=0.00..0.10 rows=8 width=32) 
     -> Materialize (cost=171307.32..171881.38 rows=114812 width=24) 
       -> Sort (cost=171307.32..171594.35 rows=114812 width=24) 
        Sort Key: e.action_name, (date_trunc('day'::text, e.date_update)) 
        -> Index Scan using regdate on event_statistics e (cost=0.57..159302.49 rows=114812 width=24) 
          Index Cond: ((date_update > (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= now())) 
          Filter: (client_id = $0) 

The "event_statistics" table has more than 50 million rows; it will only grow as clients are added, and the records are never modified.

I have tried many different query plans and indexes, but cannot reach an acceptable speed when aggregating over wider date ranges. I have spent a whole week studying different aspects of this problem and approaches to it on Stack Overflow and in some blogs, but I am still not sure which is best:

  • partitioning by client_id or by date range;
  • pre-aggregating into a separate results table and updating it daily (I am also unsure how best to do that: triggers on inserts into the raw table, a materialized view refreshed by a scheduled application, or refreshes driven by requests from the website; see the sketch after this list);
  • changing the DB schema design to one schema per client, or applying sharding;
  • upgrading the server hardware (CPU Intel Xeon E7-4850 2.00GHz, RAM 6GB; it hosts both the web application and the database);
  • using a different database with OLAP capabilities for the analytics, such as Postgres-XL, or something else?
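
If the pre-aggregation option above were chosen, a minimal sketch could look like the following; the view name and the daily granularity are illustrative, and the coarser steps (week, month) can be rolled up from the daily rows:

CREATE MATERIALIZED VIEW event_statistics_daily AS 
SELECT client_id, 
       action_name, 
       date_trunc('day', date_update) AS day, 
       count(*) AS event_count 
FROM event_statistics 
GROUP BY 1, 2, 3; 

-- A unique index is required for REFRESH ... CONCURRENTLY: 
CREATE UNIQUE INDEX ON event_statistics_daily (client_id, action_name, day); 

-- Refresh on a schedule (e.g. nightly via cron) without blocking readers: 
REFRESH MATERIALIZED VIEW CONCURRENTLY event_statistics_daily; 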

I have also tried a btree index on event_statistics (client_id asc, action_name asc, date_update asc, id). Index-only scans were faster, but still not fast enough, and the index is also costly in terms of disk space.
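
For reference, that covering index would be declared roughly like this (the index name is illustrative):

CREATE INDEX event_statistics_covering_idx 
    ON event_statistics (client_id, action_name, date_update, id); 
-- Enables index-only scans for this query shape, at a considerable disk-space cost. 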

What is the best way to solve this problem?

Update

As requested, the output of explain (analyze, verbose):

GroupAggregate (cost=860934.44..969228.46 rows=1600 width=44) (actual time=52388.678..54671.187 rows=64 loops=1) 
    Output: t.date, "*VALUES*".column1, count(e.id) 
    Group Key: "*VALUES*".column1, t.date 
    InitPlan 1 (returns $0) 
    -> Seq Scan on public.clients c (cost=0.00..1.07 rows=1 width=4) (actual time=0.058..0.059 rows=1 loops=1) 
      Output: c.id 
      Filter: (c.client_name = 'client name'::text) 
      Rows Removed by Filter: 5 
    -> Merge Left Join (cost=860933.36..940229.77 rows=3864215 width=44) (actual time=52388.649..54388.698 rows=799737 loops=1) 
     Output: t.date, "*VALUES*".column1, e.id 
     Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update)))) 
     -> Sort (cost=628.77..648.77 rows=8000 width=40) (actual time=0.190..0.244 rows=64 loops=1) 
       Output: t.date, "*VALUES*".column1 
       Sort Key: "*VALUES*".column1, t.date 
       Sort Method: quicksort Memory: 30kB 
       -> Nested Loop (cost=0.02..110.14 rows=8000 width=40) (actual time=0.059..0.080 rows=64 loops=1) 
        Output: t.date, "*VALUES*".column1 
        -> Function Scan on pg_catalog.generate_series t (cost=0.02..10.02 rows=1000 width=8) (actual time=0.043..0.043 rows=8 loops=1) 
          Output: t.date 
          Function Call: generate_series(((('now'::cstring)::date - '7 days'::interval))::timestamp with time zone, now(), '1 day'::interval) 
        -> Materialize (cost=0.00..0.14 rows=8 width=32) (actual time=0.002..0.003 rows=8 loops=8) 
          Output: "*VALUES*".column1 
          -> Values Scan on "*VALUES*" (cost=0.00..0.10 rows=8 width=32) (actual time=0.004..0.005 rows=8 loops=1) 
           Output: "*VALUES*".column1 
     -> Materialize (cost=860304.60..864168.81 rows=772843 width=24) (actual time=52388.441..54053.748 rows=799720 loops=1) 
       Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update)) 
       -> Sort (cost=860304.60..862236.70 rows=772843 width=24) (actual time=52388.432..53703.531 rows=799720 loops=1) 
        Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update)) 
        Sort Key: e.action_name, (date_trunc('day'::text, e.date_update)) 
        Sort Method: external merge Disk: 39080kB 
        -> Index Scan using regdate on public.event_statistics e (cost=0.57..753018.26 rows=772843 width=24) (actual time=31.423..44284.363 rows=799720 loops=1) 
          Output: e.id, e.date_update, e.action_name, date_trunc('day'::text, e.date_update) 
          Index Cond: ((e.date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (e.date_update <= now())) 
          Filter: (e.client_id = $0) 
          Rows Removed by Filter: 2983424 
Planning time: 7.278 ms 
Execution time: 54708.041 ms 

The pain seems to be in sorting on the low-cardinality text column action_name. (Personally, I would prefer a numeric action_id here.) Also: the (func) calendar and the (values) action_name pseudo-tables have no optimizer hooks (indexes, statistics) to work with; I would materialize both of them into (TEMP) tables. – wildplasser


Thanks for the tips. Yes, the problem seems to be the slow external sort on disk and having to read data for all clients. But for some reason I cannot eliminate the need for the sort, even with a covering index, as I wrote at the end of the post. Only after I increased "work_mem" enough was an in-memory sort used; with it such an index was much faster, but it is still not enough because reading the "event_statistics" table is slow. – atikeen
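
(For context, "work_mem" is a per-session setting; a minimal sketch, with an illustrative value:)

-- Let the sort fit in memory (quicksort) instead of spilling to an external merge on disk: 
SET work_mem = '64MB'; 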


You could pre-aggregate in a subquery, IMO. It would not yield more than 1600 aggregate rows. – wildplasser

Answer


Step one: pre-aggregate in a subquery:


EXPLAIN 
SELECT cal.theday, act.action_name, SUM(sub.the_count) 
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as cal(theday) -- calendar pseudo-table 
CROSS JOIN (VALUES 
     ('page_open') 
     , ('product_add') , ('product_buy') , ('product_event') 
     , ('product_favourite') , ('product_open') , ('product_share') , ('session_start') 
     ) AS act(action_name) 
LEFT JOIN (
     SELECT es.action_name, date_trunc('day',es.date_update) as theday 
       , COUNT(DISTINCT es.id) AS the_count 
     FROM event_statistics as es 
     WHERE es.client_id = (SELECT c.id FROM clients AS c 
         WHERE c.client_name = 'client name') 
     AND (es.date_update BETWEEN (current_date - interval '1 week') AND now()) 
     GROUP BY 1,2 
     ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name 
GROUP BY act.action_name,cal.theday 
ORDER BY act.action_name,cal.theday 
     ; 

Next step: put the VALUES into a CTE and reference it in the aggregating subquery. (The gain depends on the number of action names that can be skipped.)


EXPLAIN 
WITH act(action_name) AS (VALUES 
     ('page_open') 
     , ('product_add') , ('product_buy') , ('product_event') 
     , ('product_favourite') , ('product_open') , ('product_share') , ('session_start') 
     ) 
SELECT cal.theday, act.action_name, SUM(sub.the_count) 
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday) 
CROSS JOIN act 
LEFT JOIN (
     SELECT es.action_name, date_trunc('day',es.date_update) AS theday 
       , COUNT(DISTINCT es.id) AS the_count 
     FROM event_statistics AS es 
     WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now() 
     AND EXISTS (SELECT * FROM clients cli WHERE cli.id= es.client_id AND cli.client_name = 'client name') 
     AND EXISTS (SELECT * FROM act WHERE act.action_name = es.action_name) 
     GROUP BY 1,2 
     ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name 
GROUP BY act.action_name,cal.theday 
ORDER BY act.action_name,cal.theday 
     ; 

Update: using a physical (TEMP) table will lead to better estimates.


-- Final attempt: materialize the Cartesian product (timeseries*action_name) 
    -- into a temp table 
CREATE TEMP TABLE grid AS 
(SELECT act.action_name, cal.theday 
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') 
    AS cal(theday) 
CROSS JOIN 
    (VALUES ('page_open') 
     , ('product_add') , ('product_buy') , ('product_event') 
     , ('product_favourite') , ('product_open') , ('product_share') , ('session_start') 
     ) act(action_name) 
    ); 
CREATE UNIQUE INDEX ON grid(action_name, theday); 

    -- Index will force statistics to be collected 
    -- ,and will generate better estimates for the numbers of rows 
CREATE INDEX iii ON event_statistics (action_name, date_update) ; 
VACUUM ANALYZE grid; 
VACUUM ANALYZE event_statistics; 

EXPLAIN 
SELECT grid.action_name, grid.theday, SUM(sub.the_count) AS the_count 
FROM grid 
LEFT JOIN (
     SELECT es.action_name, date_trunc('day',es.date_update) AS theday 
       , COUNT(*) AS the_count 
     FROM event_statistics AS es 
     WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now() 
     AND EXISTS (SELECT * FROM clients cli WHERE cli.id= es.client_id AND cli.client_name = 'client name') 
     -- AND EXISTS (SELECT * FROM grid WHERE grid.action_name = es.action_name) 
     GROUP BY 1,2 
     ORDER BY 1,2 --nonsense! 
     ) sub ON grid.theday = sub.theday AND grid.action_name = sub.action_name 
GROUP BY grid.action_name,grid.theday 
ORDER BY grid.action_name,grid.theday 
     ; 

Update #3 (sorry, I created the indexes on the base table(s) here; you will need to edit that. I also removed a cast on the timestamp column.)


-- attempt #4: 
    -- - materialize the Cartesian product (timeseries*action_name) 
    -- - sanitize the date-interval logic 

CREATE TEMP TABLE grid AS 
(SELECT act.action_name, cal.theday::date 
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') 
    AS cal(theday) 
CROSS JOIN 
    (VALUES ('page_open') 
     , ('product_add') , ('product_buy') , ('product_event') 
     , ('product_favourite') , ('product_open') , ('product_share') , ('session_start') 
     ) act(action_name) 
    ); 

    -- Index will force statistics to be collected 
    -- ,and will generate better estimates for the numbers of rows 
-- CREATE UNIQUE INDEX ON grid(action_name, theday); 
-- CREATE INDEX iii ON event_statistics (action_name, date_update) ; 
CREATE UNIQUE INDEX ON grid(theday, action_name); 
CREATE INDEX iii ON event_statistics (date_update, action_name) ; 
VACUUM ANALYZE grid; 
VACUUM ANALYZE event_statistics; 

EXPLAIN 
SELECT gr.action_name, gr.theday 
      , COUNT(*) AS the_count 
FROM grid gr 
LEFT JOIN event_statistics AS es 
    ON es.action_name = gr.action_name 
    AND date_trunc('day',es.date_update)::date = gr.theday 
    AND es.date_update BETWEEN (current_date - interval '1 week') AND current_date 
JOIN clients cli ON cli.id= es.client_id AND cli.client_name = 'client name' 
GROUP BY gr.action_name,gr.theday 
ORDER BY 1,2 
     ; 

                 QUERY PLAN                   
---------------------------------------------------------------------------------------------------------------------------------------------------------- 
GroupAggregate (cost=8.33..8.35 rows=1 width=17) 
    Group Key: gr.action_name, gr.theday 
    -> Sort (cost=8.33..8.34 rows=1 width=17) 
     Sort Key: gr.action_name, gr.theday 
     -> Nested Loop (cost=1.40..8.33 rows=1 width=17) 
       -> Nested Loop (cost=1.31..7.78 rows=1 width=40) 
        Join Filter: (es.client_id = cli.id) 
        -> Index Scan using clients_client_name_key on clients cli (cost=0.09..2.30 rows=1 width=4) 
          Index Cond: (client_name = 'client name'::text) 
        -> Bitmap Heap Scan on event_statistics es (cost=1.22..5.45 rows=5 width=44) 
          Recheck Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date)) 
          -> Bitmap Index Scan on iii (cost=0.00..1.22 rows=5 width=0) 
           Index Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date)) 
       -> Index Only Scan using grid_theday_action_name_idx on grid gr (cost=0.09..0.54 rows=1 width=17) 
        Index Cond: ((theday = (date_trunc('day'::text, es.date_update))::date) AND (action_name = es.action_name)) 
(15 rows) 

I tested your solutions and checked the query plans. Thank you for the effort, but unfortunately the queries you propose run a bit slower than mine (+1-2 seconds), because they perform more checks and joins. I cannot benefit from the filter on "action_name", because I need to aggregate over all the values that may occur in the "event_statistics" table; I listed them manually as VALUES because selecting the distinct "action_name" values from the table is very, very slow. – atikeen


Try a dimension table action_names {id, action_name}, replacing the text action_name by a numeric "action_id integer not null" FOREIGN KEY referencing action_names.id. And add some usable (composite) indexes. – joop
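
(A rough sketch of that dimension-table idea; all names are illustrative:)

CREATE TABLE action_names ( 
    id          serial PRIMARY KEY, 
    action_name text NOT NULL UNIQUE 
); 

-- One-time migration: collect the distinct names, add the narrow key column: 
INSERT INTO action_names (action_name) 
SELECT DISTINCT action_name FROM event_statistics; 

ALTER TABLE event_statistics 
    ADD COLUMN action_id integer REFERENCES action_names (id); 

-- Backfill the new key, then enforce NOT NULL once every row is mapped: 
UPDATE event_statistics es 
SET action_id = an.id 
FROM action_names an 
WHERE an.action_name = es.action_name; 

ALTER TABLE event_statistics ALTER COLUMN action_id SET NOT NULL; 

-- A usable composite index on the narrow integer key: 
CREATE INDEX ON event_statistics (client_id, action_id, date_update); 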


Still slow. Indexing the temp table would help when there are many timestamp and action-name combinations, but under the current conditions it makes no difference; otherwise I would do it. I also tried the index on (action_name, date_update) earlier, but it did not help even though I ran VACUUM ANALYZE; the optimizer almost always chooses a seq. scan over that index. – atikeen