PostgreSQL 9.6.4: timestamp range query on a large table takes forever

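I need some help analyzing the poor performance of a query executed on a large table containing 83,660,142 rows. Depending on the system load, it takes from 25 minutes to more than an hour to compute.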

I created the following table, which has a composite key and three additional indexes:

CREATE TABLE IF NOT EXISTS records(
userid INT DEFAULT 0, 
clientid VARCHAR(255) DEFAULT '', 
ts TIMESTAMP, 
site VARCHAR(50) DEFAULT '', 
code VARCHAR(400) DEFAULT ''); 

CREATE UNIQUE INDEX IF NOT EXISTS primary_idx ON records (userid, clientid, ts, site, code); 
CREATE INDEX IF NOT EXISTS userid_idx ON records (userid); 
CREATE INDEX IF NOT EXISTS ts_idx ON records (ts); 
CREATE INDEX IF NOT EXISTS userid_ts_idx ON records (userid ASC,ts DESC);

In a Spring Batch application, I execute a query like the following:

SELECT * 
    FROM records 
WHERE userid = ANY(VALUES (2), ..., (96158 more userids)) 
    AND (ts < '2017-09-02' AND ts >= '2017-09-01' 
     OR ts < '2017-08-26' AND ts >= '2017-08-25' 
     OR ts < '2017-08-19' AND ts >= '2017-08-18' 
     OR ts < '2017-08-12' AND ts >= '2017-08-11')

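The user IDs are determined at runtime (the number of IDs is between 95,000 and 110,000). For each user I need to extract the page views for the current day and for the same weekday in each of the previous three weeks. The query always returns between 3 and 4 million rows.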

Running the query with EXPLAIN ANALYZE returns the following execution plan:

Nested Loop (cost=1483.40..1246386.43 rows=3761735 width=70) (actual time=108.856..1465501.596 rows=3643240 loops=1)
  -> HashAggregate (cost=1442.38..1444.38 rows=200 width=4) (actual time=33.277..201.819 rows=96159 loops=1)
       Group Key: "*VALUES*".column1
       -> Values Scan on "*VALUES*" (cost=0.00..1201.99 rows=96159 width=4) (actual time=0.006..11.599 rows=96159 loops=1)
  -> Bitmap Heap Scan on records (cost=41.02..6224.01 rows=70 width=70) (actual time=8.865..15.218 rows=38 loops=96159)
       Recheck Cond: (userid = "*VALUES*".column1)
       Filter: (((ts < '2017-09-02 00:00:00'::timestamp without time zone) AND (ts >= '2017-09-01 00:00:00'::timestamp without time zone)) OR ((ts < '2017-08-26 00:00:00'::timestamp without time zone) AND (ts >= '2017-08-25 00:00:00'::timestamp without time zone)) OR ((ts < '2017-08-19 00:00:00'::timestamp without time zone) AND (ts >= '2017-08-18 00:00:00'::timestamp without time zone)) OR ((ts < '2017-08-12 00:00:00'::timestamp without time zone) AND (ts >= '2017-08-11 00:00:00'::timestamp without time zone)))
       Rows Removed by Filter: 792
       Heap Blocks: exact=77251145
       -> Bitmap Index Scan on userid_ts_idx (cost=0.00..41.00 rows=1660 width=0) (actual time=6.593..6.593 rows=830 loops=96159)
            Index Cond: (userid = "*VALUES*".column1)

I have already adjusted some of the Postgres tuning parameters (unfortunately without success):

effective_cache_size = 15GB (probably useless, since the query is executed only once)
shared_buffers = 15GB
work_mem = 3GB

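The application runs computationally expensive tasks (e.g. data fusion/data injection) and consumes about 100GB of memory, so the hardware is sized generously with 125GB RAM and 16 cores (OS: Debian).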

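I wonder why Postgres does not use the combined index userid_ts_idx in its execution plan. Since the timestamp column in the index is sorted in reverse order, I would expect Postgres to use it to find the matching tuples for the range part of the query: it could walk the index sequentially until the condition ts < '2017-09-02 00:00:00' becomes true and then return all values until the condition ts >= '2017-09-01 00:00:00' no longer holds. Instead, Postgres uses the expensive Bitmap Heap Scan which, if I understand correctly, does a linear table scan. Did I misconfigure the database settings, or do I have a conceptual misunderstanding?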
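Note that the plan above does use userid_ts_idx, but only with userid as the index condition; the OR'ed ts ranges are applied afterwards as a filter on the heap rows. One query shape in which each range can serve as an index condition is one branch per day combined with UNION ALL. The following is a minimal sketch, not a tested fix: the temporary table wanted_ids is a hypothetical helper that does not appear in the original post, the two IDs are placeholders, and whether this is faster on this data would have to be measured.

```sql
-- Hypothetical helper table for the IDs determined at runtime (not in the original post).
CREATE TEMP TABLE wanted_ids (id INT PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (2), (3);  -- placeholder IDs; load the real list here
ANALYZE wanted_ids;  -- lets the planner see the real row count

-- One branch per day, so each ts range can be combined with userid
-- as a condition on the (userid, ts) index.
-- The ranges do not overlap, so UNION ALL cannot produce duplicates.
SELECT r.* FROM records r JOIN wanted_ids w ON w.id = r.userid
WHERE r.ts >= '2017-09-01' AND r.ts < '2017-09-02'
UNION ALL
SELECT r.* FROM records r JOIN wanted_ids w ON w.id = r.userid
WHERE r.ts >= '2017-08-25' AND r.ts < '2017-08-26'
UNION ALL
SELECT r.* FROM records r JOIN wanted_ids w ON w.id = r.userid
WHERE r.ts >= '2017-08-18' AND r.ts < '2017-08-19'
UNION ALL
SELECT r.* FROM records r JOIN wanted_ids w ON w.id = r.userid
WHERE r.ts >= '2017-08-11' AND r.ts < '2017-08-12';
```

Running this under EXPLAIN would show whether the ts bounds now appear under Index Cond rather than Filter.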

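Update

The CTE suggested in the comments unfortunately did not bring any improvement. The Bitmap Heap Scan has been replaced by a Sequential Scan, but performance is still poor. Here is the updated execution plan: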

Merge Join (cost=20564929.37..20575876.60 rows=685277 width=106) (actual time=2218133.229..2222280.192 rows=3907472 loops=1)
  Merge Cond: (ids.id = r.userid)
  Buffers: shared hit=2408684 read=181785
  CTE ids
    -> Values Scan on "*VALUES*" (cost=0.00..1289.70 rows=103176 width=4) (actual time=0.002..28.670 rows=103176 loops=1)
  CTE ts
    -> Values Scan on "*VALUES*_1" (cost=0.00..0.05 rows=4 width=32) (actual time=0.002..0.004 rows=4 loops=1)
  -> Sort (cost=10655.37..10913.31 rows=103176 width=4) (actual time=68.476..83.312 rows=103176 loops=1)
       Sort Key: ids.id
       Sort Method: quicksort Memory: 7909kB
       -> CTE Scan on ids (cost=0.00..2063.52 rows=103176 width=4) (actual time=0.007..47.868 rows=103176 loops=1)
  -> Sort (cost=20552984.25..20554773.54 rows=715717 width=102) (actual time=2218059.941..2221230.585 rows=8085760 loops=1)
       Sort Key: r.userid
       Sort Method: quicksort Memory: 1410084kB
       Buffers: shared hit=2408684 read=181785
       -> Nested Loop (cost=0.00..20483384.24 rows=715717 width=102) (actual time=885849.043..2214665.723 rows=8085767 loops=1)
            Join Filter: (ts.r @> r.ts)
            Rows Removed by Join Filter: 707630821
            Buffers: shared hit=2408684 read=181785
            -> Seq Scan on records r (cost=0.00..4379760.52 rows=178929152 width=70) (actual time=0.024..645616.135 rows=178929147 loops=1)
                 Buffers: shared hit=2408684 read=181785
            -> CTE Scan on ts (cost=0.00..0.08 rows=4 width=32) (actual time=0.000..0.000 rows=4 loops=178929147)
Planning time: 126.110 ms
Execution time: 2222514.566 ms
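
The query behind this plan is not shown. Judging only from the node names in the plan (CTE ids, CTE ts, and the join filter ts.r @> r.ts), it may have looked roughly like the sketch below. This is a reconstruction under that assumption, not the original query, and the ID list is a placeholder.

```sql
-- Reconstruction from the plan's node names; not the original query.
WITH ids(id) AS (
    VALUES (2), (3)  -- placeholder IDs standing in for the runtime list
),
ts(r) AS (
    -- tsrange defaults to '[)' bounds: lower inclusive, upper exclusive
    VALUES (tsrange('2017-09-01', '2017-09-02')),
           (tsrange('2017-08-25', '2017-08-26')),
           (tsrange('2017-08-18', '2017-08-19')),
           (tsrange('2017-08-11', '2017-08-12'))
)
SELECT r.*
FROM records r
JOIN ts  ON ts.r @> r.ts       -- range containment, shown as Join Filter in the plan
JOIN ids ON ids.id = r.userid; -- shown as Merge Cond in the plan
```

A containment test with the range on the left and the column on the right cannot be answered by a btree index on ts, and in 9.6 a CTE is always materialized separately from the main query. Both points are consistent with the Seq Scan over all 178,929,147 rows seen in the plan.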

Source

2017-10-09 user35934

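Is there any chance you are limited by disk reads? Next time please use 'EXPLAIN (ANALYZE, BUFFERS)'. That will give you some insight into buffer usage. As we can see, the heap scan consumes most of the time. https://explain.depesz.com/s/wJBk – filiprem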
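
For reference, the suggested invocation applied to the query from the question would look roughly like this. The two IDs are placeholders for the elided list; everything else is taken from the question.

```sql
-- BUFFERS adds per-node "Buffers: shared hit=... read=..." lines to the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM records
WHERE userid = ANY (VALUES (2), (3))  -- placeholder IDs
  AND (ts >= '2017-09-01' AND ts < '2017-09-02'
    OR ts >= '2017-08-25' AND ts < '2017-08-26'
    OR ts >= '2017-08-18' AND ts < '2017-08-19'
    OR ts >= '2017-08-11' AND ts < '2017-08-12');
```

In the output, "hit" counts pages found in shared_buffers and "read" counts pages that had to be requested from outside it, which helps judge how much of the runtime is spent on I/O.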

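Thanks for the hint... According to *iotop* the disk reads fluctuate between 1 and 7 M/s, but I also see some peaks at 17 M/s. What level would count as good performance? – user35934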

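With several ORs in that set of date ranges, I don't think it will "hold" the most recent date and walk backward to the earliest given date. The filter indicates that it processes each range: '(... AND ...) OR (... AND ...) OR (... AND ...) OR (... AND ...)' –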

You should get a different plan if you cast the timestamp to a date and filter on a list of values instead.

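-- userid_ts_idx already exists in the question's schema, so with IF NOT EXISTS this
-- statement would be skipped; drop the old index first or choose a different name.
CREATE INDEX IF NOT EXISTS userid_ts_idx ON records (userid ASC, cast(ts AS date) DESC);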

SELECT * 
    FROM records 
WHERE userid = ANY(VALUES (2), ..., (96158 more userids)) 
    AND cast(ts AS date) IN('2017-09-01','2017-08-25','2017-08-18','2017-08-11');

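Whether this performs better depends on your data and date ranges. In my case I found that Postgres kept using the index even when the date values covered the whole table (where a sequential scan would have been better).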
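
The cast-based predicate selects whole calendar days, which is the same set of rows as the four half-open ranges in the question as long as every range spans exactly one full day. A small self-contained check on a few hand-picked sample timestamps (chosen for illustration, not taken from the table):

```sql
-- Sample timestamps are illustrative only; they are not rows from the records table.
SELECT t.ts,
       cast(t.ts AS date) IN ('2017-09-01', '2017-08-25', '2017-08-18', '2017-08-11') AS by_date,
       (   t.ts >= '2017-09-01' AND t.ts < '2017-09-02'
        OR t.ts >= '2017-08-25' AND t.ts < '2017-08-26'
        OR t.ts >= '2017-08-18' AND t.ts < '2017-08-19'
        OR t.ts >= '2017-08-11' AND t.ts < '2017-08-12') AS by_range
FROM (VALUES (timestamp '2017-09-01 00:00:00'),   -- first instant of a wanted day
             (timestamp '2017-09-01 23:59:59'),   -- late on a wanted day
             (timestamp '2017-09-02 00:00:00'),   -- first instant of the following day
             (timestamp '2017-08-11 13:00:00')) AS t(ts);
```

The by_date and by_range columns should agree on every row. The two forms only diverge once a range no longer covers a whole day, for example a window from 13:00 to 05:00.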

Demo

Source

2017-10-10 06:43:36

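Casting the timestamp to a date is not an option, because the data has hourly granularity and this query would filter out a large part of the result set. – user35934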

So you really do something like this? 'ts < '2017-08-12 05:00:00' AND ts >= '2017-08-11 13:00:00'' –

Sorry, I didn't realize that the part 'IN ('2017-09-01', '2017-08-25', '2017-08-18', '2017-08-11')' filters on the ranges. I will test your solution. – user35934
