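Bad optimization/planning on Postgres window-based queries (partition by(, group by?)) - 1000x speedup
We are running Postgres 9.3.5 (07/2014). We have a fairly complex data warehouse/reporting setup in place (ETL, materialized views, indexing, aggregations, analytical functions, ...).
What I have just found may be difficult to implement in the optimizer (?), but it makes a huge difference in performance. (The following is only sample code that closely resembles our real query, trimmed to avoid unnecessary complexity.)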
create view foo as
select
  sum(s.plan) over w_pyl as pyl_plan, -- money planned to spend in this pot/loc/year
  sum(s.booked) over w_pyl as pyl_booked, -- money already booked in this pot/loc/year
  -- money already booked in this pot/loc the years before (stored as sum already)
  last_value(s.booked_prev_years) over w_pl as pl_booked_prev_years,
  -- update 2014-10-08: maybe the following additional selected columns
  -- may be implementation-/test-relevant since they could potentially be determined
  -- by sorting within the partition:
  min(s.id) over w_pyl,
  max(s.id) over w_pyl,
  -- ... anything could follow here ...
  x.*,
  s.*
from
  pot_location_year x -- may be some materialized view or (cache/regular) table
  left outer join spendings s
    on (s.pot = x.pot and s.loc = x.loc and s.year = x.year)
window
  w_pyl as (partition by x.pot, x.year, x.loc),
  w_pl as (partition by x.pot, x.loc order by x.year);
We have these two relevant indexes in place:
pot_location_year_idx__p_y_l -- on pot, year, loc
pot_location_year_idx__p_l_y -- on pot, loc, year
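For reference, the two indexes could be defined roughly as follows. This is only a sketch: the index names and column orders come from the list above, while the default btree type and the absence of any further options are assumptions, since the actual definitions are not shown.

```sql
-- Assumed definitions (plain btree, no extra options); only the names and
-- column orders are taken from the question.
create index pot_location_year_idx__p_y_l
    on pot_location_year (pot, year, loc);

create index pot_location_year_idx__p_l_y
    on pot_location_year (pot, loc, year);
```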
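Now we run an explain for a test query:
explain select * from foo fetch first 100 rows only
This shows very bad performance: the p_y_l index (pot, year, loc) is used, so the result set has to be sorted unnecessarily twice :-( (the outermost WindowAgg/Sort step sorts by pot, loc, year because that is required for our last_value(..) as pl_booked_prev_years):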
Limit  (cost=289687.87..289692.12 rows=100 width=512)
  ->  WindowAgg  (cost=289687.87..292714.85 rows=93138 width=408)
        ->  Sort  (cost=289687.87..289920.71 rows=93138 width=408)
              Sort Key: x.pot, x.loc, x.year
              ->  WindowAgg  (cost=1.25..282000.68 rows=93138 width=408)
                    ->  Nested Loop Left Join  (cost=1.25..278508.01 rows=93138 width=408)
                          Join Filter: ...
                          ->  Nested Loop Left Join  (cost=0.83..214569.60 rows=93138 width=392)
                                ->  Index Scan using pot_location_year_idx__p_y_l on pot_location_year x  (cost=0.42..11665.49 rows=93138 width=306)
                                ->  Index Scan using ...  (cost=0.41..2.17 rows=1 width=140)
                                      Index Cond: ...
                          ->  Index Scan using ...  (cost=0.41..0.67 rows=1 width=126)
                                Index Cond: ...
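So the obvious problem is that the planner should choose the existing p_l_y index (pot, loc, year) instead, so that it does not have to sort twice.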
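One experiment that might be worth running is to list the PARTITION BY columns of w_pyl in the same order as the p_l_y index. PARTITION BY only determines which rows belong together, so reordering its columns does not change the window results. The sketch below is a trimmed-down variant of the view query, not the real one, and it assumes the planner derives its sort order from the order in which the partition columns are written. Whether it then picks pot_location_year_idx__p_l_y and drops the extra Sort step would have to be checked in the resulting plan.

```sql
-- Sketch only: same join and windows as in the view, but w_pyl now lists
-- its partition columns as (pot, loc, year) to line up with w_pl and with
-- the p_l_y index. The select list is shortened for readability.
explain
select
  sum(s.plan)   over w_pyl as pyl_plan,
  sum(s.booked) over w_pyl as pyl_booked,
  last_value(s.booked_prev_years) over w_pl as pl_booked_prev_years
from
  pot_location_year x
  left outer join spendings s
    on (s.pot = x.pot and s.loc = x.loc and s.year = x.year)
window
  w_pyl as (partition by x.pot, x.loc, x.year),
  w_pl  as (partition by x.pot, x.loc order by x.year)
fetch first 100 rows only;
```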
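I might add that we have migrated some databases from Oracle to Postgres, and on Oracle this problem/query seemed (!) to be no issue. (I know there are many other factors that influence planning and execution.) – 2014-10-07 13:49:01
This might be worth raising on the pgsql-performance mailing list. That said, I am not entirely sure that (a,b,c) and (c,a,b) are semantically identical when partitioning a window. – 2014-10-07 13:55:15
Just did that: http://postgresql.1045698.n5.nabble.com/Bad-optimization-planning-on-Postgres-window-based-queries-partition-by-group-by-1000x-speedup-td5822190.html – 2014-10-08 06:35:51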