优化对执行 generate_series() 的大表的查询

以下查询在 PostgreSQL 11.1 中需要 7 分钟以上：

SELECT 
'2019-01-19' as date, 
'2019-01-19'::date - generate_series(first_observed, last_observed, interval '1 day')::date as days_to_date, 
ROUND(AVG(price)) as price,
area_id
FROM 
table_example
GROUP BY 
days_to_date, area_id;

table_example大约有1500 万行。
有什么方法可以优化它吗？我已经添加了以下索引：

CREATE INDEX ON table_example (first_observed, last_observed);
CREATE INDEX ON table_example (area_id);

这是从EXPLAIN (ANALYZE,BUFFERS)输出：

GroupAggregate  (cost=3235559683.68..3377398628.68 rows=1418000 width=72) (actual time=334933.966..440096.869 rows=21688 loops=1)
Group Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
->  Sort  (cost=3235559683.68..3271009671.18 rows=14179995000 width=40) (actual time=334923.933..391690.184 rows=380203171 loops=1)
Sort Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
Sort Method: external merge  Disk: 9187584kB
Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
->  Result  (cost=0.00..390387079.39 rows=14179995000 width=40) (actual time=214.798..171717.941 rows=380203171 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
->  ProjectSet  (cost=0.00..71337191.89 rows=14179995000 width=44) (actual time=214.796..102823.749 rows=380203171 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
->  Seq Scan on table_example  (cost=0.00..259966.95 rows=14179995 width=44) (actual time=0.031..2449.511 rows=14179995 loops=1)
Buffers: local read=118167 dirtied=118167 written=117143
Planning Time: 0.409 ms
JIT:
Functions: 18
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.034 ms, Inlining 13.010 ms, Optimization 121.440 ms, Emission 79.996 ms, Total 219.480 ms
Execution Time: 441133.410 ms

这是table_example的样子：

column name        data type
'house_pk'         'integer'    
'date_in'          'date'   
'first_observed'   'date'   
'last_observed'    'date'   
'price'            'numeric'    
'area_id'          'integer'

有60个不同的area_ids。

查询在具有 128 GB 内存的多核计算机(24 个内核)上运行。但是，设置可能不是最佳的。

在处理整个表时，索引通常是无用的(如果表行比索引宽得多，则可能只有索引扫描除外)。

在处理整个表时，我没有看到查询本身性能优化的空间。一件小事：

SELECT d.the_date
,generate_series(d.the_date - last_observed
, d.the_date - first_observed) AS days_to_date
, round(avg(price)) AS price
, area_id
FROM   table_example
, (SELECT date '2019-01-19') AS d(the_date)
GROUP  BY days_to_date, area_id;

假设first_observed和last_observed是date NOT NULL的，并且总是< date '2019-01-19'。否则你需要投射/做更多。

这样，您只有两个减法，然后generate_series()处理整数(最快)。

添加的迷你子查询只是为了方便起见，只提供一次日期。在预准备语句或函数中，您可以使用参数，但不需要这个：

, (SELECT date '2019-01-19') AS d(the_date)

除此之外，如果EXPLAIN (ANALYZE, BUFFERS)提到"磁盘"(例如：Sort Method: external merge Disk: 3240kB)，那么work_mem的(临时)更高设置应该会有所帮助。看：

Linux 上的 PostgreSQL 中的配置参数work_mem
使用按日期和文本优化简单查询

如果您负担不起更多的 RAM，并且聚合和/或排序步骤仍然溢出到磁盘，那么使用LATERAL连接之类的查询进行划分和征服可能会有所帮助：

SELECT d.the_date, f.*, a.area_id
FROM   area a
, (SELECT date '2019-01-19') AS d(the_date)
, LATERAL (
SELECT generate_series(d.the_date - last_observed
, d.the_date - first_observed) AS days_to_date
, round(avg(price)) AS price
FROM   table_example
WHERE  area_id = a.area_id
GROUP  BY 1
) f;

假设一张桌子area，显然。

相关内容

最新更新

热门标签：