PostgreSQL查询速度是可变的



上下文

我有一个表来保存网络流量数据(路由器截获的所有数据包)。该表目前约有590万行。

问题

我正在尝试一个简单的查询来计算每天收到的数据包数量,这应该不会花很长时间。

第一次运行它时,查询需要88秒,第二次运行后,33秒秒

主要问题不在于查询的速度,而是在执行同一查询3次后,速度快了近20倍
我理解查询缓存的概念,但是原始查询运行的性能对我来说毫无意义。

测试

我用来加入的列(日期时间)的类型是timestamptz,并且被索引为:

CREATE INDEX date ON netflows USING btree (datetime);

查看EXPLAIN语句。执行上的差异在于Nested Loop

我已经VACUUM ANALYZE了,得到了完全相同的结果。

当前环境

  • Linux Ubuntu 12.04 VM在VMware ESX 4.1上运行
  • PostgreSQL 9.1
  • 虚拟机有2 GB RAM,2个内核
  • 数据库服务器完全专用于此,不做任何其他事情
  • 每分钟在表中插入一次(每分钟100行)
  • 磁盘、ram或cpu活动非常低

查询

with date_list as (
select
series as start_date,
series + '23:59:59' as end_date
from
generate_series(
(select min(datetime) from netflows)::date, 
(select max(datetime) from netflows)::date, 
'1 day') as series
)
select
start_date,
end_date,
count(*)
from
netflows
inner join date_list on (datetime between start_date and end_date)
group by
start_date,
end_date;

解释第一次跑步(88秒)

Sort  (cost=27007355.59..27007356.09 rows=200 width=8) (actual time=89647.054..89647.055 rows=18 loops=1) 
Sort Key: date_list.start_date 
Sort Method: quicksort  Memory: 25kB 
CTE date_list 
->  Function Scan on generate_series series  (cost=0.13..12.63 rows=1000 width=8) (actual time=92.567..92.667 rows=19 loops=1) 
InitPlan 2 (returns $1) 
->  Result  (cost=0.05..0.06 rows=1 width=0) (actual time=71.270..71.270 rows=1 loops=1) 
InitPlan 1 (returns $0) 
->  Limit  (cost=0.00..0.05 rows=1 width=8) (actual time=71.259..71.261 rows=1 loops=1) 
->  Index Scan using date on netflows  (cost=0.00..303662.15 rows=5945591 width=8) (actual time=71.252..71.252 rows=1 loops=1) 
Index Cond: (datetime IS NOT NULL) 
InitPlan 4 (returns $3) 
->  Result  (cost=0.05..0.06 rows=1 width=0) (actual time=11.786..11.787 rows=1 loops=1) 
InitPlan 3 (returns $2) 
->  Limit  (cost=0.00..0.05 rows=1 width=8) (actual time=11.778..11.779 rows=1 loops=1) 
->  Index Scan Backward using date on netflows  (cost=0.00..303662.15 rows=5945591 width=8) (actual time=11.776..11.776 rows=1 loops=1) 
Index Cond: (datetime IS NOT NULL) 
->  HashAggregate  (cost=27007333.31..27007335.31 rows=200 width=8) (actual time=89639.167..89639.179 rows=18 loops=1) 
->  Nested Loop  (cost=0.00..23704227.20 rows=660621222 width=8) (actual time=92.667..88059.576 rows=5945457 loops=1) 
->  CTE Scan on date_list  (cost=0.00..20.00 rows=1000 width=16) (actual time=92.578..92.785 rows=19 loops=1) 
->  Index Scan using date on netflows  (cost=0.00..13794.89 rows=660621 width=8) (actual time=2.438..4571.884 rows=312919 loops=19) 
Index Cond: ((datetime >= date_list.start_date) AND (datetime <= date_list.end_date)) 
Total runtime: 89668.047 ms 

第三次跑步解释(5秒)

Sort  (cost=27011357.45..27011357.95 rows=200 width=8) (actual time=5645.031..5645.032 rows=18 loops=1) 
Sort Key: date_list.start_date 
Sort Method: quicksort  Memory: 25kB 
CTE date_list 
->  Function Scan on generate_series series  (cost=0.13..12.63 rows=1000 width=8) (actual time=0.108..0.204 rows=19 loops=1) 
InitPlan 2 (returns $1) 
->  Result  (cost=0.05..0.06 rows=1 width=0) (actual time=0.050..0.050 rows=1 loops=1) 
InitPlan 1 (returns $0) 
->  Limit  (cost=0.00..0.05 rows=1 width=8) (actual time=0.046..0.046 rows=1 loops=1) 
->  Index Scan using date on netflows  (cost=0.00..303705.14 rows=5946469 width=8) (actual time=0.046..0.046 rows=1 loops=1) 
Index Cond: (datetime IS NOT NULL) 
InitPlan 4 (returns $3) 
->  Result  (cost=0.05..0.06 rows=1 width=0) (actual time=0.026..0.026 rows=1 loops=1) 
InitPlan 3 (returns $2) 
->  Limit  (cost=0.00..0.05 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1) 
->  Index Scan Backward using date on netflows  (cost=0.00..303705.14 rows=5946469 width=8) (actual time=0.026..0.026 rows=1 loops=1) 
Index Cond: (datetime IS NOT NULL) 
->  HashAggregate  (cost=27011335.17..27011337.17 rows=200 width=8) (actual time=5645.005..5645.009 rows=18 loops=1) 
->  Nested Loop  (cost=0.00..23707741.28 rows=660718778 width=8) (actual time=0.134..4176.406 rows=5946329 loops=1) 
->  CTE Scan on date_list  (cost=0.00..20.00 rows=1000 width=16) (actual time=0.110..0.343 rows=19 loops=1) 
->  Index Scan using date on netflows  (cost=0.00..13796.94 rows=660719 width=8) (actual time=0.026..164.117 rows=312965 loops=19) 
Index Cond: ((datetime >= date_list.start_date) AND (datetime <= date_list.end_date)) 
Total runtime: 5645.189 ms

如果您正在执行INNER JOIN,我认为您根本不需要CTE。您可以定义

select
datetime::date,
count(*)
from netflows
group by datetime::date /* or GROUP BY 1 as Postgres extension */

我不明白你为什么需要日期表,除非你想让LEFT JOIN在适当的地方得到零。这意味着一次通过数据。

顺便说一句,我不鼓励你在实体和列中使用日期和日期时间等关键字;即使是合法的,也不值得。

WITH date_list as (
SELECT t                  AS start_date
,(t + interval '1d') AS end_date
FROM  (
SELECT generate_series((min(datetime))::date
,(max(datetime))::date
,'1d') AS t
FROM   netflows
) x
)
SELECT d.start_date
,count(*) AS ct
FROM   date_list     d
LEFT   JOIN netflows n ON n.datetime >= d.start_date
AND n.datetime <  d.end_date
GROUP  BY d.start_date;

并为你的索引使用一个合适的名称(@Andrew已经暗示):

CREATE INDEX netflows_date_idx ON netflows (datetime);

要点

  • 假设你想在日历的每一天都排一排,就像@Andrew在他的回答中提到的那样,我用LEFT JOIN替换了JOIN

  • 一个查询中从网流中获取min()max()要高效得多。

  • 简化型铸造。

  • 固定了日期范围。对于像'2012-12-06 23:59:59.123'这样的时间戳,您的代码将失败。

在一张大桌子上测试了这一点,性能很好
关于最初的问题:毫无疑问,缓存效果是意料之中的,尤其是在RAM有限的情况下。

相关内容

  • 没有找到相关文章

最新更新