PostgreSQL query to delete records with overlapping times while keeping the earliest?



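I'm trying to find a way to delete records with overlapping times, but I haven't been able to find a simple and elegant way to delete all but one of the overlapping records. This question is similar to another question, but with a few differences. Our table looks like this: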

╔════╤═══════════════════════════════════════╤══════════════════════════════════════╤════════╤═════════╗
║ id │ start_time                            │ end_time                             │ bar    │ baz     ║
╠════╪═══════════════════════════════════════╪══════════════════════════════════════╪════════╪═════════╣
║ 0  │ Mon, 18 Dec 2017 16:08:33 UTC +00:00  │ Mon, 18 Dec 2017 17:08:33 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 1  │ Mon, 18 Dec 2017 16:08:32 UTC +00:00  │ Mon, 18 Dec 2017 17:08:32 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 2  │ Mon, 18 Dec 2017 16:08:31 UTC +00:00  │ Mon, 18 Dec 2017 17:08:31 UTC +00:00 │ "spam" │ "bacon" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 3  │ Mon, 18 Dec 2017 16:08:30 UTC +00:00  │ Mon, 18 Dec 2017 17:08:30 UTC +00:00 │ "ham"  │ "eggs"  ║
╚════╧═══════════════════════════════════════╧══════════════════════════════════════╧════════╧═════════╝

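In the example above, all records have overlapping times, where overlapping simply means that the time range defined by a record's start_time and end_time (inclusive) covers or extends over part of another record's range. For this problem, however, we are interested only in records that both overlap in time and have matching bar and baz columns (rows 0, 1 and 3 above). Once those records are found, we want to delete all but the earliest one, leaving only records 2 and 3 in the table above: record 2 has no matching bar and baz columns, and record 3 does and has the earliest start and end times.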

Here is what I have so far:

    delete from foos where id in (
      select
        foo_one.id
      from
        foos foo_one
      where
        user_id = 42
        and exists (
          select
            1
          from
            foos foo_two
          where
            tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') &&
            tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
            and foo_one.bar = foo_two.bar
            and foo_one.baz = foo_two.baz
            and user_id = 42
            and foo_one.id != foo_two.id
        )
    );

Thanks for reading!

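Update: I found a solution that works for me. Basically, I can apply the window function row_number() over partitions of the table grouped by the bar and baz fields, and then add a WHERE clause to the DELETE statement that excludes the first entry (the one with the smallest id).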

    delete from foos where id in (
      select id from (
        select
          foo_one.id,
          row_number() over (partition by bar, baz order by id asc)
        from
          foos foo_one
        where
          user_id = 42
          and exists (
            select
              *
            from
              foos foo_two
            where
              tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') &&
              tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
              and foo_one.id != foo_two.id
          )
      ) foos where row_number <> 1
    );
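
Note that ordering by id keeps the row with the smallest id, which in the sample table is record 0 rather than record 3, the one with the earliest start_time. The following is a minimal sketch of the "keep the earliest" variant, not a drop-in PostgreSQL statement. It assumes Python's built-in sqlite3 module, timestamps stored as ISO-8601 text (so string comparison follows chronological order), and it leaves out the user_id filter because the sample table does not show that column. A row is deleted when another row with the same bar and baz overlaps it and starts earlier (ties broken by id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foos (id INTEGER PRIMARY KEY, start_time TEXT, "
    "end_time TEXT, bar TEXT, baz TEXT)"
)
# The four rows from the sample table, as ISO-8601 text.
conn.executemany(
    "INSERT INTO foos VALUES (?, ?, ?, ?, ?)",
    [
        (0, "2017-12-18 16:08:33", "2017-12-18 17:08:33", "ham", "eggs"),
        (1, "2017-12-18 16:08:32", "2017-12-18 17:08:32", "ham", "eggs"),
        (2, "2017-12-18 16:08:31", "2017-12-18 17:08:31", "spam", "bacon"),
        (3, "2017-12-18 16:08:30", "2017-12-18 17:08:30", "ham", "eggs"),
    ],
)

# Delete every row that has an overlapping, earlier-starting row
# with the same bar and baz.
conn.execute(
    """
    DELETE FROM foos WHERE id IN (
        SELECT later.id
        FROM foos AS later
        JOIN foos AS earlier
          ON earlier.bar = later.bar
         AND earlier.baz = later.baz
         AND earlier.id != later.id
         -- inclusive overlap test, like tsrange(..., '[]') && tsrange(..., '[]')
         AND earlier.start_time <= later.end_time
         AND later.start_time <= earlier.end_time
         -- "earlier" really starts first; ties are broken by id
         AND (earlier.start_time < later.start_time
              OR (earlier.start_time = later.start_time
                  AND earlier.id < later.id))
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM foos ORDER BY id")]
print(remaining_ids)  # [2, 3]
```

With longer chains of overlaps (A overlaps B, B overlaps C, but A does not overlap C) this rule and the row_number() approach can give different results, so which one is right depends on how the groups are meant to be formed.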

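First, a small note: you really should provide more information. I understand that you may not want to show some of the real columns from your business, but leaving them out makes it much harder to understand what you want.

Still, I'll offer some tips on the subject. I hope they help you and anyone else with a similar problem.

  1. You need to define clearly what overlapping means. It can mean many different things to different people.

Look at the following events: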

<--a-->
<---- b ---->
<---- c ---->
<-- d -->
<---- e ---->
<------- f -------->
<--- g --->

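If you define overlap the way Google does, "extend over so as to cover partly", then "b", "d", "e" and "f" partially overlap event "c". If you define overlap as covering an event completely, then "c" overlaps "d", and "f" overlaps "b", "c" and "d".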
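
The two definitions can be written as two small predicates. This is only a sketch: the function names are made up for illustration, and the intervals are hypothetical (start_hour, end_hour) pairs with inclusive bounds.

```python
def overlaps_partly(first, second):
    """True when the two closed intervals share at least one instant."""
    return first[0] <= second[1] and second[0] <= first[1]


def covers_fully(outer, inner):
    """True when outer contains inner entirely."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Hypothetical (start_hour, end_hour) intervals.
c = (7, 19)
d = (9, 17)   # lies inside c
e = (17, 23)  # starts inside c, ends after it

results = {
    "c overlaps e": overlaps_partly(c, e),
    "c covers e": covers_fully(c, e),
    "c overlaps d": overlaps_partly(c, d),
    "c covers d": covers_fully(c, d),
}
print(results)
```

Under the first definition both d and e overlap c; under the second only d counts, because e extends past the end of c.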

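  2. Deleting groups can be a problem. In the case above, what should we do? Should we delete "b", "c" and "d" and keep only "f"? Should we sum their values? Take the average? That is a decision to make column by column, and the meaning of each column matters a great deal. So I can't help you with "bar" and "baz".

  3. So, to guess at what you really want, I'm creating a similar events table containing id, user_id, start_time, end_time and a name: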

    create table events (
    id integer,
    user_id integer,
    start_time timestamp,
    end_time timestamp,
    name varchar(100)
    );
    

I'm adding sample values (this is MySQL syntax; in PostgreSQL, write each literal as timestamp '2017-10-09 01:00:00' instead):

    insert into events
      (id, user_id, start_time, end_time, name)
    values
      (1, 1000, timestamp('2017-10-09 01:00:00'), timestamp('2017-10-09 04:00:00'), 'a'),
      (2, 1000, timestamp('2017-10-09 03:00:00'), timestamp('2017-10-09 15:00:00'), 'b'),
      (3, 1000, timestamp('2017-10-09 07:00:00'), timestamp('2017-10-09 19:00:00'), 'c'),
      (4, 1000, timestamp('2017-10-09 09:00:00'), timestamp('2017-10-09 17:00:00'), 'd'),
      (5, 1000, timestamp('2017-10-09 17:00:00'), timestamp('2017-10-09 23:00:00'), 'e'),
      (6, 1000, timestamp('2017-10-09 02:30:00'), timestamp('2017-10-09 22:00:00'), 'f'),
      (7, 1000, timestamp('2017-10-09 17:30:00'), timestamp('2017-10-10 02:00:00'), 'g');

Now we can play with some nice queries.

List every event that completely covers another event:

    select
      -- event name
      event_1.name as event_name,
      -- names of the events that this event covers
      GROUP_CONCAT(event_2.name) as overlaps_names
    from events as event_1
    inner join events as event_2
      on event_1.user_id = event_2.user_id
      and event_1.id != event_2.id
      and (
        -- event 2 starts at or after the start of event 1
        event_2.start_time >= event_1.start_time and
        -- event 2 ends at or before the end of event 1
        event_2.end_time <= event_1.end_time
      )
    group by
      event_1.name

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| c          | d              |
| f          | b,d,c          |
+------------+----------------+

To detect partial overlaps, you need something like this:

    select
      -- event name
      event_1.name as event_name,
      -- names of the events that overlap this event
      GROUP_CONCAT(event_2.name) as overlaps_names
    from events as event_1
    inner join events as event_2
      on event_1.user_id = event_2.user_id
      and event_1.id != event_2.id
      and (
        (
          -- event 2 starts at or after the start of event 1
          event_2.start_time >= event_1.start_time and
          -- event 2 starts at or before the end of event 1
          event_2.start_time <= event_1.end_time
        ) or
        (
          -- event 2 ends at or after the start of event 1
          event_2.end_time >= event_1.start_time and
          -- event 2 ends at or before the end of event 1
          event_2.end_time <= event_1.end_time
        )
      )
    group by
      event_1.name

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| a          | b,f            |
| b          | c,d,a          |
| c          | b,d,e,g        |
| d          | b,e            |
| e          | f,g,d,c        |
| f          | a,g,b,d,c,e    |
| g          | c,e,f          |
+------------+----------------+
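
One thing to be aware of: the condition above matches event_2 only when its start or its end falls inside event_1, so an event that completely contains event_1 is not reported (for example, "f" is missing from the row for "b"). If containers should count as overlaps too, a commonly used symmetric test is "each range starts no later than the other ends". The sketch below compares the two conditions in plain Python on the sample times; the helper names are made up for illustration.

```python
from datetime import datetime


def ts(day, hour, minute=0):
    """Timestamp in October 2017, matching the sample data."""
    return datetime(2017, 10, day, hour, minute)


events = {
    "a": (ts(9, 1), ts(9, 4)),
    "b": (ts(9, 3), ts(9, 15)),
    "c": (ts(9, 7), ts(9, 19)),
    "d": (ts(9, 9), ts(9, 17)),
    "e": (ts(9, 17), ts(9, 23)),
    "f": (ts(9, 2, 30), ts(9, 22)),
    "g": (ts(9, 17, 30), ts(10, 2)),
}


def endpoint_inside(first, second):
    """Join condition from the query: second starts or ends inside first."""
    (s1, e1), (s2, e2) = first, second
    return (s1 <= s2 <= e1) or (s1 <= e2 <= e1)


def ranges_intersect(first, second):
    """Symmetric test: the two closed ranges share at least one instant."""
    (s1, e1), (s2, e2) = first, second
    return s1 <= e2 and s2 <= e1


def overlapping(name, predicate):
    """Names of all other events that overlap `name` under `predicate`."""
    return sorted(
        other for other in events
        if other != name and predicate(events[name], events[other])
    )


by_endpoint = overlapping("b", endpoint_inside)
by_intersection = overlapping("b", ranges_intersect)
print(by_endpoint)      # ['a', 'c', 'd']
print(by_intersection)  # ['a', 'c', 'd', 'f']
```

In SQL the symmetric test would be event_2.start_time <= event_1.end_time and event_1.start_time <= event_2.end_time.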

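Of course, I'm using "group by" only to make the output easier to read. It is also useful if you want to sum or average the overlapping data to update the parent row before deleting. The "group_concat" function may not exist in Postgres or may have a different name (the closest PostgreSQL equivalent is string_agg, which takes the delimiter as a second argument). A "standard SQL" version you can test is:

    select
      -- event name
      event_1.name as event_name,
      -- name of an event that this event covers
      event_2.name as overlaps_name
    from events as event_1
    inner join events as event_2
      on event_1.user_id = event_2.user_id
      and event_1.id != event_2.id
      and (
        -- event 2 starts at or after the start of event 1
        event_2.start_time >= event_1.start_time and
        -- event 2 ends at or before the end of event 1
        event_2.end_time <= event_1.end_time
      )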

Result:

+------------+---------------+
| event_name | overlaps_name |
+------------+---------------+
| f          | b             |
| f          | c             |
| c          | d             |
| f          | d             |
+------------+---------------+
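
Because this version avoids engine-specific functions, it can be checked on another engine. Here is a minimal sketch, assuming Python's built-in sqlite3 module and timestamps stored as ISO-8601 text so that string comparison follows chronological order; neither is part of the original setup. An ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER, user_id INTEGER, "
    "start_time TEXT, end_time TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1000, "2017-10-09 01:00:00", "2017-10-09 04:00:00", "a"),
        (2, 1000, "2017-10-09 03:00:00", "2017-10-09 15:00:00", "b"),
        (3, 1000, "2017-10-09 07:00:00", "2017-10-09 19:00:00", "c"),
        (4, 1000, "2017-10-09 09:00:00", "2017-10-09 17:00:00", "d"),
        (5, 1000, "2017-10-09 17:00:00", "2017-10-09 23:00:00", "e"),
        (6, 1000, "2017-10-09 02:30:00", "2017-10-09 22:00:00", "f"),
        (7, 1000, "2017-10-09 17:30:00", "2017-10-10 02:00:00", "g"),
    ],
)

# Same join condition as the query above: event_2 lies entirely
# within event_1.
pairs = conn.execute(
    """
    SELECT event_1.name, event_2.name
    FROM events AS event_1
    INNER JOIN events AS event_2
        ON event_1.user_id = event_2.user_id
       AND event_1.id != event_2.id
       AND event_2.start_time >= event_1.start_time
       AND event_2.end_time <= event_1.end_time
    ORDER BY event_1.name, event_2.name
    """
).fetchall()
print(pairs)  # [('c', 'd'), ('f', 'b'), ('f', 'c'), ('f', 'd')]
```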

If you are going to do some math, remember the risk of adding the values of "c" and "d" onto "b" and then adding them again onto "f", which would make the value of "f" wrong.

    // should be
    new f = old f + b + old c + d
    new c = old c + b + d // unnecessary if you are going to delete it
    // very common mistake
    new c = old c + b + d // unnecessary but not wrong yet
    new f = new c + b + d = ( old c + b + d ) + b + d // wrong!!

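You can test all of these queries at http://sqlfiddle.com/#!9/1d2455/19 and create your own against the same database. Keep in mind, though, that it runs MySQL, not PostgreSQL. It is still very good for testing standard SQL.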
