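I'm trying to find a way to delete records with overlapping times, but I haven't been able to come up with a simple and elegant way to delete all but one of the overlapping records. This question is similar to an existing one, with a few differences. Our table looks like this: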
╔════╤═══════════════════════════════════════╤══════════════════════════════════════╤════════╤═════════╗
║ id │ start_time │ end_time │ bar │ baz ║
╠════╪═══════════════════════════════════════╪══════════════════════════════════════╪════════╪═════════╣
║ 0 │ Mon, 18 Dec 2017 16:08:33 UTC +00:00 │ Mon, 18 Dec 2017 17:08:33 UTC +00:00 │ "ham" │ "eggs" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 1 │ Mon, 18 Dec 2017 16:08:32 UTC +00:00 │ Mon, 18 Dec 2017 17:08:32 UTC +00:00 │ "ham" │ "eggs" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 2 │ Mon, 18 Dec 2017 16:08:31 UTC +00:00 │ Mon, 18 Dec 2017 17:08:31 UTC +00:00 │ "spam" │ "bacon" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 3 │ Mon, 18 Dec 2017 16:08:30 UTC +00:00 │ Mon, 18 Dec 2017 17:08:30 UTC +00:00 │ "ham" │ "eggs" ║
╚════╧═══════════════════════════════════════╧══════════════════════════════════════╧════════╧═════════╝
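In the example above, all records have overlapping times, where overlapping simply means that the time range defined by a record's start_time and end_time (inclusive) covers or extends over part of another record's range. For this question, however, we are interested only in records that overlap in time and also have matching bar and baz columns (rows 0, 1 and 3 above). Once these records are found, we want to delete all but the earliest one, leaving only records 2 and 3 of the table above: record 2 has no other record with matching bar and baz columns, while record 3 does and has the earliest start and end times.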
Here is what I have so far:
delete from foos where id in (
  select
    foo_one.id
  from
    foos foo_one
  where
    user_id = 42
    and exists (
      select
        1
      from
        foos foo_two
      where
        tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') &&
        tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
        and
        foo_one.bar = foo_two.bar
        and
        foo_one.baz = foo_two.baz
        and
        user_id = 42
        and
        foo_one.id != foo_two.id
    )
);
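Thanks for reading!
Update: I found a solution that works for me. Basically, I can apply the window function row_number() over partitions of the table grouped by the bar and baz fields, and then add a WHERE clause to the DELETE statement that excludes the first entry (the one with the smallest id).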
delete from foos where id in (
  select id from (
    select
      foo_one.id,
      row_number() over(partition by
                          bar,
                          baz
                        order by id asc)
    from
      foos foo_one
    where
      user_id = 42
      and exists (
        select
          *
        from
          foos foo_two
        where
          tsrange(foo_two.start_time::timestamp,
                  foo_two.end_time::timestamp,
                  '[]') &&
          tsrange(foo_one.start_time::timestamp,
                  foo_one.end_time::timestamp,
                  '[]')
          and
          foo_one.id != foo_two.id
      )
  ) foos where row_number <> 1
);
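For anyone who wants to try the row_number() idea without a PostgreSQL instance, here is a minimal sketch using Python's built-in sqlite3 module (it needs an SQLite build with window functions, 3.25 or later). Several things are assumptions on my part rather than part of the query above: the user_id column is added to the sample rows, timestamps are stored as ISO-8601 text, tsrange is replaced by a plain inclusive comparison, the bar and baz match is kept inside the EXISTS, and the partition is ordered by start_time so that the earliest record is the one kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table foos (id integer primary key, user_id integer,"
    " start_time text, end_time text, bar text, baz text)"
)
# Sample rows from the question; user_id = 42 is an assumed value.
rows = [
    (0, 42, "2017-12-18 16:08:33", "2017-12-18 17:08:33", "ham", "eggs"),
    (1, 42, "2017-12-18 16:08:32", "2017-12-18 17:08:32", "ham", "eggs"),
    (2, 42, "2017-12-18 16:08:31", "2017-12-18 17:08:31", "spam", "bacon"),
    (3, 42, "2017-12-18 16:08:30", "2017-12-18 17:08:30", "ham", "eggs"),
]
conn.executemany("insert into foos values (?, ?, ?, ?, ?, ?)", rows)

conn.execute("""
delete from foos where id in (
  select id from (
    select
      f1.id,
      -- number the rows of each (bar, baz) group, earliest start first
      row_number() over (partition by f1.bar, f1.baz
                         order by f1.start_time, f1.id) as rn
    from foos f1
    where f1.user_id = 42
      and exists (
        select 1 from foos f2
        where f2.user_id = f1.user_id
          and f2.bar = f1.bar and f2.baz = f1.baz
          and f2.id <> f1.id
          -- inclusive overlap test, standing in for tsrange(...) && tsrange(...)
          and f2.start_time <= f1.end_time
          and f1.start_time <= f2.end_time
      )
  ) where rn <> 1
)
""")

remaining = [r[0] for r in conn.execute("select id from foos order by id")]
print(remaining)
```

On the sample table this leaves records 2 and 3, which is the outcome described in the question. Ordering the partition by id instead would keep record 0.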
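First, a small tip: you really should provide more information. I understand that you may not want to show some of the real columns from your business, but presented this way it is much harder to understand what you want.
Still, I'll give some hints on the subject. I hope they help you and anyone else with a similar problem.
- You need to define clearly what overlapping means. It can mean quite different things to different people.
Look at the following events: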
<--a-->
    <---- b ---->
        <---- c ---->
          <-- d -->
                <---- e ---->
  <------- f -------->
                       <--- g --->
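If you define overlap the way Google does, "extend over so as to cover partly", then "b", "d", "e" and "f" partially overlap event "c". If you define overlap as covering a whole event, then "c" overlaps "d", and "f" overlaps "b", "c" and "d".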
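The two definitions can be written as two small predicates. This is only a sketch: the Interval type, the function names and the integer endpoints are made up for illustration and do not come from your table.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """Partial overlap: the two inclusive ranges share at least one point."""
    return a.start <= b.end and b.start <= a.end


def covers(a: Interval, b: Interval) -> bool:
    """Full overlap: range a contains the whole of range b."""
    return a.start <= b.start and b.end <= a.end


# Hypothetical endpoints, chosen only to mimic the relations described above.
b = Interval(4, 16)
c = Interval(8, 20)
d = Interval(10, 18)
g = Interval(23, 33)

print(overlaps(b, c), covers(b, c))  # b reaches into c but does not contain it
print(overlaps(c, d), covers(c, d))  # c contains d entirely
print(overlaps(c, g))                # c and g do not touch
```

Note that covers implies overlaps, so the second definition selects a subset of the pairs selected by the first.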
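Deleting groups can be a problem. In the previous case, what should we do? Should we delete "b", "c" and "d" and keep only "f"? Should we sum their values? Take the average? This is a decision to be made column by column, and the meaning of each column matters a great deal. So I can't help you with "bar" and "baz".
So, to guess at what you really want, I'm creating a similar events table with id, user_id, start_time, end_time and name: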
create table events ( id integer, user_id integer, start_time timestamp, end_time timestamp, name varchar(100) );
And I'm adding some sample values:
insert into events
( id, user_id, start_time, end_time, name ) values
( 1, 1000, timestamp('2017-10-09 01:00:00'),timestamp('2017-10-09 04:00:00'), 'a' );
insert into events
( id, user_id, start_time, end_time, name ) values
( 2, 1000, timestamp('2017-10-09 03:00:00'),timestamp('2017-10-09 15:00:00'), 'b' );
insert into events
( id, user_id, start_time, end_time, name ) values
( 3, 1000, timestamp('2017-10-09 07:00:00'),timestamp('2017-10-09 19:00:00'), 'c' );
insert into events
( id, user_id, start_time, end_time, name ) values
( 4, 1000, timestamp('2017-10-09 09:00:00'),timestamp('2017-10-09 17:00:00'), 'd' );
insert into events
( id, user_id, start_time, end_time, name ) values
( 5, 1000, timestamp('2017-10-09 17:00:00'),timestamp('2017-10-09 23:00:00'), 'e' );
insert into events
( id, user_id, start_time, end_time, name ) values
( 6, 1000, timestamp('2017-10-09 02:30:00'),timestamp('2017-10-09 22:00:00'), 'f' );
insert into events
( id, user_id, start_time, end_time, name ) values
( 7, 1000, timestamp('2017-10-09 17:30:00'),timestamp('2017-10-10 02:00:00'), 'g' );
Now we can play with some nice queries.
List all events that completely overlap another event:
select
  # EVENT NAME
  event_1.name as event_name,
  # LIST OF EVENTS THAT THE EVENT COMPLETELY OVERLAPS
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
  on
    event_1.user_id = event_2.user_id
  and
    event_1.id != event_2.id
  and
  (
    # EVENT TWO STARTS AT OR AFTER THE START OF EVENT ONE
    event_2.start_time >= event_1.start_time and
    # EVENT TWO ENDS AT OR BEFORE THE END OF EVENT ONE
    event_2.end_time <= event_1.end_time
  )
group by
  event_1.name
Result:
+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| c | d |
| f | b,d,c |
+------------+----------------+
To detect partial overlaps, you need something like this:
select
  # EVENT NAME
  event_1.name as event_name,
  # LIST OF EVENTS THAT PARTIALLY OVERLAP THE EVENT
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
  on
    event_1.user_id = event_2.user_id
  and
    event_1.id != event_2.id
  and
  (
    (
      # EVENT TWO STARTS INSIDE EVENT ONE
      event_2.start_time >= event_1.start_time and
      event_2.start_time <= event_1.end_time
    ) or
    (
      # EVENT TWO ENDS INSIDE EVENT ONE
      event_2.end_time >= event_1.start_time and
      event_2.end_time <= event_1.end_time
    )
  )
group by
  event_1.name
Result:
+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| a | b,f |
| b | c,d,a |
| c | b,d,e,g |
| d | b,e |
| e | f,g,d,c |
| f | a,g,b,d,c,e |
| g | c,e,f |
+------------+----------------+
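The query above matches event_2 only when its start or its end falls inside event_1, so it does not pair an event with one that contains it entirely (for example, "b" is not paired with "f"). If you also want those pairs, the usual symmetric test is "each event starts no later than the other ends". Below is a sketch of that variant. It runs the same sample data through Python's built-in sqlite3 module, which is my choice for a self-contained example and not the MySQL setup used here. Timestamps are stored as ISO-8601 text so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table events (id integer, user_id integer,"
    " start_time text, end_time text, name text)"
)
# Same sample events as above, with timestamps as ISO-8601 text.
events = [
    (1, 1000, "2017-10-09 01:00:00", "2017-10-09 04:00:00", "a"),
    (2, 1000, "2017-10-09 03:00:00", "2017-10-09 15:00:00", "b"),
    (3, 1000, "2017-10-09 07:00:00", "2017-10-09 19:00:00", "c"),
    (4, 1000, "2017-10-09 09:00:00", "2017-10-09 17:00:00", "d"),
    (5, 1000, "2017-10-09 17:00:00", "2017-10-09 23:00:00", "e"),
    (6, 1000, "2017-10-09 02:30:00", "2017-10-09 22:00:00", "f"),
    (7, 1000, "2017-10-09 17:30:00", "2017-10-10 02:00:00", "g"),
]
conn.executemany("insert into events values (?, ?, ?, ?, ?)", events)

query = """
select event_1.name, event_2.name
from events as event_1
inner join events as event_2
   on event_1.user_id = event_2.user_id
  and event_1.id <> event_2.id
  -- symmetric, inclusive overlap test: also true when one event contains the other
  and event_2.start_time <= event_1.end_time
  and event_1.start_time <= event_2.end_time
order by event_1.name, event_2.name
"""

# Group the pairs in Python instead of relying on GROUP_CONCAT ordering.
overlaps = {}
for name, other in conn.execute(query):
    overlaps.setdefault(name, []).append(other)

for name, others in overlaps.items():
    print(name, ",".join(others))
```

With inclusive bounds, "b" now also lists "f", and "d" also lists "c" and "f". Events "d" and "e" still count as overlapping because they share the instant 17:00.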
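Of course, I'm using "group by" only to make the output easier to read. It is also useful if you want to sum or average the overlapped data to update the parent row before deleting. The GROUP_CONCAT function is specific to MySQL; in Postgres the equivalent aggregate is string_agg. A "standard SQL" version you can test is: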
select
  -- EVENT NAME
  event_1.name as event_name,
  -- EVENT THAT THE EVENT COMPLETELY OVERLAPS
  event_2.name as overlaps_name
from events as event_1
inner join events as event_2
  on
    event_1.user_id = event_2.user_id
  and
    event_1.id <> event_2.id
  and
  (
    -- EVENT TWO STARTS AT OR AFTER THE START OF EVENT ONE
    event_2.start_time >= event_1.start_time and
    -- EVENT TWO ENDS AT OR BEFORE THE END OF EVENT ONE
    event_2.end_time <= event_1.end_time
  )
Result:
+------------+---------------+
| event_name | overlaps_name |
+------------+---------------+
| f | b |
| f | c |
| c | d |
| f | d |
+------------+---------------+
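If you are going to try some math, keep in mind the risk of adding the values of nested events onto one parent and then adding them again onto an outer parent such as "f", which makes the value of "f" wrong.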
// should be
new f = old f + b + old c + d
new c = old c + b + d // unnecessary if you are going to delete it
// very common mistake
new c = old c + b + d // unnecessary but not wrong yet
new f = new c + b + d = ( old c + b + d ) + b + d // wrong!!
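The same pitfall in a small runnable sketch. The numeric values are invented, and the covered mapping follows the full-overlap result table above ("c" covers "d", "f" covers "b", "c" and "d"). The point is only that every sum should read from a snapshot of the old values.

```python
# Hypothetical values for each event; only the structure matters.
old = {"b": 1, "c": 10, "d": 100, "f": 1000}

# Which events each parent completely overlaps (taken from the result table).
covered = {"c": ["d"], "f": ["b", "c", "d"]}

# Correct: every sum reads from the untouched snapshot `old`.
new = dict(old)
for parent, children in covered.items():
    new[parent] = old[parent] + sum(old[child] for child in children)

# Common mistake: updating in place, so "f" adds the already-updated "c"
# and therefore counts "d" twice.
wrong = dict(old)
for parent, children in covered.items():
    wrong[parent] = wrong[parent] + sum(wrong[child] for child in children)

print(new["f"], wrong["f"])
```

In SQL the same idea usually means computing the sums in a subquery or temporary table first, and only then running the update and the delete.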
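You can test all of these queries at http://sqlfiddle.com/#!9/1d2455/19 and write your own against the same database. Keep in mind, though, that it is MySQL, not PostgreSQL. It is still perfectly good for testing standard SQL.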