I want to compare products day by day. The goal is to get the differences between day 1 and day 2, between day 2 and day 3, and so on.
Product EventTime
X1 T1
X2 T1
X1 T2
X3 T2
X4 T10
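Notes
- Event times are not contiguous (there may be day 1, day 2, and then day 10).
- A product is represented by multiple attributes, but I use a single field here to illustrate the problem.
Expected result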
Product Action EventTime
X1 Added T1
X2 Added T1
X2 Removed T2
X3 Added T2
X1 Removed T10
X3 Removed T10
X4 Added T10
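My idea was to assign row numbers to these records and do a full outer join to find the differences, but I cannot get the correct result.
My thought process: let's rank the rows by event time.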
Product EventTime RNK
X1 T1 1
X2 T1 1
X1 T2 2
X3 T2 2
X4 T10 3
If we run
select
*
from
dataset d1
full join
dataset d2
on d1.product = d2.product
and d1.RNK = d2.RNK - 1
where
d1.product is null or d2.product is null
it does not give me the correct result. But if I first clean up the data so that it becomes
Product EventTime RNK
X1 T1 1 (crossed out)
X2 T1 1 (crossed out)
X1 T2 2
X3 T2 2
X4 T10 3
Product EventTime RNK
X1 T1 1
X2 T1 1
X1 T2 2
X3 T2 2
X4 T10 3 (crossed out)
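and do a full join on the data sets above, I get the correct result, but performance is slow. Basically, I removed the first rank and the last rank.
Any ideas on how to diff two sets sequentially by day?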
Hmm. This looks like a gaps-and-islands problem. You can get the time periods for each product using:
select product, min(time), max(time)
from (select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by product order by time) as seqnum_p
from t
) t
group by product, (seqnum_p - seqnum);
Getting the removal time is a bit tricky. You need to use lead()
and some fancy aggregation:
select product, min(time), max(time),
max(next_time) keep (dense_rank first order by time desc) as next_time
from (select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by product order by time) as seqnum_p,
min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
from t
) t
group by product, (seqnum_p - seqnum);
This might be sufficient for your needs. But you can unpivot:
with cte as (
select product, min(time) as min_time,
max(next_time) keep (dense_rank first order by time desc) as next_time
from (select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by product order by time) as seqnum_p,
min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
from t
) t
group by product, (seqnum_p - seqnum)
)
select product, 'Added', min_time
from cte
union all
select product, 'Removed', next_time
from cte;
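One way to approach this is to treat it as a "sparse data" problem. That is, you have time events, but not every product is represented at every event.
A partitioned outer join can fill in the sparse data, producing a data set in which every product is represented at every event time. Then you can more easily see what was added and what was removed at each time.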
with event_table (product, event_time) as
( SELECT 'X1', trunc(sysdate)+1 FROM DUAL UNION ALL
SELECT 'X2', trunc(sysdate)+1 FROM DUAL UNION ALL
SELECT 'X1', trunc(sysdate)+2 FROM DUAL UNION ALL
SELECT 'X3', trunc(sysdate)+2 FROM DUAL UNION ALL
SELECT 'X4', trunc(sysdate)+10 FROM DUAL ),
-- solution begins here
-- start by getting a distinct list of event times
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
-- Next, do a partitioned right join to ensure that every product is represented at every event time. If the row is sparse data that was added by the right join, et.event_time will be null.
-- We use the lag() function to see what the product looked like at the last event and
-- compare with the current event.
-- NULL -> NULL ==> no change
-- NOT NULL -> NOT NULL ==> no change
-- NULL -> NOT NULL ==> added
-- NOT NULL -> NULL ==> removed
sparse_data_filled as (
select dt.event_time, et.product,
case when lag(et.event_time ) over ( partition by et.product order by dt.event_time ) is null then
-- product wasn't present during last event
case when et.event_time is null then
-- product still is not present
null -- no change
else
-- product is present now and was not before
'Added'
end
else
-- product was present during last event
case when et.event_time is null then
-- product is no longer present
'Removed'
else
-- product is still present
null -- no change
end
end message
from event_table et partition by (product)
right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from sparse_data_filled
-- filter out the non-changes
where message is not null
order by event_time, product
;
+------------+---------+---------+
| EVENT_TIME | PRODUCT | MESSAGE |
+------------+---------+---------+
| 07-AUG-20  | X1      | Added   |
| 07-AUG-20  | X2      | Added   |
| 08-AUG-20  | X2      | Removed |
| 08-AUG-20  | X3      | Added   |
| 16-AUG-20  | X1      | Removed |
| 16-AUG-20  | X3      | Removed |
| 16-AUG-20  | X4      | Added   |
+------------+---------+---------+
A more compact, solution-only version (no test data):
WITH
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
changes as (
select dt.event_time, et.product,
case nvl2(et.event_time,1,0) - nvl2(lag(et.event_time ) over ( partition by et.product order by dt.event_time ),1,0)
when +1 then 'Added'
when -1 then 'Removed'
end message
from event_table et partition by (product)
right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from changes
where message is not null
order by event_time, product
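
For comparison, the rank-and-full-join idea from the question can probably be made to work without trimming the first and last ranks by hand. The following is only a sketch under assumptions, not a tuned solution: it reuses the sample event_table data, and the CTE names ranked and event_times and the aliases d_prev, d_cur and t are illustrative, not from the original posts. Each row is full-joined to the same product at the next rank. An unmatched d_cur row is an addition, and an unmatched d_prev row is a removal dated at the following event time. The inner join to event_times drops the rows of the last event, which have no following event.

```sql
with event_table (product, event_time) as
( select 'X1', trunc(sysdate)+1  from dual union all
  select 'X2', trunc(sysdate)+1  from dual union all
  select 'X1', trunc(sysdate)+2  from dual union all
  select 'X3', trunc(sysdate)+2  from dual union all
  select 'X4', trunc(sysdate)+10 from dual ),
-- number the distinct event times 1, 2, 3, ... regardless of calendar gaps
ranked as (
  select product, event_time,
         dense_rank() over (order by event_time) as rnk
  from event_table ),
-- one row per event time, used to date each change
event_times as ( select distinct event_time, rnk from ranked )
select coalesce(d_cur.product, d_prev.product) as product,
       case when d_prev.product is null then 'Added' else 'Removed' end as action,
       t.event_time
from ranked d_prev
full join ranked d_cur
  on d_cur.product = d_prev.product
 and d_cur.rnk = d_prev.rnk + 1
-- a removal is dated at the event after the one where the product was last seen
join event_times t
  on t.rnk = coalesce(d_cur.rnk, d_prev.rnk + 1)
-- keep only rows that exist on one side of the full join
where d_prev.product is null or d_cur.product is null
order by t.event_time, coalesce(d_cur.product, d_prev.product);
```

On the sample data this should return the same seven changes as the expected result in the question. The sketch assumes each (product, event_time) pair appears at most once; duplicates would repeat the corresponding change rows.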