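SQL self join to compare data across different days

I want to compare products day by day. The goal is to get the differences between day 1 and day 2, between day 2 and day 3, and so on.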

Product  EventTime
X1       T1
X2       T1
X1       T2
X3       T2
X4       T10

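Notes

Event times are not contiguous (there could be day 1, day 2, and then day 10).
A product is represented by multiple attributes, but to illustrate the problem I use a single field.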

Expected result

Product  Action   EventTime
X1       Added    T1
X2       Added    T1
X2       Removed  T2
X3       Added    T2
X1       Removed  T10
X3       Removed  T10
X4       Added    T10

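My idea is to assign row numbers to these records and do a full outer join to find the differences, but I cannot get the right result.

My thought process: let's rank by event time.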

Product  EventTime  RNK
X1       T1         1
X2       T1         1
X1       T2         2
X3       T2         2
X4       T10        3

Suppose we run:

select *
from dataset d1
full join dataset d2
  on d1.product = d2.product
 and d1.RNK = d2.RNK - 1
where d1.product is null or d2.product is null

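This does not give me the correct result. But if I first clean up the data so that the two sides of the join become

Product  EventTime  RNK
X1       T1         1    (crossed out)
X2       T1         1    (crossed out)
X1       T2         2
X3       T2         2
X4       T10        3

Product  EventTime  RNK
X1       T1         1
X2       T1         1
X1       T2         2
X3       T2         2
X4       T10        3    (crossed out)

and then full join the two data sets above, I get the correct result, but performance is slow. Basically, I removed the first rank from one side and the last rank from the other.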

Any ideas on how to diff two sets in day order?

Hmmm. This looks like a gaps-and-islands problem. You can get the periods during which each product is present using:

select product, min(time), max(time)
from (select t.*,
             -- dense_rank, not row_number: several products can share one event time
             dense_rank() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p
      from t
     ) t
group by product, (seqnum_p - seqnum);
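To see why grouping by the difference of the two sequence numbers isolates each period, here is a minimal Python sketch of the same idea. It is only an illustration, not part of the SQL answer: the function name product_islands is made up, integers stand in for T1, T2 and T10, and it assumes each (product, time) pair occurs at most once.

```python
from itertools import groupby

# Sample rows from the question as (product, event_time); 1, 2, 10 stand in for T1, T2, T10.
rows = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]

def product_islands(rows):
    """Return (product, first_time, last_time) for each run of consecutive event times."""
    # Dense rank of every distinct event time, e.g. 1 -> 1, 2 -> 2, 10 -> 3.
    rank = {t: i for i, t in enumerate(sorted({t for _, t in rows}), start=1)}
    islands = []
    for product, grp in groupby(sorted(rows), key=lambda r: r[0]):
        times = [t for _, t in grp]
        # Per-product sequence number minus the global rank stays constant inside an island.
        keyed = [(seq - rank[t], t) for seq, t in enumerate(times, start=1)]
        for _, run in groupby(keyed, key=lambda kt: kt[0]):
            run_times = [t for _, t in run]
            islands.append((product, run_times[0], run_times[-1]))
    return islands

print(product_islands(rows))
```

When a product skips an event time, the global rank jumps while the per-product counter does not, so the difference changes and a new group starts.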

Getting the removal time is a bit trickier. You need a lead()-style look-ahead and some fancy aggregation:

select product, min(time), max(time),
       max(next_time) keep (dense_rank first order by time desc) as next_time
from (select t.*,
             dense_rank() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p,
             -- earliest event time strictly after this row's time
             min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
      from t
     ) t
group by product, (seqnum_p - seqnum);

This may be sufficient for your needs. But you can also unpivot it:

with cte as (
      select product, min(time) as min_time,
             max(next_time) keep (dense_rank first order by time desc) as next_time
      from (select t.*,
                   dense_rank() over (order by time) as seqnum,
                   row_number() over (partition by product order by time) as seqnum_p,
                   min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
            from t
           ) t
      group by product, (seqnum_p - seqnum)
     )
select product, 'Added', min_time
from cte
union all
select product, 'Removed', next_time
from cte
where next_time is not null;  -- a product still present at the last event has no removal
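The unpivot step can be pictured with a short Python sketch. It is an illustration under assumptions, not the answer's method: island_events is a hypothetical helper, the islands are the (product, first_time, last_time) periods for the sample data, and integers again stand in for T1, T2 and T10.

```python
from bisect import bisect_right

def island_events(islands, all_times):
    """Turn (product, first_time, last_time) islands into (product, action, time) events."""
    times = sorted(set(all_times))
    events = []
    for product, first, last in islands:
        events.append((product, "Added", first))
        # The removal is observed at the first event time after the island ends.
        i = bisect_right(times, last)
        if i < len(times):  # no later event means the product is still present
            events.append((product, "Removed", times[i]))
    # Order by time, then product, as in the expected result.
    return sorted(events, key=lambda e: (e[2], e[0]))

islands = [("X1", 1, 2), ("X2", 1, 1), ("X3", 2, 2), ("X4", 10, 10)]
for event in island_events(islands, [1, 2, 10]):
    print(event)
```

Each island contributes one Added row at its start and, unless it runs to the last event time, one Removed row at the next event time.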

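One way to approach this is to treat it as a "sparse data" problem. That is, you have time events, but not every product is represented at every event.

A partitioned outer join can fill in the sparse data, producing a data set in which every product is represented at every event time. Then you can more easily see what was added and removed at each time.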

with event_table (product, event_time) as 
( SELECT 'X1',  trunc(sysdate)+1 FROM DUAL UNION ALL
SELECT 'X2',  trunc(sysdate)+1 FROM DUAL UNION ALL 
SELECT 'X1',  trunc(sysdate)+2 FROM DUAL UNION ALL  
SELECT 'X3',  trunc(sysdate)+2 FROM DUAL UNION ALL  
SELECT 'X4',  trunc(sysdate)+10 FROM DUAL ),
-- solution begins here
-- start by getting a distinct list of event times
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
-- Next, do a partitioned right join to ensure that every product is represented at every event time.  If the row is sparse data that was added by the right join, et.event_time will be null.
-- We use the lag() function to see what the product looked like at the last event and
-- compare with the current event.
-- NULL -> NULL ==> no change
-- NOT NULL -> NOT NULL ==> no change
-- NULL -> NOT NULL ==> added
-- NOT NULL -> NULL ==> removed
sparse_data_filled as (
  select dt.event_time, et.product,
         case when lag(et.event_time) over ( partition by et.product order by dt.event_time ) is null then
                -- product wasn't present during last event
                case when et.event_time is null then
                       -- product still is not present
                       null  -- no change
                     else
                       -- product is present now and was not before
                       'Added'
                end
              else
                -- product was present during last event
                case when et.event_time is null then
                       -- product is no longer present
                       'Removed'
                     else
                       -- product is still present
                       null   -- no change
                end
         end message
  from event_table et partition by (product)
  right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from sparse_data_filled
-- filter out the non-changes
where message is not null
order by event_time, product
;

+------------+---------+---------+
| EVENT_TIME | PRODUCT | MESSAGE |
+------------+---------+---------+
| 07-AUG-20  | X1      | Added   |
| 07-AUG-20  | X2      | Added   |
| 08-AUG-20  | X2      | Removed |
| 08-AUG-20  | X3      | Added   |
| 16-AUG-20  | X1      | Removed |
| 16-AUG-20  | X3      | Removed |
| 16-AUG-20  | X4      | Added   |
+------------+---------+---------+
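For readers unfamiliar with partitioned outer joins, the following Python sketch mimics the same logic outside the database. It is a rough analogy only: presence_changes is a hypothetical name, integers stand in for the event dates, and the nested loops play the role of the filled-in (product, event time) grid.

```python
def presence_changes(rows):
    """Compare each product's presence at every event time with the previous event time."""
    times = sorted({t for _, t in rows})
    products = sorted({p for p, _ in rows})
    present = set(rows)
    changes = []
    for product in products:
        was_present = False  # before the first event nothing is present
        for t in times:  # every product is visited at every event time
            is_present = (product, t) in present
            if is_present and not was_present:
                changes.append((t, product, "Added"))
            elif was_present and not is_present:
                changes.append((t, product, "Removed"))
            was_present = is_present
    return sorted(changes)

rows = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]
for change in presence_changes(rows):
    print(change)
```

The was_present variable plays the role of lag(), and skipping the unchanged cases corresponds to the final "message is not null" filter.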

A more compact, solution-only version (no test data):

WITH
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
changes as (
  select dt.event_time, et.product,
         case nvl2(et.event_time,1,0) - nvl2(lag(et.event_time) over ( partition by et.product order by dt.event_time ),1,0)
           when +1 then 'Added'
           when -1 then 'Removed'
         end message
  from event_table et partition by (product)
  right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from changes
where message is not null
order by event_time, product
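The nvl2 arithmetic in the compact version encodes presence as 1 and absence as 0, so the difference between the current and previous value is +1, -1 or 0. A tiny Python sketch of that mapping (classify is a hypothetical name used only for illustration):

```python
def classify(now_present, before_present):
    """Map the presence difference to a label, mirroring nvl2(x, 1, 0) - nvl2(lag(x), 1, 0)."""
    delta = int(now_present) - int(before_present)
    # +1 -> 'Added', -1 -> 'Removed', 0 -> None (no change, filtered out later).
    return {1: "Added", -1: "Removed"}.get(delta)

print(classify(True, False))   # present now, absent before
print(classify(False, True))   # absent now, present before
print(classify(True, True))    # unchanged
```

A case expression without an else branch returns null for the 0 case, which is why the SQL needs no explicit "no change" branch.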
