SQL self join to compare data across different days



I want to compare products day by day. The goal is to get the differences between day 1 and day 2, between day 2 and day 3, and so on.

Product  EventTime
X1       T1
X2       T1
X1       T2
X3       T2
X4       T10

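Notes

  • Event times are not contiguous (there may be day 1 and day 2, and then day 10).
  • A product is represented by multiple attributes, but to illustrate the problem I use a single field.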

Expected result

Product  Action   EventTime
X1       Added    T1
X2       Added    T1
X2       Removed  T2
X3       Added    T2
X1       Removed  T10
X3       Removed  T10
X4       Added    T10
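
To make the expected result concrete, here is a minimal Python sketch (not part of the original question) that treats each event time as a snapshot of products and diffs consecutive snapshots. The `rows` list and the `diff_snapshots` helper are hypothetical names chosen for illustration, and the times T1, T2 and T10 are encoded as the integers 1, 2 and 10.

```python
from collections import defaultdict


def diff_snapshots(rows):
    """Return (product, action, event_time) changes between consecutive snapshots."""
    # Group the products seen at each event time into one snapshot per time.
    snapshots = defaultdict(set)
    for product, event_time in rows:
        snapshots[event_time].add(product)

    changes = []
    previous = set()  # nothing exists before the first event time
    for event_time in sorted(snapshots):
        current = snapshots[event_time]
        for product in sorted(current - previous):
            changes.append((product, "Added", event_time))
        for product in sorted(previous - current):
            changes.append((product, "Removed", event_time))
        previous = current
    return changes


rows = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]
for change in diff_snapshots(rows):
    print(change)
```

Only adjacent event times are compared, so the gap between T2 and T10 does not matter.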

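My idea was to assign row numbers to these records and do a full outer join to find the differences, but I could not get the correct result.

My thought process: let's rank the rows by event time.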

Product  EventTime  RNK
X1       T1         1
X2       T1         1
X1       T2         2
X3       T2         2
X4       T10        3

If we run

select 
* 
from 
dataset d1 
full join 
dataset d2
on d1.product = d2.product
and d1.RNK = d2.RNK - 1
where
d1.product is null or d2.product is null

it does not give me the correct result. But if I first clean the data so that it becomes

Product  EventTime  RNK
X1       T1         1    (crossed out)
X2       T1         1    (crossed out)
X1       T2         2
X3       T2         2
X4       T10        3

Product  EventTime  RNK
X1       T1         1
X2       T1         1
X1       T2         2
X3       T2         2
X4       T10        3    (crossed out)

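and do a full join on the two data sets above, I get the correct result, but performance is slow. Basically, I removed the first rank and the last rank.

Any ideas on how to diff two sets in day order?

Hmm. This looks like a gaps-and-islands problem. You can get the time periods for each product using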

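select product, min(time), max(time)
from (select t.*,
             -- dense_rank() rather than row_number(): several products share an
             -- event time, so tied rows must get the same sequence number
             dense_rank() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p
      from t
     ) t
group by product, (seqnum_p - seqnum);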
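
The sequence-difference trick can be hard to picture, so here is a small Python sketch of the same idea (an illustration, not the answer's own code). It assumes each product appears at most once per event time and ranks the distinct event times densely; `presence_periods` is a hypothetical helper name.

```python
from itertools import groupby


def presence_periods(rows):
    """Return (product, first_time, last_time) for each unbroken run of event times."""
    # Dense rank of each distinct event time: 1, 2, 3, ...
    distinct_times = sorted({t for _, t in rows})
    time_rank = {t: i for i, t in enumerate(distinct_times, start=1)}

    periods = []
    ordered = sorted(rows)  # by product, then by time
    for product, group in groupby(ordered, key=lambda r: r[0]):
        times = [t for _, t in group]
        # Per-product sequence minus global rank stays constant while the
        # product is present at consecutive event times (one "island").
        keyed = [(seq - time_rank[t], t) for seq, t in enumerate(times, start=1)]
        for _, island in groupby(keyed, key=lambda k: k[0]):
            island_times = [t for _, t in island]
            periods.append((product, island_times[0], island_times[-1]))
    return periods


rows = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]
for period in presence_periods(rows):
    print(period)
```

A product that disappears and later returns yields two separate periods, because the difference changes when an event time is skipped.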

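Getting the removal time is a bit trickier. You need a look-ahead to the next event time (the query below uses a windowed min() rather than lead()) and some fancy aggregation: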

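select product, min(time), max(time),
       -- next_time taken from the latest row of each island
       max(next_time) keep (dense_rank first order by time desc) as next_time
from (select t.*,
             -- dense_rank() so that products sharing an event time get the same number
             dense_rank() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p,
             -- earliest event time strictly after this row (time must be a DATE or TIMESTAMP)
             min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
      from t
     ) t
group by product, (seqnum_p - seqnum);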

That may be enough for your needs. But you can also unpivot the result:

with cte as (
      select product, min(time) as min_time,
             max(next_time) keep (dense_rank first order by time desc) as next_time
      from (select t.*,
                   -- dense_rank() so that products sharing an event time get the same number
                   dense_rank() over (order by time) as seqnum,
                   row_number() over (partition by product order by time) as seqnum_p,
                   min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
            from t
           ) t
      group by product, (seqnum_p - seqnum)
     )
select product, 'Added' as action, min_time as event_time
from cte
union all
select product, 'Removed', next_time
from cte
where next_time is not null;  -- a product present at the last event time has no removal row

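One way to do this is to treat it as a "sparse data" problem. That is, you have time events, but not every product is represented at every event.

A partitioned outer join can fill in the sparse data, producing a data set in which every product is represented at every event time. You can then more easily see what was added and removed at each time.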

with event_table (product, event_time) as 
( SELECT 'X1',  trunc(sysdate)+1 FROM DUAL UNION ALL
SELECT 'X2',  trunc(sysdate)+1 FROM DUAL UNION ALL 
SELECT 'X1',  trunc(sysdate)+2 FROM DUAL UNION ALL  
SELECT 'X3',  trunc(sysdate)+2 FROM DUAL UNION ALL  
SELECT 'X4',  trunc(sysdate)+10 FROM DUAL ),
-- solution begins here
-- start by getting a distinct list of event times
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
-- Next, do a partitioned right join to ensure that every product is represented at every event time.  If the row is sparse data that was added by the right join, et.event_time will be null.
-- We use the lag() function to see what the product looked like at the last event and
-- compare with the current event.
-- NULL -> NULL ==> no change
-- NOT NULL -> NOT NULL ==> no change
-- NULL -> NOT NULL ==> added
-- NOT NULL -> NULL ==> removed
sparse_data_filled as (
select dt.event_time, et.product,
case when lag(et.event_time ) over ( partition by et.product order by dt.event_time ) is null then
-- product wasn't present during last event
case when et.event_time is null then
-- product still is not present
null  -- no change
else
-- product is present now and was not before
'Added'
end
else
-- product was present during last event
case when et.event_time is null then
-- product is no longer present
'Removed'
else
-- product is still present
null   -- no change
end
end message
from event_table et partition by (product) 
right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from sparse_data_filled
-- filter out the non-changes
where message is not null
order by event_time, product
;
+------------+---------+---------+
| EVENT_TIME | PRODUCT | MESSAGE |
+------------+---------+---------+
| 07-AUG-20  | X1      | Added   |
| 07-AUG-20  | X2      | Added   |
| 08-AUG-20  | X2      | Removed |
| 08-AUG-20  | X3      | Added   |
| 16-AUG-20  | X1      | Removed |
| 16-AUG-20  | X3      | Removed |
| 16-AUG-20  | X4      | Added   |
+------------+---------+---------+

A more compact, solution-only version (no test data):

WITH 
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
changes as (
select dt.event_time, et.product,
case nvl2(et.event_time,1,0) - nvl2(lag(et.event_time ) over ( partition by et.product order by dt.event_time ),1,0)
when +1 then 'Added'
when -1 then 'Removed'
end message
from event_table et partition by (product) 
right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from changes
where message is not null
order by event_time, product
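
The presence-flag arithmetic in the compact query can be traced outside the database. The Python sketch below is an illustration only, with `changes_by_presence_flag` as a hypothetical helper name: it builds every product and event time combination, as the partitioned outer join does, then subtracts the previous presence flag from the current one.

```python
def changes_by_presence_flag(rows):
    """Return (event_time, product, message) rows where presence changed."""
    times = sorted({t for _, t in rows})
    products = sorted({p for p, _ in rows})
    present = set(rows)
    labels = {1: "Added", -1: "Removed"}

    changes = []
    for product in products:
        previous_flag = 0  # before the first event the product counts as absent
        for event_time in times:
            # 1 if the product has a row at this event time, otherwise 0
            flag = 1 if (product, event_time) in present else 0
            delta = flag - previous_flag
            if delta in labels:  # a delta of 0 means no change
                changes.append((event_time, product, labels[delta]))
            previous_flag = flag
    return sorted(changes)


rows = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]
for change in changes_by_presence_flag(rows):
    print(change)
```

A difference of +1 corresponds to 'Added' and -1 to 'Removed'; 0 covers both "still present" and "still absent", which the query filters out with `message is not null`.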
