SQL使用LAG不断查找值,直到满足条件为止



从本质上讲,我的表中有我所称的浮动行。这些行具有type = mod,并且它们与表中的customer_id相关联,但不与另一个transaction_id相关联。我需要他们与另一个transaction_id联系在一起,而不仅仅是自己闲逛。因此,我想创建一个映射表,该表提供浮动的transaction_id值和与其关联的previous_transaction_id。我正在使用LAG来获得previous_transaction_id,这适用于某些场景,但不是所有场景。从本质上讲,我想告诉LAG——";嘿,LAG,如果你找到的值是这些浮动的transaction_id值之一,那么继续后退,直到你找到一个不是"的值;但我不知道该怎么做。

我正在使用Snowflake,如果有更好的方法的话,我可以使用jinja和dbt。

以下是我所拥有的各种场景和当前查询的一些模拟数据:

with data as (
select 
transaction_id,
customer_id,
transaction_date,
amount,
type
from (values 
(1, 'a','03/24/2022'::date, 10, 'cat'),
(1, 'a','03/24/2022'::date, 15, 'dog'),
(1, 'a','03/24/2022'::date, 20, 'mouse'),
(1, 'a','03/24/2022'::date, 30, 'rabbit'),
(1, 'a','03/24/2022'::date, 5, 'squirrel'),
(2, 'a','03/24/2022'::date, 4, 'mod'), -- floater
(3, 'b','05/20/2022'::date, 100, 'cat'),
(3, 'b','05/20/2022'::date, 150, 'dog'),
(3, 'b','05/20/2022'::date, 200, 'mouse'),
(3, 'b','05/20/2022'::date, 300, 'rabbit'),
(3, 'b','05/20/2022'::date, 50, 'squirrel'),
(4, 'b','07/20/2022'::date, 40, 'mod'), -- floater
(5, 'c','02/02/2020'::date, 100, 'cat'),
(5, 'c','02/02/2020'::date, 150, 'dog'),
(5, 'c','02/02/2020'::date, 200, 'mouse'),
(5, 'c','02/02/2020'::date, 300, 'rabbit'),
(6, 'c','08/01/2020'::date, 50, 'mod'), -- floater
(7, 'c','12/25/2020'::date, 40, 'mod'), -- floater
(8, 'd','01/15/2021'::date, 10, 'cat'),
(8, 'd','01/15/2021'::date, 15, 'dog'),
(8, 'd','01/15/2021'::date, 20, 'mouse'),
(8, 'd','01/15/2021'::date, 30, 'rabbit'),
(8, 'd','01/15/2021'::date, 5, 'squirrel'),
(8, 'd','01/15/2021'::date, 4, 'mod'),
(9, 'e','02/10/2020'::date, 100, 'cat'),
(9, 'e','02/10/2020'::date, 150, 'dog'),
(9, 'e','02/10/2020'::date, 200, 'mouse'),
(9, 'e','02/10/2020'::date, 300, 'rabbit'),
(10, 'e','08/17/2020'::date, 50, 'mod'), -- floater
(11, 'e','12/15/2020'::date, 40, 'mod'), -- floater
(12, 'e','02/14/2021'::date, 40, 'mod'), -- floater
(13, 'c','04/09/2022'::date, 0, 'mouse'),
(13, 'c','04/09/2022'::date, 0, 'rabbit'),
(13, 'c','04/09/2022'::date, 50, 'mod') -- floater because other values for transaction_id sum to 0
) as tbl (transaction_id, customer_id, transaction_date, amount, type)
),
previous_transaction_id as (
select 
transaction_id,
customer_id,
lag(transaction_id, 1, null) over (partition by customer_id order by transaction_date) as previous_transaction_id
from data
qualify transaction_id != previous_transaction_id
),
floating_mods as (
select 
transaction_id,
sum(iff(type = 'mod', amount, 0)) as mod_amount,
sum(amount) - mod_amount as non_mod_amount
from data
group by 1
having non_mod_amount = 0
)
select 
gp.transaction_id,
gp.previous_transaction_id
from previous_transaction_id gp
inner join floating_mods fm on gp.transaction_id = fm.transaction_id
order by gp.transaction_id

这是查询的输出:

TRANSACTION_IDPREVIUS_TRANACTION_ID
21
43
65
76
109
1110
1211
137

感谢Greg,我能够得到我需要的东西:

with data as (
select 
transaction_id,
customer_id,
transaction_date,
amount,
type
from (values 
(1, 'a','03/24/2022'::date, 10, 'cat'),
(1, 'a','03/24/2022'::date, 15, 'dog'),
(1, 'a','03/24/2022'::date, 20, 'mouse'),
(1, 'a','03/24/2022'::date, 30, 'rabbit'),
(1, 'a','03/24/2022'::date, 5, 'squirrel'),
(2, 'a','03/24/2022'::date, 4, 'mod'),
(3, 'b','05/20/2022'::date, 100, 'cat'),
(3, 'b','05/20/2022'::date, 150, 'dog'),
(3, 'b','05/20/2022'::date, 200, 'mouse'),
(3, 'b','05/20/2022'::date, 300, 'rabbit'),
(3, 'b','05/20/2022'::date, 50, 'squirrel'),
(4, 'b','07/20/2022'::date, 40, 'mod'),
(5, 'c','02/02/2020'::date, 100, 'cat'),
(5, 'c','02/02/2020'::date, 150, 'dog'),
(5, 'c','02/02/2020'::date, 200, 'mouse'),
(5, 'c','02/02/2020'::date, 300, 'rabbit'),
(6, 'c','08/01/2020'::date, 50, 'mod'),
(7, 'c','12/25/2020'::date, 40, 'mod'),
(8, 'd','01/15/2021'::date, 10, 'cat'),
(8, 'd','01/15/2021'::date, 15, 'dog'),
(8, 'd','01/15/2021'::date, 20, 'mouse'),
(8, 'd','01/15/2021'::date, 30, 'rabbit'),
(8, 'd','01/15/2021'::date, 5, 'squirrel'),
(8, 'd','01/15/2021'::date, 4, 'mod'),
(9, 'e','02/10/2020'::date, 100, 'cat'),
(9, 'e','02/10/2020'::date, 150, 'dog'),
(9, 'e','02/10/2020'::date, 200, 'mouse'),
(9, 'e','02/10/2020'::date, 300, 'rabbit'),
(10, 'e','08/17/2020'::date, 50, 'mod'),
(11, 'e','12/15/2020'::date, 40, 'mod'),
(12, 'e','02/14/2021'::date, 40, 'mod'),
(13, 'c','04/09/2022'::date, 0, 'mouse'),
(13, 'c','04/09/2022'::date, 0, 'rabbit'),
(13, 'c','04/09/2022'::date, 50, 'mod'),
(14, 'c','05/09/2022'::date, 0, 'mod')
) as tbl (transaction_id, customer_id, transaction_date, amount, type)
),
floating_mods as (
select 
transaction_id,
sum(iff(type = 'mod', amount, 0)) as mod_amount,
sum(amount) - mod_amount as non_mod_amount
from data
group by 1
having non_mod_amount = 0
),
previous_transaction_id as (
select 
data.transaction_id,
data.customer_id,
lag(iff(floating_mods.transaction_id is not null, null, data.transaction_id), 1, null) ignore nulls over (partition by data.customer_id order by data.transaction_date) as previous_transaction_id
from data
left join floating_mods on data.transaction_id = floating_mods.transaction_id
qualify data.transaction_id != previous_transaction_id
)
select 
gp.transaction_id,
gp.previous_transaction_id
from previous_transaction_id gp
inner join floating_mods fm on gp.transaction_id = fm.transaction_id
order by gp.transaction_id

我不确定这是否准确地捕捉到了规则,但它应该足够接近,可以在必要时进行调整。LAG函数应该忽略null,并且可以使用IFF函数或CASE WHEN块指定LAG函数的目的是什么。

with data as (
select 
transaction_id,
customer_id,
transaction_date,
amount,
type
from (values 
(1, 'a','03/24/2022'::date, 10, 'cat'),
(1, 'a','03/24/2022'::date, 15, 'dog'),
(1, 'a','03/24/2022'::date, 20, 'mouse'),
(1, 'a','03/24/2022'::date, 30, 'rabbit'),
(1, 'a','03/24/2022'::date, 5, 'squirrel'),
(2, 'a','03/24/2022'::date, 4, 'mod'), -- floater
(3, 'b','05/20/2022'::date, 100, 'cat'),
(3, 'b','05/20/2022'::date, 150, 'dog'),
(3, 'b','05/20/2022'::date, 200, 'mouse'),
(3, 'b','05/20/2022'::date, 300, 'rabbit'),
(3, 'b','05/20/2022'::date, 50, 'squirrel'),
(4, 'b','07/20/2022'::date, 40, 'mod'), -- floater
(5, 'c','02/02/2020'::date, 100, 'cat'),
(5, 'c','02/02/2020'::date, 150, 'dog'),
(5, 'c','02/02/2020'::date, 200, 'mouse'),
(5, 'c','02/02/2020'::date, 300, 'rabbit'),
(6, 'c','08/01/2020'::date, 50, 'mod'), -- floater
(7, 'c','12/25/2020'::date, 40, 'mod'), -- floater
(8, 'd','01/15/2021'::date, 10, 'cat'),
(8, 'd','01/15/2021'::date, 15, 'dog'),
(8, 'd','01/15/2021'::date, 20, 'mouse'),
(8, 'd','01/15/2021'::date, 30, 'rabbit'),
(8, 'd','01/15/2021'::date, 5, 'squirrel'),
(8, 'd','01/15/2021'::date, 4, 'mod'),
(9, 'e','02/10/2020'::date, 100, 'cat'),
(9, 'e','02/10/2020'::date, 150, 'dog'),
(9, 'e','02/10/2020'::date, 200, 'mouse'),
(9, 'e','02/10/2020'::date, 300, 'rabbit'),
(10, 'e','08/17/2020'::date, 50, 'mod'), -- floater
(11, 'e','12/15/2020'::date, 40, 'mod'), -- floater
(12, 'e','02/14/2021'::date, 40, 'mod'), -- floater
(13, 'c','04/09/2022'::date, 0, 'mouse'),
(13, 'c','04/09/2022'::date, 0, 'rabbit'),
(13, 'c','04/09/2022'::date, 50, 'mod') -- floater because other values for transaction_id sum to 0
) as tbl (transaction_id, customer_id, transaction_date, amount, type)
), LAGGED as
(
select   TRANSACTION_ID
,TYPE
,lag(iff(type = 'mod' or AMOUNT = 0, null, TRANSACTION_ID)) ignore nulls over (order by TRANSACTION_ID) PREVIOUS_TRANSACTION_ID
from data
)
select TRANSACTION_ID, PREVIOUS_TRANSACTION_ID 
from LAGGED
where type = 'mod' and TRANSACTION_ID <> PREVIOUS_TRANSACTION_ID
;
TRANSACTION_ID
21
43
65
75
109
119
129
139

这个例子很复杂,但据我所知:

  • 您有一个表,其中每个记录都是一个事务
  • 一个事务可以与另一个事务配对
  • 并非所有交易都配对
  • 您希望将每个未配对的事务与之前的第一个配对事务配对
  • 并不是每个未配对的事务前面都有一个配对的事务

我们可以把这个问题重述为";将每个记录与它之前满足某个条件的最后一个记录配对;。这是一个联接(带有一个涉及行号的谓词(。

与其使用复杂的LAG查询,不如这样处理:

  • 将行号分配给具有row_number的列
  • 只接受带有WHERE的成对事务,并将它们放入CTE
  • 只接受未配对的交易并将其放入CTE
  • 将每一个未配对的行与前面每一个配对的行匹配:select unpaired.id as x, paired.id as y, paired.row_num as y_n ... from paired join unpaired on paired.row_num < unpaired.row_num
  • 对于每个x,只保留具有最高y_n的行SQL:为每个唯一id获取最后一行

您提到了jinja,它是一个Python模板框架。它对于动态生成查询最有用。我不会尝试使用它,因为使用静态查询可以很容易地解决您的问题。动态生成只会增加不必要的复杂性。

最新更新