我有一个表格,如下所示。记录按user_id和event_time排序。
Row User_ID Event_Time Event_Type
1 1 2020-01-01 View
2 1 2020-01-02 Click
3 1 2020-01-03 Purchase
4 2 2020-02-01 View
5 2 2020-02-02 Click
6 2 2020-02-03 View
7 2 2020-02-04 Purchase
8 2 2020-02-11 View
9 2 2020-02-12 Purchase
10 2 2020-02-21 View
11 2 2020-02-22 Click
12 2 2020-02-23 Purchase
13 2 2020-02-27 View
14 2 2020-02-28 Click
15 3 2020-03-01 View
16 3 2020-03-02 Purchase
...
我想添加一个名为 Path 的新列来对非购买事件进行分类。用户的每个非购买事件都"属于"紧随其后的同一用户的立即购买事件,这意味着他们可以被视为一个子组。在每个子组中:
- 第一个非购买事件是介绍人(第 1、4、10 行(
- 最后一个非购买事件是"更近">(第 2、6、11 行(
- 介绍人和近距离之间的所有非购买事件都是影响者(第 5 行( 如果购买事件只有一个非购买事件与其
- 分组,则非购买事件仅为(第 8 行、第 15行(
- 在购买事件中填充 NULL(第 3、7、9、12、16 行(
- 如果非购买事件不属于任何购买事件,则填写 NULL(第 13、14 行(
因此,添加列后,表格应如下所示:
Row User_ID Event_Time Event_Type Path
1 1 2020-01-01 View Introducer
2 1 2020-01-02 Click Closer
3 1 2020-01-03 Purchase NULL
4 2 2020-02-01 View Introducer
5 2 2020-02-02 Click Influencer
6 2 2020-02-03 View Closer
7 2 2020-02-04 Purchase NULL
8 2 2020-02-11 View Only
9 2 2020-02-12 Purchase NULL
10 2 2020-02-21 View Introducer
11 2 2020-02-22 Click Closer
12 2 2020-02-23 Purchase NULL
13 2 2020-02-27 View NULL
14 2 2020-02-28 Click NULL
15 3 2020-03-01 View Only
16 3 2020-03-02 Purchase NULL
...
如果我进行自加入并添加一个新列来帮助确定用户上次购买每个事件的时间,则解决方案很容易。但是,我有超过 1 亿条记录,自加入效率不够高。执行最终会超时。所以我的问题是,有没有更有效的方法来添加这个新列?我正在考虑使用相关查询,但似乎无法理解它。
这遵循与 Nick 类似的方法,但我认为逻辑更简单:
WITH e AS (
SELECT e.*,
SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
(PARTITION BY User_ID ORDER BY Event_Time DESC) AS grp
FROM events e
),
en as (
SELECT e.*,
COUNT(*) OVER (PARTITION BY user_id, grp) as cnt,
ROW_NUMBER() OVER (PARTITION BY user_id, grp ORDER BY Event_Time) as seqnum
FROM e
)
SELECT en.*,
(CASE WHEN grp = 0 -- no purchase event
THEN NULL
WHEN Event_Type = 'Purchase' -- the event itself
THEN NULL
WHEN seqnum = 1 AND cnt = 2 -- the special case of "ONLY"
THEN 'Only'
WHEN seqnum = 1 -- The first event
THEN 'Introducer'
WHEN seqnum = cnt - 1 -- The penultimate event
THEN 'Closer'
ELSE 'Influencer'
END) as Path
FROM en
ORDER BY User_ID, Event_Time;
特别是,外部查询中的子查询是不必要的。grp = 0
查找可能没有购买的最后一组事件。 我还认为根据事件总数和顺序计数器编写逻辑更容易。
这是一个数据库<>小提琴。
如果您使用的是支持窗口函数的 DBMS,则可以使用几个 CTE 首先将行拆分为不同的购买,然后找到相对于每个购买的行号,最后根据您给出的条件计算Path
:
WITH purchases AS (
SELECT "Row", User_ID, Event_Time, Event_Type,
COALESCE(SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
(PARTITION BY User_ID ORDER BY Event_Time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pnum
FROM events
),
prows AS (
SELECT "Row", User_ID, Event_Time, Event_Type, pnum,
ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time) AS rn,
ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time DESC) AS drn
FROM purchases
)
SELECT "Row", User_ID, Event_Time, Event_Type,
CASE WHEN Event_Type = 'Purchase' OR
NOT EXISTS (SELECT *
FROM prows r2
WHERE r2.User_ID = r1.User_ID
AND r2.pnum = r1.pnum
AND r2.Event_Type = 'Purchase') THEN NULL
WHEN rn = 1 AND drn = 2 THEN 'Only'
WHEN rn = 1 THEN 'Introducer'
WHEN drn = 2 THEN 'Closer'
ELSE 'Influencer'
END AS Path
FROM prows r1
ORDER BY User_ID, Event_Time
输出:
Row User_ID Event_Time Event_Type Path
1 1 2020-01-01 View Introducer
2 1 2020-01-02 Click Closer
3 1 2020-01-03 Purchase (null)
4 2 2020-02-01 View Introducer
5 2 2020-02-02 Click Influencer
6 2 2020-02-03 View Closer
7 2 2020-02-04 Purchase (null)
8 2 2020-02-11 View Only
9 2 2020-02-12 Purchase (null)
10 2 2020-02-21 View Introducer
11 2 2020-02-22 Click Closer
12 2 2020-02-23 Purchase (null)
13 2 2020-02-27 View (null)
14 2 2020-02-28 Click (null)
15 3 2020-03-01 View Only
16 3 2020-03-02 Purchase (null)
SQL Server demo on SQLFiddle.同样的查询也将在PostgreSQL和Oracle上运行。