SQL-统计具有多个事务并且至少有一个事务在另一个事务的7天间隔内完成的用户



数据集任务如下:统计拥有多个事务并且至少有一个事务在7天内完成的用户。

数据集结构:行、用户ID、订单ID、日期

日期格式为YYYY-MM-DDTHH:MM:SS示例:2016-09-16T11:32:06

我已经完成了第一部分(计算有多个事务的用户(,但我不知道如何在同一查询中完成第二部分。我将感谢你的帮助。

这是控制台:

query = '''
SELECT COUNT(*) 
FROM 
(SELECT userId FROM `dataset` GROUP BY userId HAVING COUNT(orderId) > 1)
'''
project_id = 'acdefg'
df = pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')
display(df)

要解决这个问题,您需要能够将每个记录与前一个记录进行比较:同一用户的最后一个订单是什么时候。这提示了分区和窗口函数的使用,在本例中为LAG

解决该问题的一种可能方法是按用户组织记录按orderDate排序,然后查看上面的记录:

WITH intermediate_table AS (
SELECT
userId,
orderDate,
LAG(orderDate) 
OVER (PARTITION BY userId ORDER BY orderDate) -- this is where we pick the orderDate of the record right above, once the orders are organized by userId and ordered by orderDate
FROM `dataset.table`
)

SELECT userId
FROM intermediate_table
WHERE DATE_DIFF(orderDate, previous_order, DAY) <= 7
GROUP BY userId

一旦orderDateprevious_order信息被收集到同一记录中,就很容易对它们进行比较,看看两者之间的时间是否少于7天。

(GROUP BY用于在结果表中只返回一次userId(

这可能是您所需要的:

-- for each order calculate the days since that customer's last order
order_profiler AS (
SELECT
orderId,
orderDate,
custId,
DATE_DIFF(orderDate, LAG(orderDate) OVER (PARTITION BY custId ORDER BY orderDate), day) AS order_latency_days,
FROM
`dataset.table` 
)
SELECT 
custId,
FROM order_profiler
WHERE order_latency_days <= 7
GROUP BY custId

最新更新