数据集任务如下:统计拥有多个事务并且至少有一个事务在7天内完成的用户。
数据集结构:行、用户ID、订单ID、日期
日期格式为YYYY-MM-DDTHH:MM:SS示例:2016-09-16T11:32:06
我已经完成了第一部分(计算有多个事务的用户(,但我不知道如何在同一查询中完成第二部分。我将感谢你的帮助。
这是控制台:
query = '''
SELECT COUNT(*)
FROM
(SELECT userId FROM `dataset` GROUP BY userId HAVING COUNT(orderId) > 1)
'''
project_id = 'acdefg'
df = pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')
display(df)
要解决这个问题,您需要能够将每个记录与前一个记录进行比较:同一用户的最后一个订单是什么时候。这提示了分区和窗口函数的使用,在本例中为LAG
。
解决该问题的一种可能方法是按用户组织记录,按orderDate排序,然后查看上面的记录:
WITH intermediate_table AS (
SELECT
userId,
orderDate,
LAG(orderDate)
OVER (PARTITION BY userId ORDER BY orderDate) -- this is where we pick the orderDate of the record right above, once the orders are organized by userId and ordered by orderDate
FROM `dataset.table`
)
SELECT userId
FROM intermediate_table
WHERE DATE_DIFF(orderDate, previous_order, DAY) <= 7
GROUP BY userId
一旦orderDate
和previous_order
信息被收集到同一记录中,就很容易对它们进行比较,看看两者之间的时间是否少于7天。
(GROUP BY用于在结果表中只返回一次userId(
这可能是您所需要的:
-- for each order calculate the days since that customer's last order
order_profiler AS (
SELECT
orderId,
orderDate,
custId,
DATE_DIFF(orderDate, LAG(orderDate) OVER (PARTITION BY custId ORDER BY orderDate), day) AS order_latency_days,
FROM
`dataset.table`
)
SELECT
custId,
FROM order_profiler
WHERE order_latency_days <= 7
GROUP BY custId