我有一个从 1/2016 到 2/2017 的数据集,用于用户、月份、家庭位置、访问位置和访问计数。用户每月可以访问多个位置,也可以不访问任何位置。数据集每月包含多行,因为用户可以访问不同的位置。
这是数据,然后我将引导您完成下面的Rextester链接。
| user_id | year_month | homelocid | visitlocid | fact_count |
|---------|------------|-----------|------------|------------|
| 1234 | 201601 | L9900 | L9987 | 8 |
| 1234 | 201601 | L9900 | L4456 | 1 |
| 1234 | 201602 | L9900 | L9987 | 5 |
| 1234 | 201603 | L9900 | L9987 | 1 |
| 1234 | 201604 | L9900 | L9987 | 4 |
| 1234 | 201605 | L9900 | L9987 | 3 |
| 1234 | 201605 | L9900 | L4456 | 3 |
| 1234 | 201605 | L9900 | L5567 | 26 |
| 1234 | 201606 | L9900 | L4456 | 1 |
| 1234 | 201606 | L9900 | L9987 | 21 |
| 1234 | 201606 | L9900 | L6678 | 2 |
| 1234 | 201606 | L9900 | L5567 | 1 |
| 1234 | 201607 | L9900 | L9987 | 6 |
| 1234 | 201607 | L9900 | L5567 | 22 |
| 1234 | 201607 | L9900 | L6678 | 25 |
| 1234 | 201608 | NULL | NULL | NULL |
| 1234 | 201609 | L9900 | L9987 | 1 |
| 1234 | 201609 | L9900 | L4456 | 3 |
| 1234 | 201610 | L9900 | L9987 | 1 |
| 1234 | 201611 | NULL | NULL | NULL |
| 1234 | 201612 | NULL | NULL | NULL |
| 1234 | 201701 | NULL | NULL | NULL |
| 1234 | 201702 | L9900 | L9987 | 6 |
| 1234 | 201702 | L9900 | L2334 | 1 |
| 5556 | 201601 | L9900 | L3445 | 1 |
| 5556 | 201602 | L9900 | L3445 | 3 |
| 5556 | 201603 | L9900 | L3445 | 1 |
| 5556 | 201604 | L9900 | L3445 | 21 |
| 5556 | 201605 | L9900 | L3445 | 1 |
| 5556 | 201606 | L9900 | L3445 | 2 |
| 5556 | 201607 | NULL | NULL | NULL |
| 5556 | 201608 | NULL | NULL | NULL |
| 5556 | 201609 | L9900 | L3445 | 1 |
| 5556 | 201610 | NULL | NULL | NULL |
| 5556 | 201611 | NULL | NULL | NULL |
| 5556 | 201612 | L9900 | L3445 | 1 |
| 5556 | 201701 | NULL | NULL | NULL |
| 5556 | 201702 | L9900 | L3445 | 1 |
| 9988 | 201601 | L9900 | L2334 | 27 |
| 9988 | 201602 | L9900 | L3445 | 22 |
| 9988 | 201602 | L9900 | L9987 | 1 |
| 9988 | 201602 | L9900 | L2334 | 2 |
| 9988 | 201603 | L9900 | L4556 | 1 |
| 9988 | 201603 | L9900 | L2334 | 2 |
| 9988 | 201603 | L9900 | L3445 | 21 |
| 9988 | 201603 | L9900 | L9987 | 1 |
| 9988 | 201604 | L9900 | L3445 | 25 |
| 9988 | 201604 | L9900 | L2334 | 1 |
| 9988 | 201604 | L9900 | L4556 | 1 |
| 9988 | 201605 | L9900 | L3445 | 26 |
| 9988 | 201605 | L9900 | L2334 | 1 |
| 9988 | 201606 | L9900 | L3445 | 22 |
| 9988 | 201607 | L9900 | L3445 | 5 |
| 9988 | 201608 | L9900 | L3445 | 6 |
| 9988 | 201609 | L9900 | L3445 | 11 |
| 9988 | 201610 | L9900 | L3445 | 17 |
| 9988 | 201611 | L9900 | L3445 | 9 |
| 9988 | 201612 | L9900 | L2334 | 1 |
| 9988 | 201612 | L9900 | L3445 | 14 |
| 9988 | 201701 | L9900 | L3445 | 14 |
| 9988 | 201701 | L9900 | L2334 | 2 |
| 9988 | 201702 | L9900 | L4556 | 1 |
| 9988 | 201702 | L9900 | L3445 | 9 |
我正在尝试计算用户拥有的次数:
- 1个月内超过20 fact_count
- 连续超过 20 fact_count 2 个月
- 连续超过 20 fact_count 3 个月
我还需要从两个层面总结数据:
- 在user_id级别(分母 = # 用户,分子 = # 连续 1、2 和 3 个月fact_count 20 个>用户)
- 在user_id和月级别(分母 = 每个用户的 # 个月 (14),分子 = #>连续 1、2 和 3 fact_count 20 的用户)
这是我从数据集中寻找的所需结果(注意:不是查询输出,只是结果):
+-------------------+------------+------------------+
| | User level | User-month level |
+-------------------+------------+------------------+
| 1 month | 3/3 | 10/42 |
+-------------------+------------+------------------+
| 2 months in a row | 2/3 | 4/42 |
+-------------------+------------+------------------+
| 3 months in a row | 2/3 | 3/42 |
+-------------------+------------+------------------+
以下是我转换数据以正确总结数据的方法,您可以在此Rextester链接中找到该方法:http://rextester.com/DBF40287
- mem_month CTE 使用 CASE 根据是否超过 fact_count>20 创建 1 或 0 作为"MET_NOTMET"字段。此时,有多行反映不同的访问位置 ID。
- mem_month_rank CTE 使用 RANK 通过对 mem_month 中的 CASE 语句进行排名来查找用户的fact_count是否为>20。如果用户在一个月内访问了三个位置,其中只有一个位置访问了 fact_count>20,这将识别违规行。
下面是这些查询的输出,更接近规范化表:
| mem_month | | mem_month_rank |
|--------|------------|------------|---|----------------------------------------|
| user_id| year_month | met_notmet | | user_id| year_month | met_notmet | rnk |
| 1234 | 201601 | 0 | | 1234 | 201601 | 0 | 1 |
| 1234 | 201601 | 0 | | 1234 | 201602 | 0 | 1 |
| 1234 | 201602 | 0 | | 1234 | 201603 | 0 | 1 |
| 1234 | 201603 | 0 | | 1234 | 201604 | 0 | 1 |
| 1234 | 201604 | 0 | | 1234 | 201605 | 0 | 1 |
| 1234 | 201605 | 0 | | 1234 | 201605 | 1 | 2 |
| 1234 | 201605 | 1 | | 1234 | 201606 | 0 | 1 |
| 1234 | 201606 | 0 | | 1234 | 201606 | 1 | 2 |
| 1234 | 201606 | 1 | | 1234 | 201607 | 0 | 1 |
| 1234 | 201606 | 0 | | 1234 | 201607 | 1 | 2 |
| 1234 | 201607 | 1 | | 1234 | 201608 | 0 | 1 |
| 1234 | 201607 | 0 | | 1234 | 201609 | 0 | 1 |
| 1234 | 201607 | 1 | | 1234 | 201610 | 0 | 1 |
| 1234 | 201608 | 0 | | 1234 | 201611 | 0 | 1 |
| 1234 | 201609 | 0 | | 1234 | 201612 | 0 | 1 |
| 1234 | 201609 | 0 | | 1234 | 201701 | 0 | 1 |
| 1234 | 201610 | 0 | | 1234 | 201702 | 0 | 1 |
| 1234 | 201611 | 0 |
| 1234 | 201612 | 0 |
| 1234 | 201701 | 0 |
| 1234 | 201702 | 0 |
- 接下来,mem_month_norm CTE 选择mem_month_rank,并在内部联接到用户、月份和 max(rank) 的子查询,创建一个每个用户和每月 1 行的规范化表,其中包含他们是否在该月的任何位置fact_count了违规>20 。
(列表分隔符)
| user_id | year_month | met_notmet |
|---------|------------|------------|
| 1234 | 201601 | 0 |
| 1234 | 201602 | 0 |
| 1234 | 201603 | 0 |
| 1234 | 201604 | 0 |
| 1234 | 201605 | 1 |
| 1234 | 201606 | 1 |
| 1234 | 201607 | 1 |
| 1234 | 201608 | 0 |
| 1234 | 201609 | 0 |
| 1234 | 201610 | 0 |
| 1234 | 201611 | 0 |
| 1234 | 201612 | 0 |
| 1234 | 201701 | 0 |
| 1234 | 201702 | 0 |
好的 - 这就是我需要你帮助的地方:我不知道计算连续违规值出现次数的最佳方法。我认为需要发生的是当用户连续 2 个月有违规值时进行某种"重置"。
因此,例如,在计算时:
- 连续 2 个月 - user_id 1234 应计算一次,因为在 201605 和 201606 上连续有两个违规的 MET_NOTMET 值。如果他在201608中有一个违规值,我认为这种"重置"需要发生,以便他将连续 2 个月(从 07 到 08)再次被计算在内。
- 连续 3 个月 - user_id 1234 应计算一次,用于从 (201605 到 201608 的三个MET_NOTMET违规值。同样,如果他从 201605 到 201610 连续有 6 个违规值,我认为会有一个"重置",将每组 3 个违规值计为 1 次违规。
同样,链接是:http://rextester.com/DBF40287
如果您需要任何澄清或其他信息,请告诉我,我将修改帖子。 谢谢!
这是一个非常复杂的问题,不确定我是否理解它,但这是我对情况理解的第一个想法。
此查询应提供汇总数据级别 1 的结果:
WITH mem_months AS (
SELECT user_id, year_month, sum(fact_count) AS visits1
FROM tbl
GROUP BY 1,2
),
mem_consec AS (
SELECT user_id, year_month, visits1, visits2, visits3,
visits1 > 20 AS met1,
-- need case on these to account for nulls in the lag
CASE WHEN visits1 > 20 AND visits2 > 20 THEN TRUE ELSE FALSE END AS met2,
CASE WHEN visits1 > 20 AND visits2 > 20 AND visits3 > 20 THEN TRUE ELSE FALSE END AS met3
FROM (
SELECT user_id, year_month, visits1,
-- Seems like you might want lag instead of lead to the current "month".
-- Where 3 in a row was met, the attributed month
-- would be the 3rd in the sequence.
LAG(visits1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS visits2,
LAG(visits1, 2) OVER (PARTITION BY user_id ORDER BY year_month) AS visits3
FROM mem_months
) t
),
mem_compliance AS (
select user_id, bool_or(met1) as any1, bool_or(met2) as any2, bool_or(met3) as any3
from mem_consec
group by 1
)
select count(case when any1 then 1 else 0 end) / count(*) as user_level_1,
count(case when any2 then 1 else 0 end) / count(*) as user_level_2,
count(case when any3 then 1 else 0 end) / count(*) as user_level_3
from mem_compliance
并且,使用相同的 CTE,这将提供汇总数据级别 2:
select sum(case when met1 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_1,
sum(case when met2 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_2,
sum(case when met3 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_3
from mem_consec
在这一点上,我想我明白你的实际问题是什么......在level_2和level_3版本中,随着连续几个月的继续,计数会增加。对于每个 3 行,将计算两个 2 合一行。例如,连续四个,你会得到两个 3 和三个 2。
这是一种获取非重叠总和的方法。一定是一种简化它的方法,但也许这是一个开始。这个想法是寻找满足月份的运行的开始和结束,然后获得每次运行的长度并将其除以所需的设置大小(2 或 3)。未满足的记录、中间的满足记录和单个满足记录将被丢弃。
mem_consec_sums AS (
SELECT user_id, sum(mon2s) AS mon2s, sum(mon3s) AS mon3s
FROM (
SELECT user_id,
((end_idx - start_idx)/2)::INT AS mon2s,
((end_idx - start_idx)/3)::INT AS mon3s
FROM (
SELECT user_id, year_month, met_start, idx AS start_idx,
LEAD(idx, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS end_idx
FROM (
SELECT user_id, year_month, met1, idx,
CASE WHEN met1 AND pre_met IS NULL OR NOT pre_met) THEN TRUE ELSE FALSE END AS met_start,
CASE WHEN met1 AND post_met IS NULL OR NOT post_met) THEN TRUE ELSE FALSE END AS met_end
FROM (
SELECT user_id, year_month, met1,
LAG(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS pre_met,
LEAD(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS post_met,
ROW_NUMBER() OVER (PARTITION BY tvid ORDER BY year_month) AS idx
FROM mem_consec
)
) WHERE (met_start OR met_end) AND NOT (met_start AND met_end)
) WHERE met_start
) GROUP BY 1
)