如何查找和计算满足序列的出现次数?



我有一个从 1/2016 到 2/2017 的数据集,用于用户、月份、家庭位置、访问位置和访问计数。用户每月可以访问多个位置,也可以不访问任何位置。数据集每月包含多行,因为用户可以访问不同的位置。

这是数据,然后我将引导您完成下面的Rextester链接。

| user_id | year_month | homelocid | visitlocid | fact_count |
|---------|------------|-----------|------------|------------|
| 1234    | 201601     | L9900     | L9987      | 8          |
| 1234    | 201601     | L9900     | L4456      | 1          |
| 1234    | 201602     | L9900     | L9987      | 5          |
| 1234    | 201603     | L9900     | L9987      | 1          |
| 1234    | 201604     | L9900     | L9987      | 4          |
| 1234    | 201605     | L9900     | L9987      | 3          |
| 1234    | 201605     | L9900     | L4456      | 3          |
| 1234    | 201605     | L9900     | L5567      | 26         |
| 1234    | 201606     | L9900     | L4456      | 1          |
| 1234    | 201606     | L9900     | L9987      | 21         |
| 1234    | 201606     | L9900     | L6678      | 2          |
| 1234    | 201606     | L9900     | L5567      | 1          |
| 1234    | 201607     | L9900     | L9987      | 6          |
| 1234    | 201607     | L9900     | L5567      | 22         |
| 1234    | 201607     | L9900     | L6678      | 25         |
| 1234    | 201608     | NULL      | NULL       | NULL       |
| 1234    | 201609     | L9900     | L9987      | 1          |
| 1234    | 201609     | L9900     | L4456      | 3          |
| 1234    | 201610     | L9900     | L9987      | 1          |
| 1234    | 201611     | NULL      | NULL       | NULL       |
| 1234    | 201612     | NULL      | NULL       | NULL       |
| 1234    | 201701     | NULL      | NULL       | NULL       |
| 1234    | 201702     | L9900     | L9987      | 6          |
| 1234    | 201702     | L9900     | L2334      | 1          |
| 5556    | 201601     | L9900     | L3445      | 1          |
| 5556    | 201602     | L9900     | L3445      | 3          |
| 5556    | 201603     | L9900     | L3445      | 1          |
| 5556    | 201604     | L9900     | L3445      | 21         |
| 5556    | 201605     | L9900     | L3445      | 1          |
| 5556    | 201606     | L9900     | L3445      | 2          |
| 5556    | 201607     | NULL      | NULL       | NULL       |
| 5556    | 201608     | NULL      | NULL       | NULL       |
| 5556    | 201609     | L9900     | L3445      | 1          |
| 5556    | 201610     | NULL      | NULL       | NULL       |
| 5556    | 201611     | NULL      | NULL       | NULL       |
| 5556    | 201612     | L9900     | L3445      | 1          |
| 5556    | 201701     | NULL      | NULL       | NULL       |
| 5556    | 201702     | L9900     | L3445      | 1          |
| 9988    | 201601     | L9900     | L2334      | 27         |
| 9988    | 201602     | L9900     | L3445      | 22         |
| 9988    | 201602     | L9900     | L9987      | 1          |
| 9988    | 201602     | L9900     | L2334      | 2          |
| 9988    | 201603     | L9900     | L4556      | 1          |
| 9988    | 201603     | L9900     | L2334      | 2          |
| 9988    | 201603     | L9900     | L3445      | 21         |
| 9988    | 201603     | L9900     | L9987      | 1          |
| 9988    | 201604     | L9900     | L3445      | 25         |
| 9988    | 201604     | L9900     | L2334      | 1          |
| 9988    | 201604     | L9900     | L4556      | 1          |
| 9988    | 201605     | L9900     | L3445      | 26         |
| 9988    | 201605     | L9900     | L2334      | 1          |
| 9988    | 201606     | L9900     | L3445      | 22         |
| 9988    | 201607     | L9900     | L3445      | 5          |
| 9988    | 201608     | L9900     | L3445      | 6          |
| 9988    | 201609     | L9900     | L3445      | 11         |
| 9988    | 201610     | L9900     | L3445      | 17         |
| 9988    | 201611     | L9900     | L3445      | 9          |
| 9988    | 201612     | L9900     | L2334      | 1          |
| 9988    | 201612     | L9900     | L3445      | 14         |
| 9988    | 201701     | L9900     | L3445      | 14         |
| 9988    | 201701     | L9900     | L2334      | 2          |
| 9988    | 201702     | L9900     | L4556      | 1          |
| 9988    | 201702     | L9900     | L3445      | 9          |

我正在尝试计算用户拥有的次数:

  1. 1个月内超过20 fact_count
  2. 连续超过 20 fact_count 2 个月
  3. 连续超过 20 fact_count 3 个月

我还需要从两个层面总结数据:

  1. 在user_id级别(分母 = # 用户,分子 = # 连续 1、2 和 3 个月fact_count 20 个>用户)
  2. 在user_id和月级别(分母 = 每个用户的 # 个月 (14),分子 = #>连续 1、2 和 3 fact_count 20 的用户)

这是我从数据集中寻找的所需结果(注意:不是查询输出,只是结果):

+-------------------+------------+------------------+
|                   | User level | User-month level |
+-------------------+------------+------------------+
| 1 month           | 3/3        | 10/42            |
+-------------------+------------+------------------+
| 2 months in a row | 2/3        | 4/42             |
+-------------------+------------+------------------+
| 3 months in a row | 2/3        | 3/42             |
+-------------------+------------+------------------+

以下是我转换数据以正确总结数据的方法,您可以在此Rextester链接中找到该方法:http://rextester.com/DBF40287

  1. mem_month CTE 使用 CASE 根据是否超过 fact_count>20 创建 1 或 0 作为"MET_NOTMET"字段。此时,有多行反映不同的访问位置 ID。
  2. mem_month_rank CTE 使用 RANK 通过对 mem_month 中的 CASE 语句进行排名来查找用户的fact_count是否为>20。如果用户在一个月内访问了三个位置,其中只有一个位置访问了 fact_count>20,这将识别违规行。

下面是这些查询的输出,更接近规范化表:

| mem_month                        |   | mem_month_rank                         |
|--------|------------|------------|---|----------------------------------------|
| user_id| year_month | met_notmet |   | user_id| year_month | met_notmet | rnk |
| 1234   | 201601     | 0          |   | 1234   | 201601     | 0          | 1   |
| 1234   | 201601     | 0          |   | 1234   | 201602     | 0          | 1   |
| 1234   | 201602     | 0          |   | 1234   | 201603     | 0          | 1   |
| 1234   | 201603     | 0          |   | 1234   | 201604     | 0          | 1   |
| 1234   | 201604     | 0          |   | 1234   | 201605     | 0          | 1   |
| 1234   | 201605     | 0          |   | 1234   | 201605     | 1          | 2   |
| 1234   | 201605     | 1          |   | 1234   | 201606     | 0          | 1   |
| 1234   | 201606     | 0          |   | 1234   | 201606     | 1          | 2   |
| 1234   | 201606     | 1          |   | 1234   | 201607     | 0          | 1   |
| 1234   | 201606     | 0          |   | 1234   | 201607     | 1          | 2   |
| 1234   | 201607     | 1          |   | 1234   | 201608     | 0          | 1   |
| 1234   | 201607     | 0          |   | 1234   | 201609     | 0          | 1   |
| 1234   | 201607     | 1          |   | 1234   | 201610     | 0          | 1   |
| 1234   | 201608     | 0          |   | 1234   | 201611     | 0          | 1   |
| 1234   | 201609     | 0          |   | 1234   | 201612     | 0          | 1   |
| 1234   | 201609     | 0          |   | 1234   | 201701     | 0          | 1   |
| 1234   | 201610     | 0          |   | 1234   | 201702     | 0          | 1   |
| 1234   | 201611     | 0          |
| 1234   | 201612     | 0          |
| 1234   | 201701     | 0          |
| 1234   | 201702     | 0          |
  1. 接下来,mem_month_norm CTE 选择mem_month_rank,并在内部联接到用户、月份和 max(rank) 的子查询,创建一个每个用户和每月 1 行的规范化表,其中包含他们是否在该月的任何位置fact_count了违规>20

(列表分隔符)

| user_id | year_month | met_notmet |
|---------|------------|------------|
| 1234    | 201601     | 0          |
| 1234    | 201602     | 0          |
| 1234    | 201603     | 0          |
| 1234    | 201604     | 0          |
| 1234    | 201605     | 1          |
| 1234    | 201606     | 1          |
| 1234    | 201607     | 1          |
| 1234    | 201608     | 0          |
| 1234    | 201609     | 0          |
| 1234    | 201610     | 0          |
| 1234    | 201611     | 0          |
| 1234    | 201612     | 0          |
| 1234    | 201701     | 0          |
| 1234    | 201702     | 0          |

好的 - 这就是我需要你帮助的地方:我不知道计算连续违规值出现次数的最佳方法。我认为需要发生的是当用户连续 2 个月有违规值时进行某种"重置"。

因此,例如,在计算时:

  1. 连续 2 个月 - user_id 1234 应计算一次,因为在 201605 和 201606 上连续有两个违规的 MET_NOTMET 值。如果他在201608中有一个违规值,我认为这种"重置"需要发生,以便他将连续 2 个月(从 07 到 08)再次被计算在内。
  2. 连续 3 个月 - user_id 1234 应计算一次,用于从 (201605 到 201608 的三个MET_NOTMET违规值。同样,如果他从 201605 到 201610 连续有 6 个违规值,我认为会有一个"重置",将每组 3 个违规值计为 1 次违规。

同样,链接是:http://rextester.com/DBF40287

如果您需要任何澄清或其他信息,请告诉我,我将修改帖子。 谢谢!

这是一个非常复杂的问题,不确定我是否理解它,但这是我对情况理解的第一个想法。

此查询应提供汇总数据级别 1 的结果:

WITH mem_months AS (
SELECT user_id, year_month, sum(fact_count) AS visits1
FROM tbl
GROUP BY 1,2
),
mem_consec AS (
SELECT user_id, year_month, visits1, visits2, visits3,
visits1 > 20 AS met1,
-- need case on these to account for nulls in the lag
CASE WHEN visits1 > 20 AND visits2 > 20 THEN TRUE ELSE FALSE END AS met2,
CASE WHEN visits1 > 20 AND visits2 > 20 AND visits3 > 20 THEN TRUE ELSE FALSE END AS met3
FROM (
SELECT user_id, year_month, visits1,
-- Seems like you might want lag instead of lead to the current "month".
-- Where 3 in a row was met, the attributed month
-- would be the 3rd in the sequence.
LAG(visits1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS visits2,
LAG(visits1, 2) OVER (PARTITION BY user_id ORDER BY year_month) AS visits3
FROM mem_months
) t
),
mem_compliance AS (
select user_id, bool_or(met1) as any1, bool_or(met2) as any2, bool_or(met3) as any3
from mem_consec
group by 1
)
select count(case when any1 then 1 else 0 end) / count(*) as user_level_1,
count(case when any2 then 1 else 0 end) / count(*) as user_level_2,
count(case when any3 then 1 else 0 end) / count(*) as user_level_3
from mem_compliance

并且,使用相同的 CTE,这将提供汇总数据级别 2:

select sum(case when met1 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_1,
sum(case when met2 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_2,
sum(case when met3 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_3
from mem_consec

在这一点上,我想我明白你的实际问题是什么......在level_2和level_3版本中,随着连续几个月的继续,计数会增加。对于每个 3 行,将计算两个 2 合一行。例如,连续四个,你会得到两个 3 和三个 2。

这是一种获取非重叠总和的方法。一定是一种简化它的方法,但也许这是一个开始。这个想法是寻找满足月份的运行的开始和结束,然后获得每次运行的长度并将其除以所需的设置大小(2 或 3)。未满足的记录、中间的满足记录和单个满足记录将被丢弃。

mem_consec_sums AS (
SELECT user_id, sum(mon2s) AS mon2s, sum(mon3s) AS mon3s
FROM (
SELECT user_id,
((end_idx - start_idx)/2)::INT AS mon2s,
((end_idx - start_idx)/3)::INT AS mon3s
FROM (
SELECT user_id, year_month, met_start, idx AS start_idx,
LEAD(idx, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS end_idx
FROM (
SELECT user_id, year_month, met1, idx,
CASE WHEN met1 AND pre_met IS NULL OR NOT pre_met) THEN TRUE ELSE FALSE END AS met_start,
CASE WHEN met1 AND post_met IS NULL OR NOT post_met) THEN TRUE ELSE FALSE END AS met_end
FROM (
SELECT user_id, year_month, met1,
LAG(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS pre_met,
LEAD(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS post_met,
ROW_NUMBER() OVER (PARTITION BY tvid ORDER BY year_month) AS idx
FROM mem_consec
)
) WHERE (met_start OR met_end) AND NOT (met_start AND met_end)
) WHERE met_start
) GROUP BY 1
)

相关内容

  • 没有找到相关文章

最新更新