I have imported data from a CSV file into RStudio using read.table. The type of the data is "list", and it looks like this:
Client | Goal1 | Goal2 | Time |
---|---|---|---|
123 | 0 | 1 | 9:00 |
123 | 1 | 0 | 9:15 |
234 | 1 | 0 | 9:12 |
234 | 0 | 1 | 9:30 |
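We can group the data by Client, filter each group with the required logic, and then count the remaining clients with summarise() and n_distinct(). It is important to convert the Time column to an hours:minutes format first; we can use lubridate::hm() for that.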
library(dplyr)
d %>%
mutate(Time = lubridate::hm(Time)) %>%
group_by(Client) %>%
filter(any(Goal2==1 & Time > Time[Goal1==1])) %>%
ungroup() %>%
summarise(n = n_distinct(Client))
# A tibble: 1 × 1
n
<int>
1 1
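To illustrate why the conversion matters, here is a minimal sketch. The two times are hypothetical values chosen only to show the pitfall, not rows from the question's data. Compared as character strings, the times are ordered lexicographically; compared as lubridate periods, they follow the clock.

```r
library(lubridate)

# Hypothetical times, chosen only to illustrate the pitfall
times <- c("9:30", "10:05")

# As character strings the comparison is lexicographic: "9" sorts after "1"
as_text <- times[1] > times[2]

# As lubridate periods the comparison follows the clock
as_period <- hm(times[1]) > hm(times[2])

as_text
#> [1] TRUE
as_period
#> [1] FALSE
```

With the question's sample data all times fall between 9:00 and 9:30, so the string comparison would happen to give the right answer, but it would stop doing so as soon as a time of 10:00 or later appears.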
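A few key points here:
- Use pivot_longer() to gather the different Goal columns into a single column.
- Convert Time to an actual time format so that you can work out which time is earlier.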
library(tidyverse)
d <-
read.table(header = T,
text = "Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15
234 1 0 9:12
234 0 1 9:30")
d %>%
pivot_longer(
starts_with("Goal"),
names_to = "Goal",
values_to = "is_goal",
names_prefix = "Goal"
) %>%
mutate(n_clients = length(unique(Client))) %>% # to keep for later as denominator of percentage
mutate(Goal = as.integer(Goal)) %>% # turn to numeric so you can assess who got both
filter(is_goal > 0) %>% # remove empty entries
mutate(Time = lubridate::hm(Time)) %>% # convert to time to calculate what was first (namespaced: tidyverse only attaches lubridate from 2.0.0 on)
group_by(Client) %>% # operate per-client
filter(sum(Goal) == 3) %>% # remove clients who didn't achieve both goals
mutate(in_order = Time[Goal == 1] < Time[Goal == 2]) %>% # score whether goal 2 was after 1
ungroup() %>%
filter(in_order) %>% # remove clients who were not in order
distinct(Client, n_clients) %>%
summarise(percentage = 100 * nrow(.) / n_clients) # summarize as percentage
#> # A tibble: 1 x 1
#> percentage
#> <dbl>
#> 1 50
Created on 2021-12-28 by the reprex package (v0.3.0)
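As a rough cross-check of the percentage, the same question can be sketched without pivoting. This is not the approach of either answer above: it assumes that "in order" means a client's earliest Goal1 time comes before their latest Goal2 time, and the names per_client, minutes and pct are made up for this illustration.

```r
library(dplyr)

# Same sample data as in the answer above
d <- read.table(header = TRUE, stringsAsFactors = FALSE,
                text = "Client Goal1 Goal2 Time
123 0 1 9:00
123 1 0 9:15
234 1 0 9:12
234 0 1 9:30")

per_client <- d %>%
  # turn "h:mm" into a plain number so min()/max() behave predictably
  mutate(minutes = lubridate::period_to_seconds(lubridate::hm(Time)) / 60) %>%
  group_by(Client) %>%
  summarise(
    # assumption: earliest Goal1 must precede latest Goal2;
    # && short-circuits, so clients missing a goal never reach min()/max()
    in_order = any(Goal1 == 1) && any(Goal2 == 1) &&
      min(minutes[Goal1 == 1]) < max(minutes[Goal2 == 1])
  )

# share of all clients that reached Goal2 after Goal1, as a percentage
pct <- 100 * mean(per_client$in_order)
pct
#> [1] 50
```

Here client 123 reaches Goal2 (9:00) before Goal1 (9:15) and is excluded, while client 234 reaches Goal1 (9:12) before Goal2 (9:30), which gives one client out of two.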