Aggregate dd-mm-yyyy hh:mm:ss timestamps by minute within groups in R


mydata <- structure(list(
  lead_create = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                            2L, 2L, 3L, 3L, 3L, 3L),
    .Label = c("10.11.2017 4:47:26", "10.11.2017 4:48:26", "10.11.2017 4:49:26"),
    class = "factor"),
  lead_id = c(24799522L, 24799522L, 24799522L, 24799522L, 24799522L, 24799522L,
              24799522L, 24799522L, 24799522L, 24799522L, 24799522L, 24799522L,
              24799523L, 24799523L, 24799524L, 24799524L, 24799524L, 24799524L),
  webmaster_identifier = c(430L, 430L, 430L, 430L, 430L, 431L, 431L, 431L, 431L,
                           431L, 431L, 431L, 430L, 430L, 430L, 430L, 430L, 430L),
  product = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
                        1L, 1L, 1L, 1L, 1L, 1L),
    .Label = c("gel", "Intoxic"), class = "factor"),
  lead_country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                             1L, 1L, 1L, 1L, 1L, 1L),
    .Label = "Indonesia", class = "factor")),
  .Names = c("lead_create", "lead_id", "webmaster_identifier", "product",
             "lead_country"),
  class = "data.frame", row.names = c(NA, -18L))

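I don't know why, but in this example lead_create is a factor, even though it is really a date variable.

I need to count the number of unique lead_id values per minute, grouped by webmaster_identifier, product, and lead_country. The lead_create date format is dd-mm-yyyy hh:mm:ss. I need the result in a data frame like this: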

      lead_create lead_id webmaster_identifier product lead_country
1 10.11.2017 4:47       1                  430 Intoxic    Indonesia
2 10.11.2017 4:47       1                  431 Intoxic    Indonesia
3 10.11.2017 4:48       1                  430     gel    Indonesia
4 10.11.2017 4:49       1                  430     gel    Indonesia

For the period 10.11.2017 4:47:00 to 10.11.2017 4:47:59, with webmaster = 430, product = Intoxic, lead_country = Indonesia, there is only one unique lead_id.

For the period 10.11.2017 4:47:00 to 10.11.2017 4:47:59, with webmaster = 431, product = Intoxic, lead_country = Indonesia, there is also only one unique lead_id.

For the period 10.11.2017 4:48:00 to 10.11.2017 4:48:59, with webmaster = 430, product = gel, lead_country = Indonesia, there is only one unique lead_id.

For the period 10.11.2017 4:49:00 to 10.11.2017 4:49:59, with webmaster = 430, product = gel, lead_country = Indonesia, there is only one unique lead_id.

How can I create such a data frame?

It looks like we need to strip the trailing seconds from "lead_create" and then take the distinct rows:

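library(dplyr)
library(stringr)
mydata %>%
  # strip the trailing ":ss" so timestamps are at minute resolution
  mutate(lead_create = str_remove(lead_create, ":\\d+$")) %>%
  # keep one row per unique combination of all columns
  distinct() %>%
  # index the lead_country groups (a single group here, so every row gets 1)
  mutate(lead_id = group_indices(., lead_country))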
#     lead_create lead_id webmaster_identifier product lead_country
#1 10.11.2017 4:47       1                  430 Intoxic    Indonesia
#2 10.11.2017 4:47       1                  431 Intoxic    Indonesia
#3 10.11.2017 4:48       1                  430     gel    Indonesia
#4 10.11.2017 4:49       1                  430     gel    Indonesia
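
Note that the snippet above assigns a group index rather than an actual count, which happens to match the expected output because each minute/group combination has exactly one lead_id. If a combination could contain several different lead_id values, an explicit count may be safer. Below is a minimal sketch of that alternative, not the original answer's method. It assumes dplyr >= 1.0.0 (for the `.groups` argument), uses a shortened stand-in for `mydata`, and assumes the timestamps can be read as UTC; `lead_time` is a helper column introduced only for illustration.

```r
library(dplyr)

# Shortened stand-in for the question's `mydata` (same columns, fewer rows).
mydata <- data.frame(
  lead_create = factor(c("10.11.2017 4:47:26", "10.11.2017 4:47:26",
                         "10.11.2017 4:47:26", "10.11.2017 4:48:26",
                         "10.11.2017 4:49:26", "10.11.2017 4:49:26")),
  lead_id = c(24799522L, 24799522L, 24799522L,
              24799523L, 24799524L, 24799524L),
  webmaster_identifier = c(430L, 430L, 431L, 430L, 430L, 430L),
  product = factor(c("Intoxic", "Intoxic", "Intoxic", "gel", "gel", "gel")),
  lead_country = factor("Indonesia")
)

result <- mydata %>%
  mutate(
    # factor -> character -> POSIXct; tz = "UTC" is an assumption
    lead_time = as.POSIXct(as.character(lead_create),
                           format = "%d.%m.%Y %H:%M:%S", tz = "UTC"),
    # drop the seconds by formatting at minute resolution
    lead_create = format(lead_time, "%d.%m.%Y %H:%M")
  ) %>%
  group_by(lead_create, webmaster_identifier, product, lead_country) %>%
  # count distinct lead_id values within each minute/group combination
  summarise(lead_id = n_distinct(lead_id), .groups = "drop")

print(result)
```

Parsing to POSIXct first also addresses the factor issue from the question: once lead_create is a real date-time, the minute can be derived with `format()` instead of a regular expression. One visible difference is that `format()` zero-pads the hour, so the minute label has the form "10.11.2017 04:47".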
