r - Find the most frequent value, split by day



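I want to see which category occurs most frequently for each participant on each day. Several categories occur every day, and I want a new column that states the category that occurred most often for a given participant on a given date.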

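I have a "user_id" column, a "date" column, and a "category" column (character). Which code should I use to add a new column that states only the category that occurs most often for a given user on a given date?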

dput:

structure(list(user_id = c("10257", "10580", "10280", "10202", "10275","10281"),
date = structure(c(1552521600, 1552003200, 1551139200,1551484800, 1552867200, 1552521600), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
better_category = c("Email", "Internet_Browser", "Instant_Messaging","News","Background_Process","Instant_Messaging")),
row.names = c(176300L, 184332L, 469288L, 119462L, 112507L, 399236L), 
class = "data.frame")

Let's create some data:

require(dplyr)
set.seed(100)
data<-data.frame(user_id=rep(c(1,2,3),10),date=rep(c("tuesday","wednesday","thursday"),each=10),category=(sample(c(1:3),30,replace=TRUE)))

If we `arrange` it to make it easier to read, we get this:

data<-data %>% arrange(user_id,date)
data
   user_id      date category
1        1  thursday        3
2        1  thursday        2
3        1  thursday        3
4        1   tuesday        1
5        1   tuesday        1
6        1   tuesday        3
7        1   tuesday        1
8        1 wednesday        1
9        1 wednesday        3
10       1 wednesday        2
11       2  thursday        2
12       2  thursday        1
13       2  thursday        2
14       2   tuesday        1
15       2   tuesday        2
16       2   tuesday        2
17       2 wednesday        2
18       2 wednesday        2
19       2 wednesday        1
20       2 wednesday        3
21       3  thursday        2
22       3  thursday        3
23       3  thursday        3
24       3  thursday        1
25       3   tuesday        2
26       3   tuesday        2
27       3   tuesday        2
28       3 wednesday        3
29       3 wednesday        3
30       3 wednesday        2

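Now we group by `user_id` and `date` and create a new column called `max` that holds the most frequent category in each group. We call `table` on `category`, which counts how often each value occurs within the group; sorting those counts in decreasing order and taking the first name gives the most frequent category: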

data %>% group_by(user_id,date) %>% 
dplyr::mutate(max=names(sort(table(category),decreasing=TRUE))[1])
# A tibble: 30 x 4
# Groups:   user_id, date [9]
   user_id date      category max  
     <dbl> <fct>        <int> <chr>
 1       1 thursday         3 3    
 2       1 thursday         2 3    
 3       1 thursday         3 3    
 4       1 tuesday          1 1    
 5       1 tuesday          1 1    
 6       1 tuesday          3 1    
 7       1 tuesday          1 1    
 8       1 wednesday        1 1    
 9       1 wednesday        3 1    
10       1 wednesday        2 1    
# ... with 20 more rows

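As you can see, each user/day group has its own `max`. In the last group shown here (user 1 on Wednesday), each of the three categories occurs exactly once, so the first one is chosen, which is 1.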
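To make the tie-breaking behaviour easier to see, here is a minimal sketch that wraps the same `table`/`sort`/`names` expression in a standalone function. The name `most_frequent` is a hypothetical helper introduced only for illustration; it is not part of the answer above.

```r
# Hypothetical helper (an assumption for illustration, not from the answer):
# returns the most frequent value of a vector as a character string.
most_frequent <- function(x) {
  counts <- table(x)                         # occurrences of each distinct value
  ranked <- sort(counts, decreasing = TRUE)  # highest count first; ties keep table order
  names(ranked)[1]                           # label of the top entry
}

most_frequent(c(3, 2, 3))  # "3": a clear winner
most_frequent(c(1, 3, 2))  # "1": three-way tie, the first table entry wins
```

Note that the result is always a character string, because it comes from the names of the table. That is why the `max` column is `<chr>` even though `category` is an integer.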

Here is the result using the dput data (where every row is a unique user/date pairing):

# A tibble: 6 x 4
# Groups:   user_id, date [6]
  user_id date                better_category    max               
  <fct>   <dttm>              <fct>              <chr>             
1 10257   2019-03-14 00:00:00 Email              Email             
2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser  
3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging 
4 10202   2019-03-02 00:00:00 News               News              
5 10275   2019-03-18 00:00:00 Background_Process Background_Process
6 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 

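Since every group there has only one row, I created the same table but duplicated the last row twice, changed the category of one of the three copies to "News", and ran the same code: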

# A tibble: 8 x 4
# Groups:   user_id, date [6]
  user_id date                better_category    max               
  <chr>   <dttm>              <chr>              <chr>             
1 10257   2019-03-14 00:00:00 Email              Email             
2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser  
3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging 
4 10202   2019-03-02 00:00:00 News               News              
5 10275   2019-03-18 00:00:00 Background_Process Background_Process
6 10281   2019-03-14 00:00:00 News               Instant_Messaging 
7 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 
8 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 

Note the last three rows.
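The experiment described above can be reproduced roughly as follows. This is a sketch under a few assumptions: the names `df`, `df8`, and `result` are introduced here for illustration, the original row names are dropped, and the dates are written as date strings in UTC rather than as the raw POSIXct numbers from the dput.

```r
library(dplyr)

# The six rows from the question's dput (row names omitted).
df <- data.frame(
  user_id = c("10257", "10580", "10280", "10202", "10275", "10281"),
  date = as.POSIXct(c("2019-03-14", "2019-03-08", "2019-02-26",
                      "2019-03-02", "2019-03-18", "2019-03-14"), tz = "UTC"),
  better_category = c("Email", "Internet_Browser", "Instant_Messaging",
                      "News", "Background_Process", "Instant_Messaging"),
  stringsAsFactors = FALSE
)

# Duplicate the last row twice, then relabel the first copy as "News".
df8 <- df[c(1:6, 6, 6), ]
df8$better_category[6] <- "News"

# Same grouping and table/sort/names expression as in the answer.
result <- df8 %>%
  group_by(user_id, date) %>%
  mutate(max = names(sort(table(better_category), decreasing = TRUE))[1]) %>%
  ungroup()

tail(result, 3)  # all three rows for user 10281 share the same max
```

For user 10281 on 2019-03-14, "Instant_Messaging" occurs twice and "News" once, so every row in that group gets "Instant_Messaging", including the row whose own category is "News".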
