How can I detect patterns and their frequency in a character column using R?



I have a df that shows people's "activity chains" and looks like this (snippet at the bottom of the question):

head(agents)
id                                                                                                                                                                leg_activity
1   9                                                                                      home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home
2  10 home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home
3  11                                                                                                                                                      home, work, adpt, home
4  96                                                                                                                                home, car, work, car, home, work, adpt, home
5  97                              home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home
6 101                                       home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home

What I am interested in is detecting the patterns in which adpt occurs. The simplest approach is the count() function, which gives me a frequency table as output. Unfortunately, that result is misleading.

This is what it looks like:

x                                 freq
home, adpt, work, adpt, home      2071
home, adpt, shop, adpt, home      653
home, adpt, education, adpt, home 545
home, pt, work, adpt, home        492
home, adpt, work, pt, home        468
home, adpt, work, home            283
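
For reference, a table in this x/freq shape can be produced by counting each complete chain as a single value, for instance with plyr::count() on the column shown in head(agents) above (a sketch only; that this is the exact count() used here is an assumption):

library(plyr)

# Count each complete activity chain as one value; the result has the
# columns x (the chain) and freq (how often that exact chain occurs).
chain_freq <- count(agents$leg_activity)
head(chain_freq[order(-chain_freq$freq), ])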

The problem with this approach is that I cannot detect patterns inside longer activity chains; for example:

home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home

This case starts with an activity chain that is very frequent, but because further activities follow, it is not counted by the count() function.

Is there a way to use a count function that also takes into account what happens inside the cells? It would therefore be interesting to have a table that shows all possible combinations and their frequencies, like this:

x                                freq
home, adpt, home                 10
home, adpt, home, pt, work, home 4
home, pt, work, home             2
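
To make the goal concrete, counting how often one specific sub-chain occurs anywhere inside the cells, rather than only as a whole-cell match, could look like the sketch below (the sub-chain "home, adpt, work, adpt, home" is only an illustration):

library(stringr)

# Count every occurrence of one illustrative sub-chain inside each cell,
# regardless of what comes before or after it, and sum over all rows.
sum(str_count(agents$leg_activity, fixed("home, adpt, work, adpt, home")))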

Thanks a lot for your help!

Data:

structure(list(id = c(9L, 10L, 11L, 96L, 97L, 101L, 103L, 248L, 
499L, 1044L, 1215L, 1238L, 1458L, 1569L, 1615L, 1626L, 1734L, 
1735L, 1790L, 1912L, 9040L, 14858L, 14859L, 14967L, 15011L, 15012L, 
15015L, 15045L, 15050L, 15058L, 15060L, 15086L, 15088L, 15094L, 
15109L, 15113L, 15152L, 15157L, 15192L, 15193L, 15222L, 15230L, 
15231L, 15234L, 15235L, 15237L, 15256L, 15257L, 15258L, 15269L, 
15270L, 15318L, 15319L, 15338L, 15369L, 15371L, 15396L, 15397L, 
15399L, 15404L, 15505L, 15506L, 15515L, 15516L, 15525L, 15542L, 
15593L, 15602L, 15608L, 15643L, 15667L, 15727L, 15728L, 15729L, 
15752L, 15775L, 15808L, 15851L, 15869L, 15881L, 15882L, 15960L, 
15962L, 15966L, 16058L, 16107L, 16174L, 16229L, 16237L, 16238L, 
16291L, 16333L, 16416L, 16418L, 16449L, 16450L, 16451L, 16491L, 
16506L, 16508L), leg_activity = c("home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home", 
"home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home", 
"home, work, adpt, home", "home, car, work, car, home, work, adpt, home", 
"home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home", 
"home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home", 
"home, adpt, work, adpt, home, walk, other, pt, home", "home, adpt, work, walk, home, adpt, work, walk, home", 
"home, adpt, leisure, adpt, home, bike, outside, bike, home", 
"home, pt, work, adpt, home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, outside, car, work, car, work, car, home", 
"home, work, leisure, adpt, home", "home, outside, pt, home, adpt, leisure, adpt, home", 
"home, car_passenger, leisure, walk, work, walk, leisure, walk, work, adpt, home, walk, home", 
"home, adpt, work, walk, work, walk, work, pt, home", "home, car, work, pt, leisure, adpt, work, car, home, car, home", 
"home, adpt, other, adpt, home, car, home", "home, adpt, other, adpt, home", 
"home, education, walk, shop, walk, education, pt, outside, home, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, adpt, work, pt, leisure, adpt, work, adpt, work, adpt, home, adpt, other, walk, home", 
"home, adpt, work, adpt, home, adpt, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, outside, car_passenger, leisure, car_passenger, home, car_passenger, home", 
"home, adpt, other, adpt, home, car, work, car, home", "home, adpt, education, adpt, leisure, adpt, home, walk, leisure, walk, home", 
"home, car_passenger, other, pt, home, walk, other, walk, home, car_passenger, other, walk, home, adpt, other, adpt, home", 
"home, work, pt, work, adpt, work, adpt, home", "home, adpt, leisure, adpt, home, car, shop, car, other, car, home", 
"home, adpt, work, adpt, home, walk, other, adpt, home", "home, adpt, work, adpt, home, car_passenger, leisure, car_passenger, home", 
"home, car, other, car, home, adpt, shop, adpt, home", "home, pt, work, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, walk, education, adpt, home, walk, education, walk, home, bike, leisure, bike, home", 
"home, adpt, shop, adpt, home, car, home", "home, adpt, leisure, walk, leisure, walk, leisure, adpt, home", 
"home, adpt, shop, pt, home, adpt, other, adpt, home", "home, adpt, other, adpt, home, car_passenger, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, shop, car_passenger, home", 
"home, adpt, other, adpt, work, adpt, home", "home, adpt, work, adpt, home, adpt, other, walk, shop, walk, home, car, outside, car, outside, car, outside, car, home", 
"home, adpt, other, adpt, home", "home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, pt, work, adpt, work, adpt, work, adpt, work, adpt, home, adpt, work, adpt, home", 
"home, walk, other, car_passenger, education, walk, home, car_passenger, education, adpt, home", 
"home, walk, shop, walk, home, walk, leisure, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, shop, walk, home, walk, leisure, walk, home, walk, home", 
"home, adpt, leisure, adpt, home", "home, walk, leisure, walk, home, adpt, other, adpt, shop, walk, leisure, walk, home", 
"home, pt, leisure, adpt, home, pt, outside, pt, home, bike, leisure, bike, home", 
"home, pt, outside, pt, home, walk, home, walk, other, adpt, shop, pt, home, car_passenger, leisure, adpt, home", 
"home, adpt, work, adpt, home, adpt, shop, adpt, work, adpt, home", 
"home, adpt, shop, adpt, other, walk, home", "home, walk, other, walk, home, walk, home, adpt, other, adpt, home, adpt, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, adpt, leisure, pt, home", "home, leisure, adpt, home", 
"home, adpt, leisure, pt, shop, walk, home, walk, shop, walk, home", 
"home, car, outside, car, outside, leisure, car, outside, car, outside, car, home, adpt, other, adpt, home", 
"home, adpt, work, adpt, shop, walk, home", "home, adpt, other, walk, work, adpt, home, adpt, other, adpt, work, adpt, home, adpt, leisure, adpt, home", 
"home, adpt, leisure, adpt, home, car, shop, car, home", "home, walk, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, walk, leisure, walk, home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, leisure, pt, shop, adpt, home, adpt, leisure, walk, home", 
"home, walk, other, walk, leisure, walk, home, car, leisure, car, home, walk, leisure, adpt, home", 
"home, adpt, work, adpt, home", "home, walk, leisure, walk, home, adpt, leisure, adpt, home, adpt, leisure, walk, home", 
"home, walk, home, walk, shop, walk, home, walk, leisure, walk, home, adpt, other, adpt, home", 
"home, car_passenger, outside, car_passenger, outside, car_passenger, home, adpt, other, adpt, home", 
"home, walk, education, adpt, home", "home, adpt, education, walk, home, bike, education, bike, home", 
"home, adpt, other, adpt, home, adpt, shop, pt, home", "home, adpt, other, adpt, shop, walk, home, adpt, leisure, car_passenger, home", 
"home, adpt, work, adpt, other, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, car, work, adpt, leisure, adpt, work, car, home", 
"home, adpt, shop, adpt, home, car, other, car, home, car_passenger, outside, car_passenger, home", 
"home, adpt, work, pt, home, car, shop, car, home", "home, walk, other, adpt, work, adpt, shop, adpt, shop, adpt, home", 
"home, adpt, leisure, adpt, shop, adpt, leisure, pt, home", "home, adpt, leisure, adpt, shop, adpt, home", 
"home, car, outside, car, outside, car, outside, car, outside, car, home, adpt, education, pt, home", 
"home, adpt, work, adpt, home", "home, adpt, shop, adpt, home", 
"home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, adpt, other, adpt, other, walk, leisure, adpt, other, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, other, car, home", 
"home, car, work, car, shop, car, home, adpt, work, adpt, home, car, home", 
"home, walk, other, walk, education, adpt, home, adpt, education, walk, home, walk, home", 
"home, adpt, shop, walk, leisure, adpt, home", "home, adpt, shop, walk, home, adpt, work, adpt, home", 
"home, adpt, leisure, adpt, shop, walk, home", "home, walk, other, adpt, shop, walk, home, walk, other, walk, home, walk, other, walk, other, adpt, home", 
"home, adpt, education, walk, home, walk, education, walk, home, walk, home", 
"home, bike, education, bike, home, adpt, education, adpt, home, walk, home"
)), row.names = c(NA, 100L), class = "data.frame")

I am not entirely sure what exactly you want to do, but I understand that you are interested in detecting the patterns in which the activity adpt occurs. This is commonly done in NLP, and below is a solution using the tidytext package. I split the leg_activity column into so-called n-grams, i.e. I break the text up into sequences of consecutive words. A sequence of two consecutive words is called a bi-gram, three consecutive words a tri-gram, and so on. When we count these n-grams, we learn which activities most often come before adpt and which most often come after it.

Here is how to do this for bi-grams:

library(dplyr)
library(tidytext)
library(stringr)

df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%
  filter(str_detect(bigram, "adpt")) %>%
  count(bigram, sort = TRUE)
           bigram   n
1       home adpt 100
2       adpt home  97
3       work adpt  51
4       adpt work  48
5    leisure adpt  27
6      adpt other  26
7      other adpt  26
8    adpt leisure  24
9       adpt shop  22
10      shop adpt  13
11 adpt education  10
12 education adpt  10
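
If you want to look at only one side of adpt at a time, the same bigram tokens can be split into the word before and the word after, for example with tidyr::separate() (a sketch building on the pipeline above; word1/word2 are just column names I chose):

library(tidyr)

# Split each bigram into the activity before and the activity after,
# then keep only the rows where adpt is the second word, i.e. count
# which activities immediately precede adpt.
df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word2 == "adpt") %>%
  count(word1, sort = TRUE)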
So "home" is what most often comes directly before adpt, and "home" is also what most often comes directly after "adpt". If we are interested in three consecutive activities that include "adpt", we can do the same with tri-grams:

df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 3) %>%  # n is the only thing that changed
  filter(str_detect(bigram, "adpt")) %>%
  count(bigram, sort = TRUE)
                         bigram  n
1                work adpt home 42
2                adpt work adpt 40
3                home adpt work 36
4               home adpt other 22
5               adpt other adpt 21
6             home adpt leisure 20
7             leisure adpt home 19
8               other adpt home 18
9             adpt leisure adpt 16
10               adpt home adpt 15
11               home adpt shop 12
12                adpt home car 11
13               adpt home walk 11
14               adpt shop adpt 11
15          home adpt education 10
16          education adpt home  9
[list continues]

This list is much longer because there are now many more possible combinations. If you want to read more, here is a link to a good tutorial on n-grams. Is this what you wanted to do?
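
If you would rather see several window sizes at once instead of running the pipeline separately for each n, one way is to sweep over a range of n values, for example with purrr (a sketch under the same assumptions as above; the column name n_words is mine):

library(purrr)

# Count all n-grams containing "adpt" for window sizes 2 to 4 and keep
# track of the window size in an extra column.
map_dfr(2:4, function(k) {
  df %>%
    unnest_tokens(ngram, leg_activity, token = "ngrams", n = k) %>%
    filter(str_detect(ngram, "adpt")) %>%
    count(ngram, sort = TRUE) %>%
    mutate(n_words = k)
})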
