I have a df that shows people's "activity chains". It looks like this (a snippet of the data is at the bottom of the question):
head(agents)
id leg_activity
1 9 home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home
2 10 home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home
3 11 home, work, adpt, home
4 96 home, car, work, car, home, work, adpt, home
5 97 home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home
6 101 home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home
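I am interested in detecting the patterns in which adpt occurs. The simplest approach is the count() function, which gives me a frequency table as output. Unfortunately, the result is misleading.
This is what it looks like: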
x freq
home, adpt, work, adpt, home 2071
home, adpt, shop, adpt, home 653
home, adpt, education, adpt, home 545
home, pt, work, adpt, home 492
home, adpt, work, pt, home 468
home, adpt, work, home 283
The problem with this approach is that I cannot detect patterns inside longer activity chains, for example:
home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home
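This chain starts with a sub-chain that is very frequent, but because further activities follow it, the count function does not count it as an occurrence of that pattern.
Is there a way to use the count function while also taking into account what happens inside each cell? It would be interesting to have a table that shows all possible combinations and their frequencies, like this: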
x freq
home, adpt, home 10
home, adpt, home, pt, work, home 4
home, pt, work, home 2
Thank you very much for your help!
Data:
structure(list(id = c(9L, 10L, 11L, 96L, 97L, 101L, 103L, 248L,
499L, 1044L, 1215L, 1238L, 1458L, 1569L, 1615L, 1626L, 1734L,
1735L, 1790L, 1912L, 9040L, 14858L, 14859L, 14967L, 15011L, 15012L,
15015L, 15045L, 15050L, 15058L, 15060L, 15086L, 15088L, 15094L,
15109L, 15113L, 15152L, 15157L, 15192L, 15193L, 15222L, 15230L,
15231L, 15234L, 15235L, 15237L, 15256L, 15257L, 15258L, 15269L,
15270L, 15318L, 15319L, 15338L, 15369L, 15371L, 15396L, 15397L,
15399L, 15404L, 15505L, 15506L, 15515L, 15516L, 15525L, 15542L,
15593L, 15602L, 15608L, 15643L, 15667L, 15727L, 15728L, 15729L,
15752L, 15775L, 15808L, 15851L, 15869L, 15881L, 15882L, 15960L,
15962L, 15966L, 16058L, 16107L, 16174L, 16229L, 16237L, 16238L,
16291L, 16333L, 16416L, 16418L, 16449L, 16450L, 16451L, 16491L,
16506L, 16508L), leg_activity = c("home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home",
"home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home",
"home, work, adpt, home", "home, car, work, car, home, work, adpt, home",
"home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home",
"home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home",
"home, adpt, work, adpt, home, walk, other, pt, home", "home, adpt, work, walk, home, adpt, work, walk, home",
"home, adpt, leisure, adpt, home, bike, outside, bike, home",
"home, pt, work, adpt, home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, outside, car, work, car, work, car, home",
"home, work, leisure, adpt, home", "home, outside, pt, home, adpt, leisure, adpt, home",
"home, car_passenger, leisure, walk, work, walk, leisure, walk, work, adpt, home, walk, home",
"home, adpt, work, walk, work, walk, work, pt, home", "home, car, work, pt, leisure, adpt, work, car, home, car, home",
"home, adpt, other, adpt, home, car, home", "home, adpt, other, adpt, home",
"home, education, walk, shop, walk, education, pt, outside, home, adpt, leisure, adpt, home",
"home, adpt, work, adpt, home, walk, home", "home, adpt, work, pt, leisure, adpt, work, adpt, work, adpt, home, adpt, other, walk, home",
"home, adpt, work, adpt, home, adpt, work, adpt, home, walk, leisure, walk, home",
"home, adpt, work, adpt, home, work, adpt, home, walk, leisure, walk, home",
"home, adpt, work, adpt, home, car_passenger, outside, car_passenger, leisure, car_passenger, home, car_passenger, home",
"home, adpt, other, adpt, home, car, work, car, home", "home, adpt, education, adpt, leisure, adpt, home, walk, leisure, walk, home",
"home, car_passenger, other, pt, home, walk, other, walk, home, car_passenger, other, walk, home, adpt, other, adpt, home",
"home, work, pt, work, adpt, work, adpt, home", "home, adpt, leisure, adpt, home, car, shop, car, other, car, home",
"home, adpt, work, adpt, home, walk, other, adpt, home", "home, adpt, work, adpt, home, car_passenger, leisure, car_passenger, home",
"home, car, other, car, home, adpt, shop, adpt, home", "home, pt, work, adpt, home",
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home",
"home, walk, education, adpt, home, walk, education, walk, home, bike, leisure, bike, home",
"home, adpt, shop, adpt, home, car, home", "home, adpt, leisure, walk, leisure, walk, leisure, adpt, home",
"home, adpt, shop, pt, home, adpt, other, adpt, home", "home, adpt, other, adpt, home, car_passenger, leisure, walk, home",
"home, adpt, work, adpt, home, car_passenger, shop, car_passenger, home",
"home, adpt, other, adpt, work, adpt, home", "home, adpt, work, adpt, home, adpt, other, walk, shop, walk, home, car, outside, car, outside, car, outside, car, home",
"home, adpt, other, adpt, home", "home, adpt, education, adpt, home, adpt, education, adpt, home",
"home, pt, work, adpt, work, adpt, work, adpt, work, adpt, home, adpt, work, adpt, home",
"home, walk, other, car_passenger, education, walk, home, car_passenger, education, adpt, home",
"home, walk, shop, walk, home, walk, leisure, adpt, leisure, adpt, home",
"home, adpt, work, adpt, home, walk, shop, walk, home, walk, leisure, walk, home, walk, home",
"home, adpt, leisure, adpt, home", "home, walk, leisure, walk, home, adpt, other, adpt, shop, walk, leisure, walk, home",
"home, pt, leisure, adpt, home, pt, outside, pt, home, bike, leisure, bike, home",
"home, pt, outside, pt, home, walk, home, walk, other, adpt, shop, pt, home, car_passenger, leisure, adpt, home",
"home, adpt, work, adpt, home, adpt, shop, adpt, work, adpt, home",
"home, adpt, shop, adpt, other, walk, home", "home, walk, other, walk, home, walk, home, adpt, other, adpt, home, adpt, shop, adpt, home, car, other, car, home, adpt, other, adpt, home",
"home, adpt, leisure, pt, home", "home, leisure, adpt, home",
"home, adpt, leisure, pt, shop, walk, home, walk, shop, walk, home",
"home, car, outside, car, outside, leisure, car, outside, car, outside, car, home, adpt, other, adpt, home",
"home, adpt, work, adpt, shop, walk, home", "home, adpt, other, walk, work, adpt, home, adpt, other, adpt, work, adpt, home, adpt, leisure, adpt, home",
"home, adpt, leisure, adpt, home, car, shop, car, home", "home, walk, shop, adpt, home, car, other, car, home, adpt, other, adpt, home",
"home, walk, leisure, walk, home, adpt, work, adpt, home", "home, adpt, work, adpt, home",
"home, adpt, leisure, pt, shop, adpt, home, adpt, leisure, walk, home",
"home, walk, other, walk, leisure, walk, home, car, leisure, car, home, walk, leisure, adpt, home",
"home, adpt, work, adpt, home", "home, walk, leisure, walk, home, adpt, leisure, adpt, home, adpt, leisure, walk, home",
"home, walk, home, walk, shop, walk, home, walk, leisure, walk, home, adpt, other, adpt, home",
"home, car_passenger, outside, car_passenger, outside, car_passenger, home, adpt, other, adpt, home",
"home, walk, education, adpt, home", "home, adpt, education, walk, home, bike, education, bike, home",
"home, adpt, other, adpt, home, adpt, shop, pt, home", "home, adpt, other, adpt, shop, walk, home, adpt, leisure, car_passenger, home",
"home, adpt, work, adpt, other, adpt, home", "home, adpt, work, adpt, home",
"home, adpt, work, adpt, home, walk, home", "home, car, work, adpt, leisure, adpt, work, car, home",
"home, adpt, shop, adpt, home, car, other, car, home, car_passenger, outside, car_passenger, home",
"home, adpt, work, pt, home, car, shop, car, home", "home, walk, other, adpt, work, adpt, shop, adpt, shop, adpt, home",
"home, adpt, leisure, adpt, shop, adpt, leisure, pt, home", "home, adpt, leisure, adpt, shop, adpt, home",
"home, car, outside, car, outside, car, outside, car, outside, car, home, adpt, education, pt, home",
"home, adpt, work, adpt, home", "home, adpt, shop, adpt, home",
"home, adpt, education, adpt, home, adpt, education, adpt, home",
"home, adpt, other, adpt, other, walk, leisure, adpt, other, adpt, home",
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, other, car, home",
"home, car, work, car, shop, car, home, adpt, work, adpt, home, car, home",
"home, walk, other, walk, education, adpt, home, adpt, education, walk, home, walk, home",
"home, adpt, shop, walk, leisure, adpt, home", "home, adpt, shop, walk, home, adpt, work, adpt, home",
"home, adpt, leisure, adpt, shop, walk, home", "home, walk, other, adpt, shop, walk, home, walk, other, walk, home, walk, other, walk, other, adpt, home",
"home, adpt, education, walk, home, walk, education, walk, home, walk, home",
"home, bike, education, bike, home, adpt, education, adpt, home, walk, home"
)), row.names = c(NA, 100L), class = "data.frame")
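I am not entirely sure what exactly you want to do, but I understand that you are interested in detecting the patterns in which the activity adpt occurs. This is commonly done in NLP, and below is a solution using the tidytext package. I split the leg_activity column into so-called n-grams, that is, I break the text down into sequences of consecutive words. A sequence of two consecutive words is called a bi-gram, a sequence of three consecutive words a tri-gram, and so on. When we count these n-grams, we learn which activities most often come right before adpt and which most often come right after it.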
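To make the idea of n-grams concrete before bringing in tidytext, here is a minimal base-R sketch that breaks a single chain into bi-grams and tri-grams. The helper name chain_ngrams is hypothetical and only for illustration; it is not part of tidytext, and it assumes activities are separated by commas as in the question's data.

```r
# Minimal base-R sketch: break one activity chain into n-grams.
# `chain_ngrams` is a hypothetical helper for illustration only.
chain_ngrams <- function(chain, n = 2) {
  tokens <- strsplit(chain, ",\\s*")[[1]]        # split the chain on commas
  if (length(tokens) < n) return(character(0))   # chain shorter than n: no n-grams
  starts <- seq_len(length(tokens) - n + 1)      # start index of every window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

chain    <- "home, work, adpt, home"
bigrams  <- chain_ngrams(chain, n = 2)
trigrams <- chain_ngrams(chain, n = 3)
print(bigrams)   # "home work" "work adpt" "adpt home"
print(trigrams)  # "home work adpt" "work adpt home"
```

Each n-gram is simply a sliding window of n consecutive activities, so a chain with k activities yields k - n + 1 n-grams.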
Here is how to do this for bi-grams:
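library(dplyr)     # %>%, filter(), count()
library(tidytext)  # unnest_tokens()
library(stringr)   # str_detect()
# df is the data frame from the question (called agents there)
df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%
  filter(str_detect(bigram, "adpt")) %>%
  count(bigram, sort = TRUE)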
bigram n
1 home adpt 100
2 adpt home 97
3 work adpt 51
4 adpt work 48
5 leisure adpt 27
6 adpt other 26
7 other adpt 26
8 adpt leisure 24
9 adpt shop 22
10 shop adpt 13
11 adpt education 10
12 education adpt 10
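So adpt is most often preceded by "home", and "home" is also what most often directly follows adpt. If we are interested in three consecutive activities that include adpt, we can do the same thing with tri-grams: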
df %>%
unnest_tokens(bigram, leg_activity, token = "ngrams", n = 3) %>% #n is the only thing that changed
filter(str_detect(bigram, "adpt")) %>%
count(bigram, sort = TRUE)
bigram n
1 work adpt home 42
2 adpt work adpt 40
3 home adpt work 36
4 home adpt other 22
5 adpt other adpt 21
6 home adpt leisure 20
7 leisure adpt home 19
8 other adpt home 18
9 adpt leisure adpt 16
10 adpt home adpt 15
11 home adpt shop 12
12 adpt home car 11
13 adpt home walk 11
14 adpt shop adpt 11
15 home adpt education 10
16 education adpt home 9
[list continues]
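This list is much longer, because there are now many more possible combinations. If you want to learn more, here is a link to a good tutorial on n-grams. Is this what you wanted to do?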
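If what you are after is closer to the table in the question (sub-chains of several lengths in one frequency table), one possible extension is to collect the n-grams for a range of n and count them together. The following is only a sketch in base R under my own assumptions: the function name count_subchains and the parameters max_n and keyword are hypothetical, and it counts every contiguous sub-chain of 2 to max_n activities in which the keyword appears as a whole activity.

```r
# Sketch: count every contiguous sub-chain of 2..max_n activities that
# contains the keyword. All names here are hypothetical, for illustration.
count_subchains <- function(chains, max_n = 5, keyword = "adpt") {
  grams <- unlist(lapply(chains, function(chain) {
    tokens <- strsplit(chain, ",\\s*")[[1]]
    unlist(lapply(2:max_n, function(n) {
      if (length(tokens) < n) return(character(0))
      vapply(seq_len(length(tokens) - n + 1),
             function(i) paste(tokens[i:(i + n - 1)], collapse = ", "),
             character(1))
    }))
  }))
  # keep only sub-chains where the keyword is a complete activity
  pattern <- paste0("(^|, )", keyword, "(, |$)")
  grams <- grams[grepl(pattern, grams)]
  out <- as.data.frame(table(x = grams), stringsAsFactors = FALSE)
  names(out) <- c("x", "freq")
  out[order(-out$freq), ]   # most frequent sub-chains first
}

# Toy input, not the real data
toy <- data.frame(leg_activity = c("home, adpt, work, adpt, home",
                                   "home, adpt, work, home"),
                  stringsAsFactors = FALSE)
res <- count_subchains(toy$leg_activity, max_n = 3)
print(res)
```

With the real data you would pass the leg_activity column of your data frame instead of the toy one. Keep in mind that the number of distinct sub-chains grows quickly with max_n, and that overlapping sub-chains from the same person are all counted.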