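I have a chat log that includes 4 players (A, B, C, D), and each chat log sits in a single row of my data frame (across many groups). I'd like to split each phrase onto its own row and identify the speaker of that phrase in a separate column.
I've tried a lot of things with the following packages, without success: psych, dplyr, splitstackshape, tidytext, stringr, tidyr.
The data frame is not a .txt file, but I think it might need to be?
For example, this is what a chat log looks like. All of this is in one row of my dataset.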
[1] " *** D has joined the chat ***"
[2] " *** B has joined the chat ***"
[3] " *** A has joined the chat ***"
[4] "D: hi"
[5] "B: hello!"
[6] "A: Hi!"
[7] "D: i think oxygen is most important"
[8] "A: I do too"
[9] " *** C has joined the chat ***"
[10] "B: agreed, that was my #1"
[11] "A: I didnt at first but then on second guess"
[12] "A: oxygen then water"
[13] "C: hi hi"
I would like the following (each row is a new phrase, with these columns):
PlayerID | Phrase
---|---
B | hello!
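library(dplyr)
library(tidyr)

d %>%
  # d is a 1 x 13 character matrix; transpose it into a single column
  t() %>%
  # the unnamed column is called V1 by default
  as.data.frame() %>%
  # drop the "*** X has joined the chat ***" notices
  filter(!grepl("***", V1, fixed = TRUE)) %>%
  # split e.g. "D: hi" into speaker and phrase
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ") %>%
  # number of characters in each phrase
  mutate(Count = nchar(Phrase))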
Result:
#> PlayerID Phrase Count
#> 1 D hi 2
#> 2 B hello! 6
#> 3 A Hi! 3
#> 4 D i think oxygen is most important 32
#> 5 A I do too 8
#> 6 B agreed, that was my #1 22
#> 7 A I didnt at first but then on second guess 41
#> 8 A oxygen then water 17
#> 9 C hi hi 5
You could add the following to the end of the dplyr chain to count the number of characters per player:
group_by(PlayerID) %>%
summarize(Total = sum(Count))
#> PlayerID Total
#> <chr> <int>
#> 1 A 69
#> 2 B 28
#> 3 C 5
#> 4 D 34
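One caveat with splitting on ": ": a phrase that itself contains a colon followed by a space would be cut at the second match, and tidyr would warn and drop the remainder. A minimal sketch of one way around that, assuming the first ": " always ends the speaker label, is to pass `extra = "merge"` to `separate()`. The vector `x` below is made-up input for illustration and is not part of the original data.

```r
library(dplyr)
library(tidyr)

# Made-up input (an assumption for illustration): the second phrase contains ": "
x <- c("D: hi", "A: my ranking: oxygen then water")

res <- data.frame(V1 = x, stringsAsFactors = FALSE) %>%
  # extra = "merge" stops splitting once both columns are filled,
  # so everything after the first ": " stays in Phrase
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  mutate(Count = nchar(Phrase))

res
```

With the default `extra = "warn"`, the second row's `Phrase` would be truncated to the text before the second ": ".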
Data:
d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***",
" *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
"D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***",
"B: agreed, that was my #1", "A: I didnt at first but then on second guess",
"A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))
Created on 2022-05-25 by the reprex package (v2.0.1)
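The question mentions many groups, each with its chat in one row. The exact storage format isn't shown, so the following is only a sketch under an assumed layout: a hypothetical data frame `chats` with a `Group` column and a `Log` column, where each `Log` cell holds the whole chat as one string with messages separated by newlines. Under that assumption, `tidyr::separate_rows()` can expand each cell into one row per message while keeping the group identifier, and the data can stay in a data frame rather than being written to a .txt file.

```r
library(dplyr)
library(tidyr)

# Hypothetical layout (an assumption): one row per group, whole chat in one string
chats <- data.frame(
  Group = c(1, 2),
  Log = c(" *** D has joined the chat ***\nD: hi\nB: hello!",
          " *** A has joined the chat ***\nA: Hi!"),
  stringsAsFactors = FALSE
)

long <- chats %>%
  # one message per row; Group is repeated for each message
  separate_rows(Log, sep = "\n") %>%
  # drop the join notices
  filter(!grepl("***", Log, fixed = TRUE)) %>%
  # split speaker from phrase at the first ": "
  separate(Log, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge")

long
```

If the messages are separated by something other than a newline in the real data, the `sep` argument of `separate_rows()` would need to change accordingly.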