R - Splitting players and chat messages out of a chat log (text mining)



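I have a chat log involving 4 players (A, B, C, D), and the whole log sits in a single row of my data frame (across many groups). I want to split each phrase into its own row and identify the speaker of that phrase in a separate column.

I have tried many things with the following packages, but none of them worked: psych, dplyr, splitstackshape, tidytext, stringr, tidyr.

The data frame is not a .txt file, but I think it might need to be?

For example, the chat log looks like this. All of this is in a single row of my data set.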

[1] " *** D has joined the chat ***"                                                                                                                                         
[2] " *** B has joined the chat ***"                                                                                                                                         
[3] " *** A has joined the chat ***"                                                                                                                                         
[4] "D: hi"                                                                                                                                                                  
[5] "B: hello!"                                                                                                                                                              
[6] "A: Hi!"                                                                                                                                                                 
[7] "D: i think oxygen is most important"                                                                                                                                    
[8] "A: I do too"                                                                                                                                                            
[9] " *** C has joined the chat ***"                                                                                                                                         
[10] "B: agreed, that was my #1"                                                                                                                                              
[11] "A: I didnt at first but then on second guess"                                                                                                                           
[12] "A: oxygen then water"                                                                                                                                                   
[13] "C: hi hi"                                                              

I would like the following (these columns, with each row being a new phrase):

PlayerID  Phrase
D         hi
B         hello!
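library(dplyr)
library(tidyr)

d %>%
  t() %>%                                      # 1 x 13 matrix -> 13 x 1, one chat line per row
  as.data.frame() %>%                          # the unnamed matrix column gets the default name V1
  filter(!grepl("***", V1, fixed = TRUE)) %>%  # drop the "*** X has joined the chat ***" notices
  separate(V1, into = c("PlayerID", "Phrase"),
           sep = ": ", extra = "merge") %>%    # split at the first ": " only
  mutate(Count = nchar(Phrase))                # number of characters in each phrase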

Result:

#>   PlayerID                                    Phrase Count
#> 1        D                                        hi     2
#> 2        B                                    hello!     6
#> 3        A                                       Hi!     3
#> 4        D          i think oxygen is most important    32
#> 5        A                                  I do too     8
#> 6        B                    agreed, that was my #1    22
#> 7        A I didnt at first but then on second guess    41
#> 8        A                         oxygen then water    17
#> 9        C                                     hi hi     5
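
The question mentions many groups, while the pipeline above handles a single log. The following is only a minimal sketch of one way to extend it, assuming the logs are stored as a list column with one row per group. The names `chats`, `group` and `log` are hypothetical and do not come from the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: one row per group, chat lines kept in a list column.
chats <- tibble(
  group = c("g1", "g2"),
  log = list(
    c(" *** D has joined the chat ***", "D: hi", "B: hello!"),
    c(" *** A has joined the chat ***", "A: Hi!", "C: hi hi")
  )
)

phrases <- chats %>%
  unnest(log) %>%                               # one row per chat line, group id repeated
  filter(!grepl("***", log, fixed = TRUE)) %>%  # drop the join notices
  separate(log, into = c("PlayerID", "Phrase"),
           sep = ": ", extra = "merge")         # split at the first ": " only

phrases
```

If the real data stores the log differently (for example as one long string per group), the `unnest()` step would need to be replaced by a step that first splits that string into lines.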

To count the number of characters per player, append the following to the end of the dplyr chain:

group_by(PlayerID) %>%
summarize(Total = sum(Count))
#>   PlayerID Total
#>   <chr>    <int>
#> 1 A           69
#> 2 B           28
#> 3 C            5
#> 4 D           34
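
Since the summary step above is only a fragment, here is a sketch of the whole chain in one piece. It uses a shortened copy of the sample data (the name `d_small` is made up for this illustration), so the totals differ from the full output shown above.

```r
library(dplyr)
library(tidyr)

# Shortened, hypothetical copy of the sample data: a 1 x 5 character matrix.
d_small <- matrix(
  c(" *** D has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
    "D: i think oxygen is most important"),
  nrow = 1
)

totals <- d_small %>%
  t() %>%
  as.data.frame() %>%
  filter(!grepl("***", V1, fixed = TRUE)) %>%
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  mutate(Count = nchar(Phrase)) %>%
  group_by(PlayerID) %>%             # one group per player
  summarize(Total = sum(Count))      # total characters typed by each player

totals
```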

Data:

d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***", 
" *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!", 
"D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***", 
"B: agreed, that was my #1", "A: I didnt at first but then on second guess", 
"A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))
Created on 2022-05-25 by the reprex package (v2.0.1)
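
The `.Dim = c(1L, 13L)` attribute makes `d` a character matrix with one row, which is why the answer transposes it before building a data frame. A small sketch with a three-element stand-in (the name `m` is hypothetical) shows the shapes involved and where the column name `V1` comes from:

```r
# Stand-in for `d`: a 1 x 3 character matrix built the same way.
m <- structure(c("D: hi", "B: hello!", "A: Hi!"), .Dim = c(1L, 3L))

shape_before <- dim(m)                      # one row, three columns
shape_after  <- dim(t(m))                   # three rows, one column
col_names    <- names(as.data.frame(t(m)))  # default name for an unnamed matrix column

shape_before
shape_after
col_names
```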
