Is there an R function to remove rows that do not start with a timestamp?

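I am trying to get familiar with R by cleaning up some data from a WhatsApp chat between me and a friend. So far I have converted the .txt to .csv, but I have a problem.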





I have been trying to use regular expressions. I have been following the tutorial at https://journocode.com/2016/01/31/project-visualizing-whatsapp-chat-logs-part-1-cleaning-data/ but the result is not what I expected.

# Add 5 empty columns to the right to make space for the shift
chat <- cbind(chat, matrix(nrow = nrow(chat), ncol = 5))
cat("Rows without time stamp:", length(grep("^\\D", chat[,1])),
    "(", grep("^\\D", chat[,1]), ")", "\n")
for(row in grep("^\\D", chat[,1])){
  end <- which(is.na(chat[row,]))[1] # first column without text in it
  chat[row, 6:(5+end)] <- chat[row, 1:(end-1)]
  chat[row, 1:(end-1)] <- NA
}
chat <- chat[-which(apply(chat, 1, function(x) all(is.na(x))) == TRUE),]

I ended up with a very messy csv file. Timestamps are all over the place, and so are the chat messages. Definitely not the result I had in mind.


chat_raw <- scan(text = "
12/07/2017, 22:35 - Messages to this group are now secured with end-to-end encryption. Tap for more info.
12/07/2017, 22:35 - You created group 'Tes'
12/07/2017, 22:35 - Johannes Gruber: <Media omitted>
12/07/2017, 22:35 - Johannes Gruber: Fruit bread with cheddar <U+263A><U+0001F44C><U+0001F3FB>
13/07/2017, 09:12 - Test: It's fun doing text analysis with R
isn't it?
13/07/2017, 09:16 - Johannes Gruber: Haha it sure is <U+0001F605>
28/09/2018, 13:27 - Johannes Gruber: Did you know there is an incredible number of emojis in WhatsApp? Check it out:
", what = character(), sep = "\n")


time <- stringi::stri_extract_first_regex(
  str = chat_raw,
  pattern = "^\\d{2}/\\d{2}/\\d{4}, \\d{2}:\\d{2}"
)
time

#> [1] "12/07/2017, 22:35" "12/07/2017, 22:35" "12/07/2017, 22:35"
#> [4] "12/07/2017, 22:35" "13/07/2017, 09:12" NA                 
#> [7] "13/07/2017, 09:16" "28/09/2018, 13:27"
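As a minimal base R sketch of the same check (assuming the `dd/mm/yyyy, hh:mm` format shown above; `lines`, `has_stamp` and `kept` are illustrative names, not part of the original answer), `grepl()` flags which lines start with a timestamp. Simply dropping the other lines would do what the question literally asks, but it throws away the continuation text, which is why the next step merges those lines instead.

```r
# Illustrative input: the second element is a continuation line
lines <- c(
  "13/07/2017, 09:12 - Test: It's fun doing text analysis with R",
  "isn't it?",
  "13/07/2017, 09:16 - Johannes Gruber: Haha it sure is"
)

# TRUE where a line starts with dd/mm/yyyy, hh:mm
has_stamp <- grepl("^\\d{2}/\\d{2}/\\d{4}, \\d{2}:\\d{2}", lines, perl = TRUE)

# Keep only the lines that start with a timestamp (continuation text is lost)
kept <- lines[has_stamp]
```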


for (l in which(is.na(time))) {
  chat_raw[l - 1] <- stringi::stri_paste(chat_raw[l - 1], chat_raw[l],
                                         sep = " ")
}

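In this case, which(is.na(time)) returns only 6, since that is the only line where time is NA. So you can read chat_raw[l - 1] as chat_raw[5], i.e. the fifth line of chat_raw. stringi::stri_paste behaves like paste() here, so line 6 is appended to line 5. You can choose a different separator if you like; in the package I chose "\n" to mark the line break. At this point, chat_raw and the time vector still contain the extra element, which is now useless to us. We can remove it with: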

chat_raw <- chat_raw[!is.na(time)]
time <- time[!is.na(time)]
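The loop above appends each continuation line to the line directly before it, so for a message spread over three or more lines the later parts would be appended to a line that is dropped afterwards. A possible base R alternative, sketched under the assumption that the first line of the export starts with a timestamp (`lines`, `msg_id` and `merged` are illustrative names), assigns every line to the most recent timestamped line via `cumsum()` and pastes each group together:

```r
# Illustrative input: one message spread over three lines
lines <- c(
  "13/07/2017, 09:12 - Test: It's fun doing text analysis with R",
  "isn't it?",
  "I think so",
  "13/07/2017, 09:16 - Johannes Gruber: Haha it sure is"
)

has_stamp <- grepl("^\\d{2}/\\d{2}/\\d{4}, \\d{2}:\\d{2}", lines, perl = TRUE)

# Each timestamped line starts a new message; continuation lines inherit its id
msg_id <- cumsum(has_stamp)

# Paste all lines belonging to one message together, separated by a space
merged <- unname(vapply(split(lines, msg_id), paste, character(1), collapse = " "))
```

After this step every element of `merged` starts with a timestamp, so no separate removal of leftover lines is needed.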


tibble::tibble(
  time = time,
  text = chat_raw
)
#> # A tibble: 7 x 2
#>   time            text                                                     
#>   <chr>           <chr>                                                    
#> 1 12/07/2017, 22~ 12/07/2017, 22:35 - Messages to this group are now secur~
#> 2 12/07/2017, 22~ 12/07/2017, 22:35 - You created group 'Tes'              
#> 3 12/07/2017, 22~ 12/07/2017, 22:35 - Johannes Gruber: <Media omitted>     
#> 4 12/07/2017, 22~ 12/07/2017, 22:35 - Johannes Gruber: Fruit bread with ch~
#> 5 13/07/2017, 09~ 13/07/2017, 09:12 - Test: It's fun doing text analysis w~
#> 6 13/07/2017, 09~ 13/07/2017, 09:16 - Johannes Gruber: Haha it sure is <U+~
#> 7 28/09/2018, 13~ 28/09/2018, 13:27 - Johannes Gruber: Did you know there ~


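If you want to do more with WhatsApp data, check out the demo of my package. I haven't released it on CRAN yet because I think the contribution so far is a bit small, but if you can think of cool features, I can add them, and maybe over time this becomes a legit package.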
