I have a column containing text from Twitter. The texts are a mix of original posts and replies to accounts.
Example of df (2 rows out of 2 million):
ID | Tweet
1 | @mcr_chick i wanna sleep.lol.
2 | Someone burned a hole.
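I want to remove the '@' and the name attached to it from every tweet that has one. As you can see, some tweets have no @name, so I need some way to group by only the IDs whose tweet contains '@', or something along those lines.
Desired output: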
ID | Tweet | Original_Tweet | Reply_Tweet
1 | @mcr_chick i wanna sleep.lol. | NA | i wanna sleep.lol.
2 | Someone burned a hole. | Someone burned a hole. |
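I am using the sub command to remove the '@' from the text and then drop the first word of the tweet, but I still need to group by the tweets that contain '@'.
Any help would be greatly appreciated!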
A slightly different approach from akrun's:
library(tidyverse)

data <- tibble(id = c(1, 2),
               tweet = c("@mcr_chick i wanna sleep.lol.",
                         "Someone burned a hole."))

data %>%
  mutate(
    # original tweet
    original = ifelse(
      # look for a twitter handle (the backslash must be doubled in an R string)
      str_detect(tweet, "@\\w+"),
      # if found, NA
      NA,
      # otherwise, the text in the tweet column
      tweet),
    # reply tweet
    reply = ifelse(
      # look for a twitter handle
      str_detect(tweet, "@\\w+"),
      # if found, remove the handle
      str_remove(tweet, "@\\w+"),
      # otherwise NA
      NA),
    # clean up leftover whitespace at the ends
    reply = str_trim(reply)
  )
This returns:
   id tweet               original         reply
<dbl> <chr>               <chr>            <chr>
    1 @mcr_chick i wanna~ NA               i wanna sle~
    2 Someone burned a h~ Someone burned ~ NA
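One thing worth checking with this approach is where the handle sits in the tweet. The sketch below compares the unanchored pattern used above with an anchored variant. The third tweet and the anchored pattern "^@\\w+" are my own assumptions for illustration, not part of the question's data.

```r
library(stringr)

tweets <- c("@mcr_chick i wanna sleep.lol.",   # from the question: starts with a handle
            "Someone burned a hole.",          # from the question: no handle
            "thanks @mcr_chick for the tip")   # hypothetical: handle in the middle

# Pattern from the answer above: TRUE when a handle appears anywhere
unanchored <- str_detect(tweets, "@\\w+")

# Assumed variant: TRUE only when the tweet begins with a handle
anchored <- str_detect(tweets, "^@\\w+")

print(data.frame(tweets, unanchored, anchored))
```

If a mention in the middle of a tweet should still count as an original post, the anchored form is probably the safer choice. If any mention marks a reply, the unanchored form already does the job.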
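We can use str_extract to pull out the part of 'Tweet' that runs from the start of the string (^) and contains no '@' character. That creates 'Original_Tweet'; the first row becomes NA because it begins with '@'. For 'Reply_tweet', case_when removes the substring that starts with '@', continues through the non-space characters that follow, and ends with one or more whitespace characters (\\s+ in the R string). Rows that match no condition in case_when get NA by default.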
library(dplyr)
library(stringr)

df1 %>%
  mutate(Original_Tweet = str_extract(Tweet, "^[^@]+"),
         Reply_tweet = case_when(str_detect(Tweet, "@") ~
                                   str_remove(Tweet, "^@[^ ]+\\s+")))
Output:
  ID                         Tweet        Original_Tweet        Reply_tweet
1  1 @mcr_chick i wanna sleep.lol.                  <NA> i wanna sleep.lol.
2  2         Someone burned a hole Someone burned a hole               <NA>
Data:
df1 <- structure(list(ID = 1:2, Tweet = c("@mcr_chick i wanna sleep.lol.",
"Someone burned a hole")), class = "data.frame", row.names = c(NA,
-2L))
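To see what each piece contributes, here is a minimal sketch that runs the same building blocks on a plain character vector instead of a data frame. The vector name tweets and the result names are mine; the patterns are the ones from this answer.

```r
library(dplyr)
library(stringr)

tweets <- c("@mcr_chick i wanna sleep.lol.", "Someone burned a hole")

# "^[^@]+" needs at least one non-'@' character at the very start,
# so a tweet that begins with '@' yields NA
original <- str_extract(tweets, "^[^@]+")

# str_remove() on its own leaves non-matching strings untouched ...
removed <- str_remove(tweets, "^@[^ ]+\\s+")

# ... so case_when() supplies the NA: rows where no condition is TRUE get NA
reply <- case_when(str_detect(tweets, "@") ~ str_remove(tweets, "^@[^ ]+\\s+"))

print(original)
print(removed)
print(reply)
```

Without the case_when wrapper, the second tweet would be copied into the reply column unchanged, which is why the str_detect condition is there.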
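Here is an alternative that uses a predefined regex pattern, " ?@\\w+ ?". It matches a handle: an '@' followed by word characters up to the end of the handle, plus an optional space on either side. We then combine a few stringr functions with an ifelse statement: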
library(dplyr)
library(stringr)

tweet_pattern <- " ?@\\w+ ?"

df %>%
  # a replacement of NA_character_ turns the whole matching string into NA
  mutate(Original_Tweet = str_replace(Tweet, tweet_pattern, NA_character_),
         Reply_Tweet = ifelse(str_detect(Tweet, tweet_pattern),
                              str_remove(Tweet, tweet_pattern),
                              NA_character_))
Output:
  ID                         Tweet         Original_Tweet        Reply_Tweet
1  1 @mcr_chick i wanna sleep.lol.                   <NA> i wanna sleep.lol.
2  2        Someone burned a hole. Someone burned a hole.               <NA>
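A reply can address more than one account, and str_remove only strips the first match. The sketch below shows the difference from str_remove_all, using the same pattern. The two-handle tweet is a made-up example, not taken from the question's data.

```r
library(stringr)

tweet_pattern <- " ?@\\w+ ?"

# Hypothetical tweet that replies to two handles at once
tweet <- "@mcr_chick @someone_else i wanna sleep.lol."

first_only <- str_remove(tweet, tweet_pattern)       # removes the first handle only
all_gone   <- str_remove_all(tweet, tweet_pattern)   # removes every handle

print(first_only)
print(all_gone)
```

If tweets like this exist in the 2 million rows, swapping str_remove for str_remove_all in the mutate call may be what you want. Note that it would also strip handles mentioned in the middle of the text.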
The logic is similar to @Anthony Schmidt's answer, but in base R.
transform(data,
          Original_Tweet = ifelse(grepl('@', tweet, fixed = TRUE), NA, tweet),
          reply_tweet = ifelse(grepl('@', tweet, fixed = TRUE),
                               sub('@.*?\\s+', '', tweet), NA))
#  id                         tweet         Original_Tweet        reply_tweet
#1  1 @mcr_chick i wanna sleep.lol.                   <NA> i wanna sleep.lol.
#2  2        Someone burned a hole. Someone burned a hole.               <NA>
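If the split is needed in more than one place, the same base R logic can be wrapped in a small function. This is only a sketch: the name split_tweets is hypothetical, and it assumes a reply is a tweet that begins with '@', which is stricter than the grepl check above.

```r
# Hypothetical helper: split a character vector of tweets into
# original posts and replies, using base R only.
split_tweets <- function(tweet) {
  # Assumption: a reply is a tweet whose first character is '@'
  is_reply <- startsWith(tweet, "@")
  data.frame(
    tweet          = tweet,
    Original_Tweet = ifelse(is_reply, NA_character_, tweet),
    # Strip the leading handle and any whitespace that follows it
    Reply_Tweet    = ifelse(is_reply, sub("^@\\S+\\s*", "", tweet), NA_character_),
    stringsAsFactors = FALSE
  )
}

res <- split_tweets(c("@mcr_chick i wanna sleep.lol.", "Someone burned a hole."))
print(res)
```

Both startsWith and sub are vectorised, so the function works on a whole column at once; I have not timed it on 2 million rows.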