Problem: some elements of a character vector contain absurdly large numbers.
troublesome_tweets
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751"
[2] "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."
[3] "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"
[4] "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"
[5] "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"
[6] "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"
[7] "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."
[8] "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"
[9] "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"
Current approach: chain gsub calls, each with a regular expression tailored to one specific element, to strip the large number from that element.
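clean_individual_tweets <- function(x) {
  # Each pattern below was written with one specific tweet in mind.
  x <- gsub("[0-9][.][0-9]+", " ", x)   # one digit, a dot, then digits (3.1415...)
  x <- gsub("[...][0-9]+", "", x)       # "[...]" is a character class: one dot, then digits
  x <- gsub("[0-9]+[...]", "", x)       # digits, then one dot
  x <- gsub("[0-9]+[ ][x]", "", x)      # digits followed by " x"
  x <- gsub("[.][ ][0-9]+", " ", x)     # ". " followed by digits
  x <- gsub("[#][ ][0-9]+", " ", x)     # "# " followed by digits
  x <- gsub("[<][0-9]+", " ", x)        # "<" followed by digits
  x <- gsub("[a-zA-Z][0-9]+", " ", x)   # a letter followed by digits
  x <- gsub("555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555", " ", x)
  x <- gsub("1000000000000000000000000000000000000000000000000000000000000000", " ", x)
  x
}
cleaned_tweets <- clean_individual_tweets(troublesome_tweets)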
cleaned_tweets
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: "
[2] "Sick of this yet? .. .."
[3] "He only did that to the Cavs in that one series"
[4] "Am I a robot? Yes, affirmative "
[5] "Lazy rule ; You were too lazy to read the whole number .(:"
[6] "#thankyoushootusdown I LOVE YOU GUYS "
[7] "Hvrtujikdsjktrfedwqcvbntrfeds tredscvbhjutwsdfvghyu. ! How I feel."
[8] "I want to dock "
[9] "x's to the club wit momz here 0 while gone "
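Desired approach: a single regular expression that removes all of these large numbers (and others like them) from a large character vector, for example by replacing any run of 10 or more digits. I do not want to remove every number; that would be easy. I specifically want to remove the artifact numbers and keep the non-artifact ones.
Data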
troublesome_tweets <- c(
"Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751"
, "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."
, "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"
, "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"
, "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"
, "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"
, "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."
, "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"
, "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"
)
For "at least 10 digits", try the following:
gsub("[0-9]{10,}","",troublesome_tweets)
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3."
[2] "Sick of this yet? ... ..."
[3] "He only did that x to the Cavs in that one series"
[4] "Am I a robot? Yes, affirmative. "
[5] "Lazy rule # ; You were too lazy to read the whole number .(:"
[6] "#thankyoushootusdown I LOVE YOU GUYS <"
[7] "Hvrtujikdsjktrfedwqcvbntrfedswtredscvbhjutwsdfvghyu. ! How I feel."
[8] "I want to dock "
[9] "x's to the club wit momz here 0 while gone "
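The same idea can be wrapped in a small helper so the digit threshold is adjustable. This is only a sketch: the function name `strip_long_digit_runs`, its `min_digits` argument, and the sample tweets other than those taken from the data above are hypothetical. It also collapses the whitespace left behind, which the one-liner above does not do.

```r
# Hypothetical helper (not from the original answer): remove any run of
# at least `min_digits` consecutive digits, then tidy leftover whitespace.
strip_long_digit_runs <- function(x, min_digits = 10) {
  # Build a pattern such as "[0-9]{10,}" from the threshold.
  pattern <- sprintf("[0-9]{%d,}", min_digits)
  x <- gsub(pattern, "", x)
  # Collapse runs of whitespace created by the removal and trim the ends.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

sample_tweets <- c(
  "Lazy rule # 18460826292036391273639018263920273820183737473920383930; done",
  "Call me at 555 1234 in 2019",  # made-up tweet with short numbers only
  "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"
)

cleaned <- strip_long_digit_runs(sample_tweets)
print(cleaned)
```

Short numbers such as `555`, `1234`, `2019` and the lone `0` are left alone because none of them reaches the 10-digit threshold. Fragments attached to a removed number, like the `3.` in tweet 1 or the `<` in tweet 6, would still remain and need their own handling.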