A regular expression to remove excessively large numbers from a character vector in R



The problem: some elements of the character vector contain absurdly large numbers.

troublesome_tweets
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751"
[2] "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."    
[3] "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"                             
[4] "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"                   
[5] "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"                       
[6] "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"                
[7] "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."            
[8] "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"                            
[9] "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"                                

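Current approach: call gsub repeatedly, each time with a regular expression tailored to one specific element, to remove the large number from that element.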

clean_individual_tweets <- function(x){
    x <- gsub("[0-9][.][0-9]+", " ", x) 
    x <- gsub("[...][0-9]+", "", x) 
    x <- gsub("[0-9]+[...]", "", x) 
    x <- gsub("[0-9]+[ ][x]", "", x) 
    x <- gsub("[.][ ][0-9]+", " ", x) 
    x <- gsub("[#][ ][0-9]+", " ", x)
    x <- gsub("[<][0-9]+", " ", x) 
    x <- gsub("[a-zA-Z][0-9]+", " ", x) 
    x <- gsub("555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555", " ", x) 
    x <- gsub("1000000000000000000000000000000000000000000000000000000000000000", " ", x)
    x  # return the cleaned vector explicitly
}

cleaned_tweets <- clean_individual_tweets(troublesome_tweets)
cleaned_tweets
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate:  "
[2] "Sick of this yet? .. .."                                                                  
[3] "He only did that  to the Cavs in that one series"                                         
[4] "Am I a robot? Yes, affirmative "                                                          
[5] "Lazy rule  ; You were too lazy to read the whole number .(:"                              
[6] "#thankyoushootusdown I LOVE YOU GUYS  "                                                   
[7] "Hvrtujikdsjktrfedwqcvbntrfeds tredscvbhjutwsdfvghyu. ! How I feel."                       
[8] "I want to dock  "                                                                         
[9] "x's to the club wit momz here 0 while gone  "      

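Desired approach: a single regular expression that removes all of these large numbers, and others like them, from a large character vector, for example by replacing any number made up of more than 10 digits. I don't want to remove all numbers, which would be easy. I specifically want to remove the artifact numbers and keep the non-artifact ones.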
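Before settling on a cutoff such as 10 digits, it may help to look at how long the digit runs in the data actually are. The following is a minimal sketch using base R only; `sample_text` and its contents are made up for illustration and are not part of the original data.

```r
# Small illustrative input; these strings are invented for this sketch.
sample_text <- c(
  "order 42 shipped 12345678901234567890",
  "no digits here",
  "ids 123 and 9876543210"
)

# gregexpr() locates every run of digits; regmatches() extracts them per element.
digit_runs <- regmatches(sample_text, gregexpr("[0-9]+", sample_text))

# Length of each run, so a cutoff can be judged against real data.
run_lengths <- lapply(digit_runs, nchar)

# TRUE where an element contains at least one run of 10 or more digits.
has_long_run <- grepl("[0-9]{10,}", sample_text)
```

Inspecting `run_lengths` on the real vector shows whether legitimate numbers (years, counts, phone numbers) sit safely below the chosen threshold.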

Data

troublesome_tweets <- c(
  "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751"
, "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."    
, "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"                             
, "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"                   
, "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"                       
, "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"                
, "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."            
, "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"                            
, "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"                                
)

For "at least 10 digits", try the following:

 gsub("[0-9]{10,}","",troublesome_tweets)
 [1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3."
 [2] "Sick of this yet? ... ..."                                                                 
 [3] "He only did that  x to the Cavs in that one series"                                        
 [4] "Am I a robot? Yes, affirmative. "                                                          
 [5] "Lazy rule # ; You were too lazy to read the whole number .(:"                              
 [6] "#thankyoushootusdown I LOVE YOU GUYS <"                                                    
 [7] "Hvrtujikdsjktrfedwqcvbntrfedswtredscvbhjutwsdfvghyu. ! How I feel."                        
 [8] "I want to dock "                                                                           
 [9] "x's to the club wit momz here 0 while gone "  
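The output above still leaves fragments such as "3." and trailing spaces. One possible refinement, sketched here under assumptions, also consumes an optional "digits plus decimal point" prefix and tidies the leftover whitespace. The helper name `strip_long_digit_runs` and its `min_digits` parameter are hypothetical, and the `examples` vector is a shortened stand-in for the real data.

```r
# Hypothetical helper, not part of the original answer.
strip_long_digit_runs <- function(x, min_digits = 10) {
  # Builds e.g. "(?:[0-9]+\\.)?[0-9]{10,}": an optional "3." style prefix
  # followed by a run of at least min_digits digits.
  pattern <- sprintf("(?:[0-9]+\\.)?[0-9]{%d,}", min_digits)
  # perl = TRUE is used because the pattern relies on a non-capturing group.
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse runs of spaces left behind, then trim both ends.
  x <- gsub(" {2,}", " ", x)
  trimws(x)
}

examples <- c(
  "calculate: 3.1415926535897932384626433832795028841971693993751",
  "here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000",
  "call 5551234 before 2020"
)
strip_long_digit_runs(examples)
```

Short numbers such as "0", "5551234" and "2020" are kept because they never reach the 10-digit threshold. Note that this sketch would also remove a legitimate decimal with 10 or more fractional digits, so the cutoff should be checked against the actual data.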
