R - Split a vector by uppercase and lowercase

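I have already read some good questions about splitting uppercase and lowercase, such as this one and this one, but I cannot get them to work on my data.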

# here is my data
data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
,"OTHER UPPER CASES   And other words"
, "Some lower cases        AND UPPER CASES"
,"ONLY UPPER CASES"
,"Only lower cases, maybe"
,"UPPER lower UPPER!"))
data
                                      text
1 SOME UPPERCASES     And some Lower Cases
2      OTHER UPPER CASES   And other words
3  Some lower cases        AND UPPER CASES
4                         ONLY UPPER CASES
5                  Only lower cases, maybe
6                        UPPER lower UPPER!

The desired result should look like this:

   V1                  V2
1  SOME UPPERCASES     And some Lower Cases
2  OTHER UPPER CASES   And other words
3  AND UPPER CASES     Some lower cases
4  ONLY UPPER CASES    NA
5  NA                  Only lower cases, maybe
6  UPPER UPPER!        lower

So the goal is to separate every word written entirely in uppercase from all the other words.

As a test, I tried the following on a single row, but none of them worked well:

strsplit(x = data$text[1], split = "[[:upper:]]")   # error
gsub('([[:upper:]])', ' \\1', data$text[1])         # not good results
library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b')))   # not good results either

data:

data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
,"OTHER UPPER CASES   And other words"
, "Some lower cases        AND UPPER CASES"
,"ONLY UPPER CASES"
,"Only lower cases, maybe"
,"UPPER lower UPPER!"))

Code:

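library(magrittr)
# Words made up entirely of capital letters, joined back together per row
UpperCol    <- regmatches(data$text , gregexpr("\\b[A-Z]+\\b",data$text)) %>% lapply(paste, collapse = " ") %>% unlist
# All other words: the negative lookahead skips all-uppercase words (needs perl = TRUE)
notUpperCol <- regmatches(data$text , gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b",data$text, perl = TRUE)) %>% lapply(paste, collapse = " ") %>% unlist
result <- data.frame(I(UpperCol), I(notUpperCol))
# Rows with no match become NA instead of an empty string
result[result == ""] <- NA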

Result:

#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower

  • The trick is the regular expression, so it is worth learning regex.
  • Thanks to Wiktor Stribiżew for some optimizations.
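
To unpack the two patterns used in this answer, here is a minimal sketch on a single made-up string (the string `x` and the variable names are illustrative assumptions, not part of the original answer). Both calls use `perl = TRUE` here for consistency; the lookahead pattern requires it.

```r
# Illustrative input, not taken from the question's data
x <- "UPPER lower Mixed UPPER!"

# \\b[A-Z]+\\b : a run of capital letters with a word boundary on each side,
# so "Mixed" is not matched (there is no boundary between "M" and "i")
upper_pat <- "\\b[A-Z]+\\b"

# (?![A-Z]+\\b) is a negative lookahead: at the start of a word, refuse to
# match if the whole word consists of capitals; otherwise take the letters
other_pat <- "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b"

upper_words <- regmatches(x, gregexpr(upper_pat, x, perl = TRUE))[[1]]
other_words <- regmatches(x, gregexpr(other_pat, x, perl = TRUE))[[1]]

upper_words  # "UPPER" "UPPER"
other_words  # "lower" "Mixed"
```

Because both patterns match letters only, punctuation such as the "!" or a comma is dropped, which is why row 5 of the result above reads "Only lower cases maybe" without the comma.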

An approach using the stringi package:

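library(stringi)
# All-uppercase words in each row (NA where there are none)
l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
# Remaining words; SIMPLIFY = FALSE keeps the result a list even if every row has the same length
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)
res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
# Rows without a match come out as "NA" or "", so turn both into real NA values
res[res == "NA"] <- NA
res[res == ""] <- NA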

This gives:

> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower
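
Another approach, using only base R:

separate <- function(x) {
  # Split the string into words on runs of whitespace
  x <- unlist(strsplit(as.character(x), "\\s+"))
  # TRUE for every word that contains at least one lowercase letter
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  list(paste(x[!with_lower], collapse = " "),  paste(x[with_lower], collapse = " "))
}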

do.call(rbind, lapply(data$text, separate))
     [,1]                [,2]
[1,] "SOME UPPERCASES"   "And some Lower Cases"   
[2,] "OTHER UPPER CASES" "And other words"        
[3,] "AND UPPER CASES"   "Some lower cases"       
[4,] "ONLY UPPER CASES"  ""                       
[5,] ""                  "Only lower cases, maybe"
[6,] "UPPER UPPER!"      "lower"  
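
The matrix above contains empty strings where the question asked for NA. A minimal sketch of one way to get a data frame with NA values follows; `split_case` is a hypothetical variant of `separate()` that returns a character vector instead of a list, and the column names V1 and V2 are taken from the desired output in the question.

```r
# A few rows from the question's data
text <- c("SOME UPPERCASES     And some Lower Cases",
          "ONLY UPPER CASES",
          "Only lower cases, maybe",
          "UPPER lower UPPER!")

# Hypothetical helper: same idea as separate(), but returns a character vector
split_case <- function(x) {
  words <- strsplit(x, "\\s+")[[1]]
  has_lower <- grepl("\\p{Ll}", words, perl = TRUE)
  c(paste(words[!has_lower], collapse = " "),
    paste(words[has_lower], collapse = " "))
}

# One row per input string, one column per group
m <- do.call(rbind, lapply(text, split_case))
res <- data.frame(V1 = m[, 1], V2 = m[, 2], stringsAsFactors = FALSE)

# Replace empty strings with NA, as in the desired output
res[res == ""] <- NA
res
```

Unlike the regex-extraction answers, this keeps punctuation attached to its word, so "UPPER!" and "cases," come through unchanged.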
