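I have already read some good questions about splitting uppercase and lowercase, like this one and this one, but I can't get them to work with my data.
# here is my data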
data <- data.frame(text = c("SOME UPPERCASES And some Lower Cases"
,"OTHER UPPER CASES And other words"
, "Some lower cases AND UPPER CASES"
,"ONLY UPPER CASES"
,"Only lower cases, maybe"
,"UPPER lower UPPER!"))
data
text
1 SOME UPPERCASES And some Lower Cases
2 OTHER UPPER CASES And other words
3 Some lower cases AND UPPER CASES
4 ONLY UPPER CASES
5 Only lower cases, maybe
6 UPPER lower UPPER!
The desired result should look like this:
                 V1                      V2
1   SOME UPPERCASES    And some Lower Cases
2 OTHER UPPER CASES         And other words
3   AND UPPER CASES        Some lower cases
4  ONLY UPPER CASES                      NA
5                NA Only lower cases, maybe
6      UPPER UPPER!                   lower
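So, all words written entirely in uppercase should be separated from the other words.
As a test, I tried a single row only, but none of these attempts worked well: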
strsplit(x= data$text[1], split="[[:upper:]]") # error
gsub('([[:upper:]])', ' \\1', data$text[1]) # not good results
library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b'))) # no good results either
data:
data <- data.frame(text = c("SOME UPPERCASES And some Lower Cases"
,"OTHER UPPER CASES And other words"
, "Some lower cases AND UPPER CASES"
,"ONLY UPPER CASES"
,"Only lower cases, maybe"
,"UPPER lower UPPER!"))
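code:
library(magrittr)
# words consisting only of capital letters
UpperCol <- regmatches(data$text, gregexpr("\\b[A-Z]+\\b", data$text)) %>% lapply(paste, collapse = " ") %>% unlist
# all other words: the negative lookahead skips all-uppercase words (needs perl = TRUE)
notUpperCol <- regmatches(data$text, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", data$text, perl = TRUE)) %>% lapply(paste, collapse = " ") %>% unlist
result <- data.frame(I(UpperCol), I(notUpperCol))
# an empty string means no match in that row
result[result == ""] <- NA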
result:
#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower
- The trick is the regular expression, so it is worth learning regex.
- Thanks to Wiktor Stribiżew for some optimizations.
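To see what the two patterns do, here is a minimal sketch on a single string. The variable names `x`, `upper` and `other` are only for illustration, and `perl = TRUE` is used for both calls here as an assumption to keep `[A-Z]` independent of the locale. The first pattern matches runs of capital letters enclosed by word boundaries; in the second, the negative lookahead `(?![A-Z]+\\b)` rejects any word that is entirely uppercase before `[a-zA-Z]+` consumes it.

```r
x <- "UPPER lower UPPER! Mixed"

# words made only of A-Z, delimited by word boundaries
upper <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x, perl = TRUE))[[1]]

# words that are not entirely uppercase: the lookahead fails on all-caps words
other <- regmatches(x, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", x, perl = TRUE))[[1]]

upper
other
```

Punctuation such as `!` or `,` is not part of either match, which is why the result above shows `UPPER UPPER` rather than `UPPER UPPER!`.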
An approach using the stringi package:
library(stringi)
# all-uppercase words per row (NA where a row has none)
l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
# remaining words per row; SIMPLIFY = FALSE keeps the result a list
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)
res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
res[res == "NA"] <- NA
res[res == ""] <- NA
which gives:
> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower
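One thing to keep in mind with the `setdiff` step: `setdiff()` treats its arguments as sets, so a non-uppercase word that occurs twice in a row would be kept only once. The base R sketch below illustrates this on made-up vectors (`words`, `upper`, `as_set` and `keep_all` are hypothetical names); indexing with `%in%` is one possible alternative if repeated words matter.

```r
# hypothetical word vectors for a single row
words <- c("UPPER", "lower", "UPPER", "lower")
upper <- c("UPPER", "UPPER")

# set difference: duplicates are collapsed
as_set <- setdiff(words, upper)

# logical indexing: duplicates and order are kept
keep_all <- words[!words %in% upper]

as_set
keep_all
```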
separate <- function(x) {
  # split on whitespace, then flag every word that contains a lowercase letter
  x <- unlist(strsplit(as.character(x), "\\s+"))
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  list(paste(x[!with_lower], collapse = " "), paste(x[with_lower], collapse = " "))
}
do.call(rbind, lapply(data$text, separate))
[,1] [,2]
[1,] "SOME UPPERCASES" "And some Lower Cases"
[2,] "OTHER UPPER CASES" "And other words"
[3,] "AND UPPER CASES" "Some lower cases"
[4,] "ONLY UPPER CASES" ""
[5,] "" "Only lower cases, maybe"
[6,] "UPPER UPPER!" "lower"
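If a data frame with NA in the empty cells is preferred, as in the desired output of the question, the result of `separate()` can be post-processed. A minimal sketch, assuming the same `separate()` as above (repeated here so the block runs on its own); the helper name `to_frame` and the column names `V1` and `V2` are hypothetical:

```r
# same splitter as above: words without any lowercase letter come first
separate <- function(x) {
  x <- unlist(strsplit(as.character(x), "\\s+"))
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  list(paste(x[!with_lower], collapse = " "), paste(x[with_lower], collapse = " "))
}

# hypothetical helper: two-column data frame, empty strings become NA
to_frame <- function(text) {
  parts <- lapply(text, separate)
  out <- data.frame(
    V1 = vapply(parts, function(p) p[[1]], character(1)),
    V2 = vapply(parts, function(p) p[[2]], character(1)),
    stringsAsFactors = FALSE
  )
  out[out == ""] <- NA
  out
}

res <- to_frame(c("ONLY UPPER CASES", "Only lower cases, maybe", "UPPER lower UPPER!"))
res
```

Because this approach splits on whitespace only, punctuation stays attached to its word, so `UPPER!` still counts as uppercase and `cases,` keeps its comma.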