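I have a string like this in R:
ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,
I want to do something like str.split(), splitting it into an array of strings on every combination of commas and quotes while keeping the commas inside the quoted dates, so that I get: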
ABCDE
January 10, 2010
F
GH
March 9, 2009
Here is one approach. Note that read.csv converts the bare F to the logical FALSE, as the output shows:
data.frame(list = na.omit(
  unname(unlist(read.csv(
    text = 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,',
    check.names = FALSE, header = FALSE)))))
list
1 ABCDE
2 January 10, 2010
3 FALSE
4 GH
5 March 9, 2009
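If the FALSE in row 3 is unwanted, a minimal sketch of one way around it is to read every column as character and then drop the empty fields. This assumes all fields should stay as text; the names txt, df and fields are only illustrative.

```r
txt <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'

# Read every column as character so that F is not converted to FALSE
df <- read.csv(text = txt, header = FALSE, colClasses = "character")

# Flatten the single row into a plain character vector
fields <- unlist(df, use.names = FALSE)

# Drop empty (or missing) fields
fields <- fields[!is.na(fields) & nzchar(fields)]
fields
```

With colClasses = "character" the empty fields come back as empty strings rather than NA, which is why the filter uses nzchar instead of na.omit.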
You should probably use a CSV parser here, but if you want a pure regex approach, you could try:
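library(stringr)
library(dplyr)
# Single quotes outside, so the inner double quotes need no escaping
x <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'
# Column 1 holds the full match, column 2 the capture group (NA for unquoted items)
y <- str_match_all(x, '"(.*?)"|[^,]+')[[1]]
# Prefer the captured text without quotes; fall back to the full match
output <- coalesce(y[,2], y[,1])
output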
[1] "ABCDE" "January 10, 2010" "F" "GH"
[5] "March 9, 2009"
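The regex pattern uses an alternation trick, which says to match:
"(.*?)"    a quoted date, capturing the text inside the quotes but not the quotes themselves
|          OR
[^,]+      a single unquoted CSV item
coalesce then takes the captured text where the quoted branch matched and the full match otherwise.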
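The same alternation idea can be sketched in base R, without stringr or dplyr. This is only an illustrative variant, not the code from the answer above: it matches the quoted fields with their quotes included and strips the quotes afterwards. perl = TRUE is used so that the alternatives are tried in order, quoted branch first.

```r
x <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'

# Match either a whole quoted field or a run of non-comma characters
m <- regmatches(x, gregexpr('"[^"]*"|[^,]+', x, perl = TRUE))[[1]]

# Strip the surrounding quotes from the quoted fields
output <- gsub('^"|"$', '', m)
output
```

Empty fields never match either branch, so they are skipped without a separate filtering step.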
If the pattern is as shown, another regex option is to create a delimiter and use read.table:
read.table(text = gsub('"', '', gsub('("[^,"]+,)(*SKIP)(*FAIL)|,',
    '\n', trimws(gsub(",{2,}", ",", str1), whitespace = ","), perl = TRUE)),
    header = FALSE, fill = TRUE, sep = "\n")
Output:
V1
1 ABCDE
2 January 10, 2010
3 F
4 GH
5 March 9, 2009
Or with scan:
data.frame(V1 = setdiff(scan(text = str1, sep = ",",
what = character()), ""))
Output:
V1
1 ABCDE
2 January 10, 2010
3 F
4 GH
5 March 9, 2009
Data:
str1 <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'
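One caveat with the scan version: setdiff also removes duplicated values, so a field that occurs twice would be returned only once. A minimal sketch that keeps duplicates, assuming only empty fields should be dropped, filters with nzchar instead. The names fields and res are illustrative.

```r
str1 <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'

# scan honours the double quotes, so the commas inside the dates are preserved
fields <- scan(text = str1, sep = ",", what = character(), quiet = TRUE)

# Keep every non-empty field, including any repeated values
res <- data.frame(V1 = fields[nzchar(fields)])
res
```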
Another option could be:
na.omit(stack(read.csv(text = str1, header = FALSE)))[1]
values
1 ABCDE
2 January 10, 2010
3 FALSE
4 GH
5 March 9, 2009
txt <- 'ABCDE,"January 10, 2010",F,,,,GH,"March 9, 2009",,,'