I have some text that I need to extract values from. Here is my text:
Charge: Larceny; Charge: Stealing a motor vehicle;
I am trying to create this:
Charge1   Charge2                    Charge3
Larceny   Stealing a motor vehicle   NA
Any ideas? Right now my code looks like this:
data$charge <- str_extract_all(data, "(?=Charge:)(\\D){4,100}")
But it only creates one column. Please help!
If your text always has the same format, this is very easy with tidyverse:
library(tidyverse)
df <- data.frame(text = c("Charge: Larceny; Charge: Stealing a motor vehicle;",
                          "Charge: some_charge; Charge: another_charge; Charge: something_else"))
# Split on the "; Charge: " separator, then strip the leading "Charge: " from the first column
df %>%
  separate(text, c("Charge1", "Charge2", "Charge3"), sep = "; Charge: ") %>%
  mutate(Charge1 = gsub("Charge: ", "", Charge1))
You may need to clean up some dangling semicolons afterwards.
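One way to deal with those dangling semicolons is to strip them before splitting. This is only a minimal sketch, assuming tidyr is installed; the two `sub()` patterns and the `fill = "right"` choice are my assumptions, not part of the answer above.

```r
library(tidyr)

df <- data.frame(
  text = c("Charge: Larceny; Charge: Stealing a motor vehicle;",
           "Charge: some_charge; Charge: another_charge; Charge: something_else"),
  stringsAsFactors = FALSE
)

# Drop a trailing semicolon, then the leading "Charge: " label
df$text <- sub(";\\s*$", "", df$text)
df$text <- sub("^Charge:\\s*", "", df$text)

# Split on the remaining "; Charge: " separators.
# fill = "right" pads rows that have fewer than three charges with NA.
out <- separate(df, text, c("Charge1", "Charge2", "Charge3"),
                sep = ";\\s*Charge:\\s*", fill = "right")
out
```

Cleaning first means the separator pattern only has to describe what sits between two charges, so no column ends up with a stray ";".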
We can do this with tidyverse:
library(tidyverse)
tibble(str1) %>%
  separate_rows(str1, sep = ";\\s*") %>%                      # one row per charge
  separate(str1, into = c("col1", "col2"), sep = ":\\s*") %>% # split label and value
  mutate(col1 = na_if(col1, "")) %>%                          # the empty trailing piece becomes NA
  fill(col1) %>%                                              # carry "Charge" down into that row
  mutate(col1 = paste0(col1, row_number())) %>%               # Charge1, Charge2, Charge3
  spread(col1, col2)
# A tibble: 1 x 3
#   Charge1 Charge2                  Charge3
#   <chr>   <chr>                    <chr>
# 1 Larceny Stealing a motor vehicle NA
Data
str1 <- "Charge: Larceny; Charge: Stealing a motor vehicle;"
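Using base R:
strng <- "Charge: Larceny; Charge: Stealing a motor vehicle;"
# Remove every "Charge:" label, then read what is left as semicolon-separated fields
read.table(text = gsub("\\s*Charge:\\s*", "", strng), sep = ";", fill = TRUE,
           col.names = paste0("Charge", 1:3))
  Charge1                  Charge2 Charge3
1 Larceny Stealing a motor vehicle      NA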
You can also use strcapture, although it is not as flexible as gsub:
strcapture(paste0(rep("\\s*Charge:\\s*([^;]+);", 2), collapse = ""), strng,
           data.frame(charge1 = character(), charge2 = character()))
  charge1                  charge2
1 Larceny Stealing a motor vehicle
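To illustrate why this is less flexible: the pattern is built for exactly two charges, so an input with only one charge does not match at all. A small sketch, where the second input string and the `perl = TRUE` and `stringsAsFactors = FALSE` settings are my additions for illustration:

```r
strng <- c("Charge: Larceny; Charge: Stealing a motor vehicle;",
           "Charge: Larceny;")  # second element is a made-up single-charge input

# Same pattern as above: "Charge: <value>;" repeated exactly twice
pattern <- paste0(rep("\\s*Charge:\\s*([^;]+);", 2), collapse = "")
proto <- data.frame(charge1 = character(), charge2 = character(),
                    stringsAsFactors = FALSE)

res <- strcapture(pattern, strng, proto, perl = TRUE)
res
```

The row for the single-charge string comes back as all NA, because strcapture fills non-matching inputs with NA instead of returning a partial match.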
A slight modification of your solution. Note the difference between ?= and ?<= (lookahead and lookbehind), and also that \\D matches ; as well.
str_extract_all(data, "(?<=Charge: )[^;]+")
[[1]]
[1] "Larceny" "Stealing a motor vehicle"
So str_extract_all() returns a list of vectors; how to put them into a data.frame is covered in other corners of StackOverflow.
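For completeness, here is one possible way to turn that list into a data frame. It is a minimal sketch, assuming stringr is installed; `text_vec` and the padding approach are my own illustration, not something taken from the linked answers.

```r
library(stringr)

# Hypothetical input: one string per record
text_vec <- c("Charge: Larceny; Charge: Stealing a motor vehicle;",
              "Charge: some_charge; Charge: another_charge; Charge: something_else;")

charges <- str_extract_all(text_vec, "(?<=Charge: )[^;]+")

# Pad every vector to the longest length; `length<-` fills the gap with NA
n <- max(lengths(charges))
padded <- lapply(charges, function(v) { length(v) <- n; v })

# Stack the padded vectors row by row and name the columns Charge1..ChargeN
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("Charge", seq_len(n))
out
```

The number of columns follows the record with the most charges, so shorter records simply get NA in the trailing columns.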