How do I extract a specific field from a text dataset?



My data:

dfx=structure(list(V1 = c("(Description and Operation, 100-00 General Information) <a data-searchnum=G2107576  data-procuid=G1620638>Acceleration Control - Overview", 
"(Description and Operation, 310-02 Acceleration Control) <a data-searchnum=G2232632  data-procuid=G2210282>Acceleration Control - System Operation and Component Description", 
"(Description and Operation, 310-02 Acceleration Control) <a data-searchnum=G2232633  data-procuid=G2210283>Acceleration Control", 
"(Diagnosis and Testing, 310-02 Acceleration Control) <a data-searchnum=G2118147  data-procuid=G2118148>Accelerator Pedal ")), class = "data.frame", row.names = c(NA, 
                                                -4L))

I need to extract the data-searchnum values and store them in a new df:

G2107576
G2232632
G2232633
G2118147
G2110035

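Use str_extract with a capture group ((...)) for the text that follows the data-searchnum= substring. Backslashes must be doubled inside an R string, and the group argument requires stringr 1.5.0 or later.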

library(stringr)
str_extract(dfx$V1, 'data-searchnum=(\\S+)', group = 1)
[1] "G2107576" "G2232632" "G2232633" "G2118147"

Or use str_replace to capture the non-whitespace characters after data-searchnum= and replace the whole string with the backreference (\\1)

str_replace(dfx$V1, ".*data-searchnum=(\\S+)\\s+.*", "\\1")
[1] "G2107576" "G2232632" "G2232633" "G2118147"
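
The two approaches behave differently when a row does not contain the attribute. The following is a minimal sketch of that difference; the second input string is a hypothetical example that is not part of the original data.

```r
library(stringr)

# Hypothetical inputs: one string with the attribute, one without it
x <- c("<a data-searchnum=G2107576  data-procuid=G1620638>Overview",
       "no attribute here")

# str_extract() returns NA where the pattern does not match
extracted <- str_extract(x, "data-searchnum=(\\S+)", group = 1)

# str_replace() leaves a non-matching string unchanged
replaced <- str_replace(x, ".*data-searchnum=(\\S+)\\s+.*", "\\1")

print(extracted)
print(replaced)
```

If some rows may lack data-searchnum, str_extract is probably the safer choice, because the NA makes the missing value visible instead of passing the original text through.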

If we are creating a new data frame

library(dplyr)
df2 <- dfx %>%
  mutate(V1 = str_extract(V1, 'data-searchnum=(\\S+)', group = 1))
> df2
V1
1 G2107576
2 G2232632
3 G2232633
4 G2118147

Or, in base R, use the same approach as str_replace

sub(".*data-searchnum=(\\S+)\\s+.*", "\\1", dfx$V1)
[1] "G2107576" "G2232632" "G2232633" "G2118147"
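
Since the question asks for the values in a new df, here is one possible base R sketch that extracts the IDs with regmatches and regexpr instead of sub, then wraps them in a fresh data frame. The column name searchnum and the shortened two-row input are assumptions made for illustration.

```r
# Shortened, illustrative version of the input data
dfx <- data.frame(
  V1 = c("(A) <a data-searchnum=G2107576  data-procuid=G1620638>Overview",
         "(B) <a data-searchnum=G2232632  data-procuid=G2210282>Operation"),
  stringsAsFactors = FALSE
)

# Lookbehind (needs perl = TRUE) matches only the ID after "data-searchnum="
m <- regexpr("(?<=data-searchnum=)\\S+", dfx$V1, perl = TRUE)

# regmatches() returns only the matched substrings; rows without a match are dropped
ids <- regmatches(dfx$V1, m)

# "searchnum" is a hypothetical column name
df_new <- data.frame(searchnum = ids, stringsAsFactors = FALSE)
print(df_new)
```

Note that regmatches silently skips rows without a match, so df_new can end up with fewer rows than dfx.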
