I have the following text and need to extract the words that come before and after certain keywords.
Example:
sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, nselect languagenn, home > corporate social responsibility > nsocial reportn > quality assurancen, nensuring provision of safe products, nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)
The actual output is:
[[1]]
[1] "engineering plastics"
[[2]]
[1] "iso 9001"
[[3]]
[1] "office automation"
Desired output:
[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of
Basically, I need to extract the text before and after the specific words I mention.
Here is one idea:
sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, nselect languagenn, home > corporate social responsibility > nsocial reportn > quality assurancen, nensuring provision of safe products, nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
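# Before the keyword: up to 10 words, each a run of non-spaces followed by a space.
# After the keyword: up to 10 words, each a space followed by a run of non-spaces.
pattern <- stri_paste("([^ ]+ ){0,10}", words, "( [^ ]+){0,10}")
stri_extract_all_regex(sometext, pattern, case_insensitive=TRUE)
Explanation: the simple regex I put in front of each word you want is:
"([^ ]+ ){0,10}"
This means:
- anything except a space, one or more times
- then a space
- all of that up to ten times
The group after the word, "( [^ ]+){0,10}", is the mirror image: a space first, then the run of non-spaces. The order has to be flipped there because the match for the word itself ends right before a space, so a group that starts with a non-space character would match zero times.
This is very simple and naive (for example, it treats every '&' or '>' as a word), but it works.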
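The windowed-regex idea can also be wrapped in a small helper with a configurable window size. The sketch below is only an illustration and makes a few assumptions: it uses base R (`regexpr` and `regmatches` with `perl = TRUE`) rather than stringi so that it runs without extra packages, the name `extract_context` and its parameter `n` are made up for this example, and it uses `\S`/`\s` instead of `[^ ]` so that tabs and newlines also count as word separators.

```r
# Hypothetical helper (not part of stringi or base R): returns up to `n` words
# on each side of the first occurrence of each keyword.
extract_context <- function(text, words, n = 5) {
  # (?i) makes the match case-insensitive.
  # \Q...\E quotes the keyword so characters such as '.' or '(' match literally.
  patterns <- sprintf("(?i)(\\S+\\s+){0,%d}\\Q%s\\E(\\s+\\S+){0,%d}", n, words, n)
  # One search per keyword; regmatches() gives character(0) when nothing matches.
  lapply(patterns, function(p) regmatches(text, regexpr(p, text, perl = TRUE)))
}

txt <- "the unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share"
res <- extract_context(txt, c("engineering plastics", "iso 9001"), n = 3)
print(res[[1]])
# [1] "globally expanding its engineering plastics centered on polycarbonate"
print(length(res[[2]]))  # 0, because "iso 9001" does not occur in txt
```

Like the original pattern, this still counts punctuation tokens such as '&' or '>' as words; it only returns the first hit per keyword, so `gregexpr` would be needed to collect every occurrence.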