I have a data frame in R in which one column, named Title, holds a BibTeX entry that looks like this:
={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n
author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n
journal={Journal of the ACM (JACM)},\n
volume={38},\n
number={3},\n
pages={690--728},\n
year={1991},\n
publisher={ACM New York, NY, USA}\n}
I only need to extract the title of the BibTeX citation, which is the string after ={ and before the next }:
Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems
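I need to do this for every row in the data frame. Not all rows have the same number of BibTeX fields, so the regex must ignore everything after the first }. I am currently trying
sub(".*\\={\\}\\s*(.+?)\\s*\\|.*$", "\\1", data$Title)
and I get TRE pattern compilation error 'Invalid contents of {}'.
How can I do this?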
One possible solution, using stringr::str_extract and lookarounds:
library(stringr)
# s: a character vector holding the BibTeX entries (for example, df$Title)
str_extract(s, "(?<=\\{)[^}]+(?=\\})")
#> [1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"
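Because str_extract() is vectorised and returns only the first match in each string, the same call can be applied to the whole column. A minimal sketch, assuming a hypothetical two-row data frame df whose entries and the title_only column name are made up for illustration:

```r
library(stringr)

# Hypothetical data: the second entry has fewer BibTeX fields than the first.
df <- data.frame(
  Title = c(
    "={Proofs that yield nothing but their validity},\n author={Goldreich, Oded},\n year={1991}\n}",
    "={A shorter entry},\n year={2000}\n}"
  ),
  stringsAsFactors = FALSE
)

# Only the first {...} group in each string is returned,
# so everything after the first closing brace is ignored.
df$title_only <- str_extract(df$Title, "(?<=\\{)[^}]+(?=\\})")
df$title_only
```

Rows that contain no braces at all would come back as NA, which makes them easy to spot afterwards.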
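Note that the { character is a special regex metacharacter and needs to be escaped.
To match any string between curly braces, you need a pattern based on a negated character class (negated bracket expression), such as \{([^{}]*)}.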
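The effect of escaping can be checked directly in base R. The sketch below uses a made-up string x and wraps the unescaped pattern in tryCatch() so that the compilation failure is captured instead of stopping the script; writing the brace as [{] is shown as an alternative to the backslash escape.

```r
# Hypothetical input, for illustration only
x <- "={Some title}, year={1991}"

# An unescaped "{" is read as the start of an interval quantifier by the
# default (TRE) engine, so this pattern fails to compile.
bad <- tryCatch(
  suppressWarnings(sub("={([^{}]*)}", "\\1", x)),
  error = function(e) "compile error"
)

# Escaped brace: "\\{" in the R string literal is the regex \{
ok <- sub(".*?=\\{([^{}]*)}.*", "\\1", x)

# Alternative: a bracket expression makes the brace literal without a backslash
ok2 <- sub(".*?=[{]([^{}]*)}.*", "\\1", x)

print(c(bad, ok, ok2))
```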
You can use
sub(".*?=\\{([^{}]*)}.*", "\\1", df$Title)
See the R demo:
Title <- c("={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n journal={Journal of the ACM (JACM)},\n volume={38},\n number={3},\n pages={690--728},\n year={1991},\n publisher={ACM New York, NY, USA}\n}")
sub(".*?=\\{([^{}]*)}.*", "\\1", Title)
Output:
[1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"
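One thing to keep in mind with sub(): when the pattern does not match, the input string is returned unchanged. A minimal sketch of guarding against that, assuming a hypothetical data frame df with one malformed row (the data and the title_only column name are illustrative, not from the question):

```r
# Hypothetical data: the second row has no ={...} field at all.
df <- data.frame(
  Title = c("={First title},\n year={1991}\n}", "no braces here"),
  stringsAsFactors = FALSE
)

pat <- ".*?=\\{([^{}]*)}.*"

# sub() leaves non-matching strings as they are ...
extracted <- sub(pat, "\\1", df$Title)

# ... so rows where the pattern does not match are set to NA explicitly.
df$title_only <- ifelse(grepl(pat, df$Title), extracted, NA_character_)
df$title_only
```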
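Pattern details:
.*? - any zero or more characters, as few as possible
=\{ - a ={ substring (written as =\\{ inside an R string literal)
([^{}]*) - Group 1 (\1): any zero or more characters other than curly braces
} - a } char (it is not special and does not need escaping)
.* - the rest of the string.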