Regex search to extract the title string from a BibTeX entry in R



I have a data frame in R in which one column, named Title, holds a BibTeX entry that looks like this:

={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  
author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  
journal={Journal of the ACM (JACM)},\n  
volume={38},\n  
number={3},\n  
pages={690--728},\n  
year={1991},\n  
publisher={ACM New York, NY, USA}\n}

I only need to extract the title of the BibTeX citation, which is the string after ={ and before the next }.

In this example, the output should be:
Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems

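I need to do this for every row in the data frame. Not all rows have the same number of BibTeX fields, so the regex must ignore everything after the first }.

I am currently trying sub(".*\\={\\}\\s*(.+?)\\s*\\|.*$", "\\1", data$Title) and getting TRE pattern compilation error 'Invalid contents of {}'.

How can I do this?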

One possible solution uses stringr::str_extract with lookarounds:

library(stringr)
str_extract(s, "(?<=\\{)[^}]+(?=\\})")
#> [1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"

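Note that { is a special regex metacharacter and needs to be escaped. Inside an R string literal the backslash itself must be doubled, so the escape is written \\{.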
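If stringr is not available, the same lookaround idea can probably be reproduced in base R. The following is only a sketch: the sample string s is a shortened, hypothetical stand-in for one value of the Title column, and perl = TRUE is needed because the default TRE engine does not support lookbehind.

```r
# Hypothetical sample input standing in for one value of the Title column.
s <- "={Proofs that yield nothing},\n  author={Goldreich, Oded},\n  year={1991}\n}"

# perl = TRUE switches to PCRE, which supports lookbehind and lookahead.
# regexpr() locates only the first match, i.e. the first {...} group.
m <- regexpr("(?<=\\{)[^}]+(?=\\})", s, perl = TRUE)

# regmatches() pulls out the matched substring.
title <- regmatches(s, m)
print(title)
```

Be aware that regmatches() drops elements without a match, so on a whole column the result may be shorter than the input; str_extract returns NA for those elements instead.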

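To match any string between curly braces, you need a pattern based on a negated character class (negated bracket expression), such as \{([^{}]*)}.

You can use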

sub(".*?=\\{([^{}]*)}.*", "\\1", df$Title)

Here is an R demo:

Title <- c("={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  journal={Journal of the ACM (JACM)},\n  volume={38},\n  number={3},\n  pages={690--728},\n  year={1991},\n  publisher={ACM New York, NY, USA}\n}")
sub(".*?=\\{([^{}]*)}.*", "\\1", Title)

Output:

[1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"

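Pattern details:

  • .*? - any zero or more characters, as few as possible
  • =\{ - a ={ substring
  • ([^{}]*) - Group 1 (referred to as \\1 in the replacement string): any zero or more characters other than curly braces
  • } - a } char (it is not special here and does not need escaping)
  • .* - the rest of the string.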
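One caveat when applying this to a whole column: sub() returns its input unchanged when the pattern does not match, so a row without a ={...} part would keep the full raw entry. A minimal sketch of one way to guard against that follows; the data frame df, its sample rows, and the TitleOnly column name are all hypothetical.

```r
# Hypothetical data frame: rows carry different numbers of BibTeX fields,
# and the last row has no "={...}" part at all.
df <- data.frame(
  Title = c(
    "={First title},\n  author={A, B},\n  year={1991}\n}",
    "={Second title}\n}",
    "no braces here"
  ),
  stringsAsFactors = FALSE
)

pattern <- ".*?=\\{([^{}]*)}.*"

# sub() leaves non-matching strings untouched, so flag the matching rows
# first and store NA for the rest instead of the raw entry.
matched <- grepl(pattern, df$Title)
df$TitleOnly <- ifelse(matched, sub(pattern, "\\1", df$Title), NA_character_)
print(df$TitleOnly)
```

Whether NA is the right placeholder depends on how the titles are used afterwards; an empty string would work equally well with the same ifelse() call.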
