Regex search to extract the BibTeX title string in R

I have a data frame in R with a column named Title that holds a BibTeX entry. It looks like this:

={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  
author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  
journal={Journal of the ACM (JACM)},\n  
volume={38},\n  
number={3},\n  
pages={690--728},\n  
year={1991},\n  
publisher={ACM New York, NY, USA}\n}

I only need to extract the title of the BibTeX citation, which is the string after ={ and before the next }.

In this example, the output should be:

Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems

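I need to do this for every row in the data frame. Not all rows have the same number of BibTeX fields, so the regex must ignore everything after the first }.

I am currently trying sub(".*\\={\\}\\s*(.+?)\\s*\\|.*$", "\\1", data$Title), and I get the TRE pattern compilation error 'Invalid contents of {}'.

How can I do this?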

A possible solution, using stringr::str_extract and lookarounds:

library(stringr)
# s is the BibTeX string, e.g. one element of data$Title
str_extract(s, "(?<=\\{)[^}]+(?=\\})")
#> [1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"

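Note that the { character is a special regex metacharacter and needs to be escaped. Inside an R string literal the backslash itself must be doubled, hence \\{.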
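If you would rather avoid the stringr dependency, the same lookaround idea can probably be expressed in base R. The following is only a sketch: it assumes perl = TRUE (PCRE) so that lookbehind is available, and the data frame bib, its sample rows, and the vector titles are made up for illustration.

```r
# Hypothetical sample data; only the column name Title comes from the question.
bib <- data.frame(
  Title = c(
    "={First title},\n  author={A, B},\n  year={1991}\n}",
    "={Second title},\n  year={2001}\n}",
    "no braces in this row"
  ),
  stringsAsFactors = FALSE
)

# Same lookaround pattern as above; perl = TRUE enables (?<=...) and (?=...).
m <- regexpr("(?<=\\{)[^}]+(?=\\})", bib$Title, perl = TRUE)

# regmatches() returns only the matches, so fill a vector of NAs
# to keep the result aligned with the rows of the data frame.
titles <- rep(NA_character_, nrow(bib))
titles[m > 0] <- regmatches(bib$Title, m)

print(titles)
```

Rows without any {...} part end up as NA instead of being silently dropped, which keeps titles the same length as the data frame.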

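To match any string between curly braces, you need a pattern based on a negated character class (negated bracket expression), such as \{([^{}]*)}.

You can use

sub(".*?=\\{([^{}]*)}.*", "\\1", df$Title)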

R demo:

Title <- c("={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  journal={Journal of the ACM (JACM)},\n  volume={38},\n  number={3},\n  pages={690--728},\n  year={1991},\n  publisher={ACM New York, NY, USA}\n}")
sub(".*?=\\{([^{}]*)}.*", "\\1", Title)

Output:

[1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"

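Pattern details:

.*? - any zero or more characters, as few as possible
=\{ - a ={ substring
([^{}]*) - Group 1 (\1): any zero or more characters other than curly braces
} - a } char (it is not special here, so it does not need escaping)
.* - the rest of the string.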
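To run this over a whole column, a sketch along the following lines might help. One thing to keep in mind is that sub() returns its input unchanged when the pattern does not match, so a row without a ={...} part would keep its full text. The data frame df here, its sample rows, and the new column CleanTitle are assumptions made for illustration, not part of the question.

```r
# Hypothetical sample data with a varying number of BibTeX fields.
df <- data.frame(
  Title = c(
    "={Title one},\n  author={X, Y},\n  year={1991}\n}",
    "={Title two}\n}",
    "no braces in this row"
  ),
  stringsAsFactors = FALSE
)

pattern <- ".*?=\\{([^{}]*)}.*"

# Check which rows match first, because sub() leaves
# non-matching strings untouched rather than returning NA.
matched <- grepl(pattern, df$Title)

# Keep Group 1 for matching rows and mark the others as missing.
df$CleanTitle <- ifelse(matched, sub(pattern, "\\1", df$Title), NA_character_)

print(df$CleanTitle)
```

Both calls use R's default (TRE) regex engine, as in the demo above, where . also matches line breaks. With perl = TRUE the pattern would need an (?s) prefix for .* to consume the trailing lines.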
