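grep/pdfgrep with Perl regex: matching across multiple lines

I want to check whether several different words occur in a text. The words can be anywhere in the text, on different lines. I cannot find a grep/pdfgrep Perl regular expression that does this.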

My text with foo with other text and
many many
other lines
in the same text
for bar and i don't know

My pdfgrep command (the same applies to grep):

pdfgrep -i -P "foo.*bar" mypdf.pdf

This does not work because the words are on different lines. I have already tried many alternatives to .* that I found elsewhere:

(?s).*
([\s\S]*)
(.*?)

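And many others. Is there a way to get grep/pdfgrep to find this?

I want to check whether my PDF file contains all of the search terms.

Edit: the following commands now work for me. Thanks to Pierre François.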

# Find foo AND bar (foo must appear before bar)
pdftotext mypdf.pdf - | tr '\n' ' ' | grep -P 'foo.*?bar'
# Find foo OR bar
pdftotext mypdf.pdf - | tr '\n' ' ' | grep -P 'foo|bar'
# The same commands, but with pdfgrep
# Find foo AND bar (foo must appear before bar)
pdfgrep -i -P ".*" mypdf.pdf | tr '\n' ' ' | grep -P 'foo.*?bar'
# Find foo OR bar
pdfgrep -i -P ".*" mypdf.pdf | tr '\n' ' ' | grep -P 'foo|bar'

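The sed command below works too, but it only finds foo OR bar, not foo AND bar.

If pdftotext is installed, you can use a tool other than grep to apply a regular expression across multiple lines. Try: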

pdftotext mypdf.pdf - | sed -e '/foo/,/bar/p' -e d

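The first command extracts the text from the PDF file to standard output. The second command prints every line from a line containing foo through the next line containing bar, and deletes all other lines from the output.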
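
To see what the sed range does without needing a PDF, here is a minimal sketch that feeds it the sample text from the question. The functions sample_text and foo_to_bar are hypothetical helpers introduced only for this illustration; sample_text stands in for `pdftotext mypdf.pdf -`, and the last line of the sample is an extra line added to show that text after the range is dropped.

```shell
# Hypothetical stand-in for `pdftotext mypdf.pdf -`: the question's sample
# text plus one extra trailing line (an assumption, added for illustration).
sample_text() {
  printf '%s\n' \
    'My text with foo with other text and' \
    'many many' \
    'other lines' \
    'in the same text' \
    "for bar and i don't know" \
    'trailing line'
}

# Print the lines from a "foo" line through the next "bar" line;
# the second expression (d) deletes everything else.
foo_to_bar() {
  sed -e '/foo/,/bar/p' -e d
}

# Prints the first five lines of the sample; 'trailing line' is dropped.
sample_text | foo_to_bar
```

One thing to keep in mind: if no later line contains bar, sed keeps the range open and prints everything from the foo line to the end of the input. Non-empty output therefore shows that foo is present, but not necessarily that bar is.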

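Edit

Another solution, using pdftotext, tr, and grep:

pdftotext mypdf.pdf - | tr '\n' ' ' | grep -P 'foo.*?bar'

The tr command changes every newline character into a space. I use the non-greedy modifier ? in the grep regex, which only works with the -P option, in case you need to match multiple occurrences of the same string separately.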
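
The pattern foo.*?bar only matches when foo comes before bar. If the order should not matter, one possible variant is to join the lines with tr as above and use two lookaheads. This is a sketch, assuming a GNU grep built with -P (PCRE) support; has_foo_and_bar is a hypothetical helper name, and printf stands in for `pdftotext mypdf.pdf -`.

```shell
# Hypothetical helper: succeeds only if both words occur somewhere in the
# input, in either order. Assumes GNU grep with -P support.
has_foo_and_bar() {
  # Join all lines into one, then require both lookaheads to match.
  # -q: no output, only the exit status; -i: ignore case.
  tr '\n' ' ' | grep -qiP '^(?=.*foo)(?=.*bar)'
}

# Stand-in input with bar on an earlier line than foo.
if printf 'bar comes first\nand foo comes later\n' | has_foo_and_bar; then
  echo "both words found"
else
  echo "at least one word missing"
fi
```

With a real file, the call would look like `pdftotext mypdf.pdf - | has_foo_and_bar`. Because the check returns an exit status rather than matched text, it fits scripts that only need a yes/no answer for each PDF.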
