BASH script to check whether PDFs have been OCR'd

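I really don't know where to start.

I have a Linux server with more than 8000 PDFs, and I need to know which of them have been OCR'd and which have not.

I was thinking of checking the PDFs with some kind of script that calls XPDF, but to be honest I'm not sure whether that is even possible.

Thanks in advance for your help.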

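Make sure the command-line tool pdffonts is installed. (There are two versions of it: one ships as part of xpdf-utils, the other as part of poppler-utils.)

PDFs that consist only of scanned pages will not use any fonts (neither embedded nor non-embedded).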

The command line

pdffonts /path/to/scanned.pdf

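will then show no font information for that file.

This may already be enough for you to sort your files into two different sets.

If your PDFs mix scanned-only pages with "normal" pages (or with pages that were scanned and then OCR'd), you will have to extend and refine the simple approach above. See man pdffonts or pdffonts --help for more information.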
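One possible refinement for mixed documents is to query each page separately. The sketch below assumes pdfinfo is installed alongside pdffonts and relies on the -f and -l (first/last page) options of pdffonts; the function name pages_without_fonts is made up for illustration, and a page that reports no fonts is only likely, not guaranteed, to be an image-only scan.

```shell
#!/bin/bash
# Sketch: list the pages of a PDF for which pdffonts reports no fonts.
# Assumes pdfinfo and pdffonts are both installed.
# pages_without_fonts is a hypothetical helper name, not part of either tool.

pages_without_fonts() {
    local file="$1" pages page fonts
    # pdfinfo prints a line such as "Pages:          12"
    pages=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ {print $2}')
    [ -n "$pages" ] || return 1
    page=1
    while [ "$page" -le "$pages" ]; do
        # -f/-l restrict pdffonts to one page; the first two lines are the header
        fonts=$(pdffonts -f "$page" -l "$page" "$file" 2>/dev/null | tail -n +3)
        [ -n "$fonts" ] || echo "$page"
        page=$((page + 1))
    done
}

# Usage: pages_without_fonts /path/to/mixed.pdf
```

This starts pdffonts once per page, so it is much slower than a single call per file and is probably best reserved for files that the simple check flags as ambiguous.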

The problem with pdffonts is that sometimes it returns nothing, like this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

and sometimes it returns this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
[none]                               Type 3            yes no  no     297  0
[none]                               Type 3            yes no  no     341  0
[none]                               Type 3            yes no  no     381  0
[none]                               Type 3            yes no  no     394  0
[none]                               Type 3            yes no  no     428  0
[none]                               Type 3            yes no  no     441  0
[none]                               Type 3            yes no  no     451  0
[none]                               Type 3            yes no  no     480  0
[none]                               Type 3            yes no  no     492  0
[none]                               Type 3            yes no  no     510  0
[none]                               Type 3            yes no  no     524  0
[none]                               Type 3            yes no  no     560  0
[none]                               Type 3            yes no  no     573  0
[none]                               Type 3            yes no  no     584  0
[none]                               Type 3            yes no  no     593  0
[none]                               Type 3            yes no  no     601  0
[none]                               Type 3            yes no  no     644  0

With that in mind, let's write a small one-liner that lists all the font names in a PDF:

pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

If your PDF has not been OCR'd, this outputs either nothing or [none].

If you want it to run faster, use the -l flag to analyse only, say, the first 5 pages:

pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

Now wrap it in a bash script, for example is-pdf-ocred.sh:

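#!/bin/bash
# Usage: is-pdf-ocred.sh FILE.pdf
# Collect the unique font names from the first 5 pages (the header is 2 lines).
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only unnamed ([none]) entries, is treated as "not OCR'ed".
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi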
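As a variation, the same test can be written as a filter that reads pdffonts output on stdin, which makes it easy to try against saved output such as the two listings shown earlier. This is only a sketch: has_real_fonts is a hypothetical helper name, and it assumes the two-line header and the [none] name column seen in those listings.

```shell
#!/bin/bash
# Sketch: classify pdffonts output read from stdin.
# has_real_fonts is a hypothetical helper name, not part of pdffonts.

# Exit status 0 if at least one font entry has a name other than "[none]",
# exit status 1 otherwise (no entries, or only "[none]" entries).
has_real_fonts() {
    # NR > 2 skips the two header lines; $1 is the font name column
    awk 'NR > 2 && NF > 0 && $1 != "[none]" { found = 1 } END { exit !found }'
}

# Usage:
#   if pdffonts -l 5 my-doc.pdf | has_real_fonts; then
#       echo "my-doc.pdf is OCR'ed."
#   else
#       echo "NOT OCR'ed: my-doc.pdf"
#   fi
```

Returning an exit status rather than printing a message lets the caller decide what to do with each file, for example moving it into one of two directories.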

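Finally, we want to be able to search for PDF files. The find command doesn't know about aliases or functions defined in .bashrc, so we need to give it the path to the script. Run it in the directory of your choice, like this: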

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;

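I'm assuming the PDF files end in .pdf, although that is not always a safe assumption. You will probably want to pipe the output to less or write it to a text file: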

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt
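If the .pdf extension cannot be trusted, one option is to look at the file contents instead. The sketch below checks for the "%PDF-" signature at the very start of the file; is_pdf is a hypothetical helper name, and the check assumes the signature sits at byte 0, which some unusual but still readable PDFs may not satisfy.

```shell
#!/bin/bash
# Sketch: detect PDFs by content rather than by file name.
# is_pdf is a hypothetical helper name.

# Exit status 0 if the file starts with the PDF signature "%PDF-".
is_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Usage (bash; the script path is a placeholder):
#   find . -type f -print0 | while IFS= read -r -d '' f; do
#       is_pdf "$f" && /path/to/my/script/is-pdf-ocred.sh "$f"
#   done
```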

With the -l 5 flag, I was able to get through about 200 PDFs in a little over 10 seconds.
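For a collection of several thousand files, it may also help to match the extension case-insensitively and to run several checks at once. The following is a sketch rather than a measured improvement: find_pdfs is a hypothetical helper name, the worker count of 4 is an arbitrary choice, and -iname, -print0, and xargs -0/-P are GNU/BSD extensions rather than POSIX.

```shell
#!/bin/bash
# Sketch: collect PDFs regardless of extension case, then check them in parallel.
# find_pdfs is a hypothetical helper name.

# Print NUL-separated paths of files under $1 whose names end in .pdf (any case).
find_pdfs() {
    find "$1" -type f -iname '*.pdf' -print0
}

# Usage (4 parallel workers; the script path is a placeholder):
#   find_pdfs . | xargs -0 -n 1 -P 4 /path/to/my/script/is-pdf-ocred.sh > pdfs.txt
```

With parallel workers the result lines are no longer in directory order, so sorting pdfs.txt afterwards may be useful.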
