PyTesseract error with multi-page TIFF images



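When I read a 15-page multi-page TIFF document (black letters/words on a white background), PyTesseract throws an "OSError: -9" error at the step where I loop over the pages and convert each one to a string.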

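I am using the pytesseract package and pyocr.builders. Single pages seem to work fine, but I believe the error occurs when the image is not in RGB mode and the program converts it to RGB.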

img = Image.open(r'usersaitext.tiff')
img.load()
txt = ""
for frame in range(0, img.n_frames):
    img.seek(frame)
    txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())

The expected output is the text of all 15 pages displayed in the Jupyter window.

Error message

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-e59bdf3b773c> in <module>
      2 for frame in range(0, img.n_frames):
      3     img.seek(frame)
----> 4     txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())
      5 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyocr\tesseract.py in image_to_string(image, lang, builder)
    357     with tempfile.TemporaryDirectory() as tmpdir:
    358         if image.mode != "RGB":
--> 359             image = image.convert("RGB")
    360         image.save(os.path.join(tmpdir, "input.bmp"))
    361         (status, errors) = run_tesseract("input.bmp", "output", cwd=tmpdir,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\Image.py in convert(self, mode, matrix, dither, palette, colors)
    932         """
    933 
--> 934         self.load()
    935 
    936         if not mode and self.mode == "P":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in load(self)
   1097     def load(self):
   1098         if self.use_load_libtiff:
-> 1099             return self._load_libtiff()
   1100         return super(TiffImageFile, self).load()
   1101 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in _load_libtiff(self)
   1189 
   1190         if err < 0:
-> 1191             raise IOError(err)
   1192 
   1193         return Image.Image.load(self)

OSError: -9
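
The traceback ends inside Pillow's TiffImagePlugin (the `load()` call triggered by `convert("RGB")`), not inside Tesseract, so one way to narrow the problem down is to decode every frame on its own with no OCR involved. The following is only a minimal sketch under that assumption: `find_bad_frames` is a hypothetical helper name, not part of Pillow or pyocr, and it only relies on the image object having `n_frames`, `seek()`, `load()` and `mode`, as a Pillow image opened from a multi-page TIFF does.

```python
def find_bad_frames(img):
    """Try to decode every frame of a multi-frame image and report failures.

    `img` is assumed to behave like a Pillow image opened from a
    multi-page TIFF (it needs n_frames, seek(), load() and mode).
    Returns a list of (frame_index, mode, error_or_None) tuples.
    """
    report = []
    for frame in range(img.n_frames):
        try:
            img.seek(frame)
            img.load()  # forces decoding, the step that raises in the traceback
            report.append((frame, img.mode, None))
        except (OSError, EOFError) as exc:
            # The mode is not reported for a frame that failed to decode.
            report.append((frame, None, exc))
    return report


# Possible usage, assuming Pillow is installed and `path` points at the TIFF:
#     from PIL import Image
#     with Image.open(path) as img:
#         for frame, mode, error in find_bad_frames(img):
#             print(frame, mode, error)
```

If every frame decodes cleanly here, the failure is more likely tied to how the frames are passed to the OCR call than to the file itself; if a specific frame fails, that page can be inspected or re-saved separately.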

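For a question like this, you should provide a minimal reproducible example, because some of the code has been left out. You should also provide a test image. You cannot attach a multi-page TIFF here, though, so a link to one would be good.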

I was able to find this test image from this question. It is a 10-page TIFF.

Here is a solution using pyocr:

from PIL import Image
import pyocr
import pyocr.builders

# Use the first OCR tool that pyocr finds
tools = pyocr.get_available_tools()
tool = tools[0]

image = Image.open('multipage_tiff_example.tif')
# set Page Segmentation Mode to 6
# (i.e. assume a single uniform block of text)
builder = pyocr.builders.TextBuilder(tesseract_layout=6)
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)
    # append each page's text instead of overwriting the previous page
    txt += tool.image_to_string(image, builder=builder) + '\n'
print(txt)

Here is a solution using pytesseract:

from PIL import Image
import pytesseract

# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
image = Image.open('multipage_tiff_example.tif')
# set Page Segmentation Mode to 6
# (i.e. assume a single uniform block of text)
config = "--psm 6"
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)
    # add a newline so the text of consecutive pages does not run together
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'

print(txt)

Both give the following output:

Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page 2
Multipage
TIFF
Example
Page 3
Multipage
TIFF
Example
Page 4
Multipage
TIFF
Example
Page5
Multipage
TIFF
Example
Page 6
Multipage
TIFF
Example
Page /
Multipage
TIFF
Example
Page 8
Multipage
TIFF
Example
Page 9
Multipage
TIFF
Example
Page 10
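
The output above contains a couple of misreads ("Page5", "Page /"), and a single concatenated string makes it hard to tell which page they came from. As a sketch only, the loop from both solutions could be wrapped so that it returns one string per page. `ocr_pages` is a hypothetical helper name, and the `ocr` argument is assumed to be any callable that takes an image and returns text, such as `pytesseract.image_to_string`. Copying each frame before OCR hands the OCR function a standalone image; this is a design choice in the sketch, not a confirmed fix for the `OSError: -9` in the question.

```python
def ocr_pages(img, ocr):
    """Run `ocr` on every frame of a multi-page image.

    `img` is assumed to behave like a Pillow multi-frame image
    (it needs n_frames, seek() and copy()).
    `ocr` is any callable that takes an image and returns a string.
    Returns a list with one string per page, in page order.
    """
    pages = []
    for frame in range(img.n_frames):
        img.seek(frame)
        page = img.copy()  # standalone copy of the current frame
        pages.append(ocr(page))
    return pages


# Possible usage, assuming Pillow and pytesseract are installed:
#     from PIL import Image
#     import pytesseract
#     image = Image.open('multipage_tiff_example.tif')
#     pages = ocr_pages(image, pytesseract.image_to_string)
#     for number, text in enumerate(pages, start=1):
#         print(number, repr(text))
```

Joining the list with `'\n'.join(pages)` gives back a single combined string like the one printed by the two solutions.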
