How to extract specific text from payslip images using pytesseract



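I'm new to Tesseract OCR. I have a batch of payslip images and want to extract the date from each one automatically. How can I do this?

As a first step, I tried to extract the text from a single payslip, but it raises an error: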

import cv2
import pytesseract
img = cv2.imread(r'E:/Receipts/Receipts/0a0ebd53.jpeg')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
TESSDATA_PREFIX='C:/Program Files/Tesseract-OCR/tessdata'
print(pytesseract.image_to_string(img))
# OR explicit beforehand converting
print(pytesseract.image_to_string(Image.fromarray(img))) 

Error:

200         }
201 
--> 202         run_tesseract(**kwargs)
203         filename = kwargs['output_filename_base'] + os.extsep + extension
204         with open(filename, 'rb') as output_file:
~\Anaconda3\lib\site-packages\pytesseract\pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice)
176 
177     if status_code:
--> 178         raise TesseractError(status_code, get_errors(error_string))
179 
180     return True
TesseractError: (1, 'Error opening data file C:\Program Files (x86)\Tesseract-OCR\eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')

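How can I fix this error? I would also appreciate a suggestion for a deep learning model.

Read the image with the PIL library, then pass the image object to image_to_string(img_obj), as shown below.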

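from PIL import Image
import pytesseract
# Point pytesseract at the Tesseract executable; adjust this to your install location.
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/TesseractOCR/tesseract.exe"
# Path of the image to read (here, the file from the question).
image_path = r"E:/Receipts/Receipts/0a0ebd53.jpeg"
image_obj = Image.open(image_path)
print(pytesseract.image_to_string(image_obj))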
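
The error message itself asks for the TESSDATA_PREFIX environment variable. In the question, the line TESSDATA_PREFIX='...' only creates a Python variable, which the tesseract process never sees. A minimal sketch of setting it as a real environment variable before calling pytesseract follows. The directory is an assumption taken from the question and must be the folder that actually contains eng.traineddata on your machine; `configure_tessdata` is a hypothetical helper name. Whether the variable should point at the tessdata folder itself or at its parent has differed between Tesseract versions, so follow whatever your error message asks for.

```python
import os

# Assumed location of the folder holding eng.traineddata; adjust to your install.
TESSDATA_DIR = r"C:/Program Files/Tesseract-OCR/tessdata"


def configure_tessdata(tessdata_dir):
    """Export TESSDATA_PREFIX so that child processes (tesseract) inherit it."""
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
    return os.environ["TESSDATA_PREFIX"]


# Call this once, before any pytesseract.image_to_string(...) call.
configure_tessdata(TESSDATA_DIR)
```

Note that the traceback shows Tesseract looking under C:\Program Files (x86)\Tesseract-OCR, so it is also worth checking which of the two install folders really holds the language data.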

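Once OCR returns text, the dates asked about in the question can be picked out with a regular expression. This is only a rough sketch: the pattern is an assumption about what the dates look like (numeric day/month/year or year-month-day with /, . or - separators), the sample text is made up, and `extract_dates` and `dates_from_image` are hypothetical helper names. Real payslips may need a different pattern.

```python
import re

# Assumed date formats: 31/03/2019, 31-03-19, 31.03.2019 or 2019-03-31.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{4}[/.-]\d{1,2}[/.-]\d{1,2})\b"
)


def extract_dates(text):
    """Return every date-like substring found in a block of OCR text."""
    return DATE_PATTERN.findall(text)


def dates_from_image(image_path):
    """Run OCR on one image file and return the date-like strings in it."""
    # Imported here so extract_dates can be tried without the OCR dependencies.
    from PIL import Image
    import pytesseract

    return extract_dates(pytesseract.image_to_string(Image.open(image_path)))


# Made-up sample standing in for OCR output.
sample = "Pay period ending 31/03/2019\nIssued 2019-04-05\nNet pay 1,234.50"
print(extract_dates(sample))  # ['31/03/2019', '2019-04-05']
```

For a folder of images, calling dates_from_image on each file path in a loop should be enough to start with. OCR noise (for example a 0 read as O) will cause misses, so spot-check the results.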