Convert pytesseract.Output.DATAFRAME into bytes or an OCR'ed PDF

Is it possible to take the output of pytesseract.image_to_data() and write it into a PDF file after the fact?

For my OCR pipeline, I need fine-grained access to the OCR'ed data of my PDFs. I get it with this call:

ocr_dataframe = pytesseract.image_to_data(
    tesseract_image,
    output_type=pytesseract.Output.DATAFRAME,
    config=PYTESSERACT_CUSTOM_CONFIG,
)

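Now I want to use pdfplumber to extract some tabular data from the PDF. However, pdfplumber requires one of the following three inputs:

A path to a PDF file
A file object, loaded as bytes
A file-like object, loaded as bytes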

I know I can use pytesseract to convert my original PDF into a searchable PDF (represented as bytes) like this:

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
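
For reference, the bytes returned by image_to_pdf_or_hocr() can be handed to pdfplumber by wrapping them in io.BytesIO, which satisfies the file-like input listed above. This is only a minimal sketch: the helper names as_file_like and tables_from_image are made up for illustration, and it assumes pytesseract, the Tesseract binary, and pdfplumber are installed. It also runs Tesseract on the image separately from any image_to_data() call.

```python
import io


def as_file_like(pdf_bytes):
    """Wrap raw PDF bytes in a seekable, file-like object."""
    return io.BytesIO(pdf_bytes)


def tables_from_image(image_path):
    """OCR an image to a searchable PDF and pull tables from each page.

    Hypothetical helper. The third-party imports are local so that
    as_file_like() stays usable without them installed.
    """
    import pdfplumber
    import pytesseract

    # image_to_pdf_or_hocr returns the searchable PDF as bytes.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")

    # pdfplumber.open accepts a path or a file-like object.
    with pdfplumber.open(as_file_like(pdf_bytes)) as pdf:
        return [page.extract_tables() for page in pdf.pages]
```

Whether extract_tables() finds anything depends on the page layout; an OCR'ed scan without ruling lines may need custom table settings.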

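However, I want to avoid OCR'ing my PDFs twice. Is it possible to combine the output of pytesseract.image_to_data() with the original image and create some kind of bytes representation?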

Any help would be greatly appreciated!

OK, so I'm fairly sure that what I was trying to accomplish is not possible.

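pytesseract.Output.DATAFRAME produces a pandas DataFrame. Nowhere in that data structure is the original image. The output is just rows and columns of text data: no pixels, nothing to render a page from.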
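
To make the "rows and columns" point concrete, here is a rough sketch of rebuilding text lines from that output. It relies on the column names Tesseract's TSV output uses (page_num, block_num, par_num, line_num, word_num, conf, text). The function name lines_from_rows is made up for illustration, and it takes plain dicts, which ocr_dataframe.to_dict("records") would provide.

```python
from collections import defaultdict


def lines_from_rows(rows):
    """Rebuild text lines from image_to_data-style rows (hypothetical helper).

    `rows` is an iterable of dicts keyed by Tesseract's TSV column names.
    """
    grouped = defaultdict(list)
    for row in rows:
        # Structural rows (page, block, paragraph, line) carry conf == -1.
        if float(row["conf"]) < 0:
            continue
        text = str(row["text"]).strip()
        if not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        grouped[key].append((row["word_num"], text))

    # Sort lines by their position keys, and words by word_num within a line.
    return [
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(grouped.items())
    ]
```

Every value here is text or a bounding-box number, so the page image itself still has to come from somewhere else.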

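Instead, I created a class that holds both the original image and the OCR output DataFrame. Here is what the instance initialization looks like: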

import pathlib

import cv2
import pytesseract


class OcrPage:  # placeholder name; the original post shows only the methods
    def __init__(self, temp_image_path):
        self.image_path = pathlib.Path(temp_image_path)
        # cv2.imread expects a string path
        self.image = cv2.imread(str(temp_image_path), cv2.IMREAD_GRAYSCALE)
        self.ocr_dataframe = self.ocr()

    def ocr(self):
        # Preprocess the image in preparation for pytesseract OCR.
        # ocr_preprocess is defined elsewhere in the pipeline.
        tesseract_image = ocr_preprocess(self.image)

        # OCR the image with pytesseract. PYTESSERACT_CUSTOM_CONFIG is a
        # config string defined elsewhere in the pipeline.
        ocr_dataframe = pytesseract.image_to_data(
            tesseract_image,
            output_type=pytesseract.Output.DATAFRAME,
            config=PYTESSERACT_CUSTOM_CONFIG,
        )
        return ocr_dataframe

This may be somewhat memory-hungry, but I wanted to avoid writing a lot of images to disk.
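
If memory does become a problem, one possible variation is to load the pixels lazily and drop them once OCR has run. The sketch below is an assumption on my part rather than what I actually use: LazyOcrPage, load_image, and run_ocr are illustrative names, with load_image standing in for something like a cv2.imread wrapper and run_ocr for the image_to_data() call.

```python
from functools import cached_property
from pathlib import Path


class LazyOcrPage:
    """Hold an image path; compute the image and OCR result on demand."""

    def __init__(self, image_path, load_image, run_ocr):
        self.image_path = Path(image_path)
        self._load_image = load_image  # e.g. a wrapper around cv2.imread
        self._run_ocr = run_ocr        # e.g. a wrapper around image_to_data

    @cached_property
    def image(self):
        # Loaded on first access, then cached on the instance.
        return self._load_image(self.image_path)

    @cached_property
    def ocr_dataframe(self):
        # OCR runs once; later accesses reuse the cached result.
        return self._run_ocr(self.image)

    def release_image(self):
        """Drop the cached pixels; they are reloaded on the next access."""
        self.__dict__.pop("image", None)
```

Calling release_image() after the first access to ocr_dataframe keeps the OCR result in memory while freeing the pixel array, at the cost of re-reading the file if the image is needed again.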
