Replace img tags with the text of the corresponding OCR tag



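I have a docx file and I want to extract all the text it contains. The file includes many images; thanks to Tika, I can extract both the text of the document and the text inside the images.

What I need is to replace each image tag with its corresponding text.

I'm doing this with Python and Beautiful Soup.

I'm leaving the XML file here in case someone can help me.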


<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-09-24T10:14:00Z" />
<meta name="extended-properties:DocSecurity" content="4" />
<meta name="extended-properties:AppVersion" content="16.0000" />
<meta name="meta:paragraph-count" content="18" />
<meta name="Word-Count" content="1403" />
<meta name="meta:line-count" content="66" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Name" content="General" />
<meta name="Template" content="Normal" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Extended_MSFT_Method" content="Automatic" />
<meta name="Paragraph-Count" content="18" />
<meta name="meta:character-count-with-spaces" content="9386" />
<meta name="dc:title" content="Introduction to Blob storage - Object storage in Azure | Microsoft Docs" />
<meta name="modified" content="2018-09-24T10:14:00Z" />
<meta name="meta:author" content="Aya Kamel" />
<meta name="meta:creation-date" content="2018-09-24T10:14:00Z" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="Creation-Date" content="2018-09-24T10:14:00Z" />
<meta name="Character-Count-With-Spaces" content="9386" />
<meta name="Last-Author" content="Tulasi Menon" />
<meta name="Character Count" content="8001" />
<meta name="Page-Count" content="10" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_SiteId" content="72f988bf-86f1-41af-91ab-2d7cd011db47" />
<meta name="Application-Version" content="16.0000" />
<meta name="extended-properties:Template" content="Normal" />
<meta name="custom:Sensitivity" content="General" />
<meta name="Author" content="Aya Kamel" />
<meta name="publisher" content="" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Owner" content="aykame@microsoft.com" />
<meta name="meta:page-count" content="10" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Enabled" content="True" />
<meta name="cp:revision" content="2" />
<meta name="meta:word-count" content="1403" />
<meta name="dc:creator" content="Aya Kamel" />
<meta name="extended-properties:Company" content="" />
<meta name="dcterms:created" content="2018-09-24T10:14:00Z" />
<meta name="dcterms:modified" content="2018-09-24T10:14:00Z" />
<meta name="Last-Modified" content="2018-09-24T10:14:00Z" />
<meta name="Last-Save-Date" content="2018-09-24T10:14:00Z" />
<meta name="meta:character-count" content="8001" />
<meta name="Line-Count" content="66" />
<meta name="meta:save-date" content="2018-09-24T10:14:00Z" />
<meta name="Application-Name" content="Microsoft Office Word" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="creator" content="Aya Kamel" />
<meta name="meta:last-author" content="Tulasi Menon" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_SetDate" content="2018-09-23T15:37:34.0264484Z" />
<meta name="xmpTPg:NPages" content="10" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Application" content="Microsoft Azure Information Protection" />
<meta name="Revision-Number" content="2" />
<meta name="extended-properties:DocSecurityString" content="ReadOnlyEnforced" />
<meta name="dc:publisher" content="" />
<title>Introduction to Blob storage - Object storage in Azure | Microsoft Docs</title>
</head>
<body><p><a name="_GoBack" />Manage Azure Blob Storage resources with Storage Explorer</p>
<p> </p>
<h1>Overview</h1>
<p><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/storage-dotnet-how-to-use-blobs">Azure Blob Storage</a> is a service for storing large amounts of unstructured data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. In this article, you'll learn how to use Storage Explorer to work with blob containers and blobs.</p>
<h1>Prerequisites</h1>
<p>To complete the steps in this article, you'll need the following:</p>
<p><a href="http://www.storageexplorer.com/">Download and install Storage Explorer</a></p>
<p>Connect to a Azure storage account or service</p>
<h1>Create a blob container</h1>
<p>All blobs must reside in a blob container, which is simply a logical grouping of blobs. An account can contain an unlimited number of containers, and each container can store an unlimited number of blobs.</p>
<p>The following steps illustrate how to create a blob container within Storage Explorer.</p>
<p>1. Open Storage Explorer.</p>
<p>2. In the left pane, expand the storage account within which you wish to create the blob container.</p>
<p>3. Right-click <b>Blob Containers</b>, and - from the context menu - select <b>Create Blob Container</b>.</p>
<p>4. A text box will appear below the <b>Blob Containers</b> folder. Enter the name for your blob container. See the Create the container and set permissions for information on rules and restrictions on naming blob containers.</p>
<p><img src="embedded:image2.jpg" alt="" /></p>
<p>5. Press <b>Enter</b> when done to create the blob container, or <b>Esc</b> to cancel. Once the blob container has been successfully created, it will be displayed under the <b>Blob Containers</b> folder for the selected storage account.</p>
<p><img src="embedded:image3.jpg" alt="" /></p>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Number of Tables" content="4 Huffman tables" />
<meta name="Compression Type" content="Baseline" />
<meta name="Data Precision" content="8 bits" />
<meta name="Number of Components" content="3" />
<meta name="tiff:ImageLength" content="124" />
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="Thumbnail Height Pixels" content="0" />
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 2 horiz/2 vert" />
<meta name="X Resolution" content="96 dots" />
<meta name="embeddedRelationshipId" content="rId10" />
<meta name="File Size" content="10645 bytes" />
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="File Name" content="apache-tika-10777883143042172609.tmp" />
<meta name="tiff:BitsPerSample" content="8" />
<meta name="Content-Type" content="image/jpeg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.jpeg.JpegParser" />
<meta name="Resolution Units" content="inch" />
<meta name="File Modified Date" content="Mon Jul 11 10:30:38 +00:00 2022" />
<meta name="resourceName" content="image2.jpg" />
<meta name="Image Height" content="124 pixels" />
<meta name="Thumbnail Width Pixels" content="0" />
<meta name="Image Width" content="290 pixels" />
<meta name="X-TIKA:embedded_depth" content="1" />
<meta name="X-TIKA:embedded_resource_path" content="/image2.jpg" />
<meta name="tiff:ImageWidth" content="290" />
<meta name="Y Resolution" content="96 dots" />
<title></title>
</head>
<body><div class="ocr"> 

4B tarcher
4 Bi Blob Containers
Bho,
TM Queues
&gt; By Tables
FS 22016103 1050;



</div>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Number of Tables" content="4 Huffman tables" />
<meta name="Compression Type" content="Baseline" />
<meta name="Data Precision" content="8 bits" />
<meta name="Number of Components" content="3" />
<meta name="tiff:ImageLength" content="124" />
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="Thumbnail Height Pixels" content="0" />
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 2 horiz/2 vert" />
<meta name="X Resolution" content="96 dots" />
<meta name="embeddedRelationshipId" content="rId11" />
<meta name="File Size" content="10533 bytes" />
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="File Name" content="apache-tika-9241358526221461145.tmp" />
<meta name="tiff:BitsPerSample" content="8" />
<meta name="Content-Type" content="image/jpeg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.jpeg.JpegParser" />
<meta name="Resolution Units" content="inch" />
<meta name="File Modified Date" content="Mon Jul 11 10:30:38 +00:00 2022" />
<meta name="resourceName" content="image3.jpg" />
<meta name="Image Height" content="124 pixels" />
<meta name="Thumbnail Width Pixels" content="0" />
<meta name="Image Width" content="290 pixels" />
<meta name="X-TIKA:embedded_depth" content="1" />
<meta name="X-TIKA:embedded_resource_path" content="/image3.jpg" />
<meta name="tiff:ImageWidth" content="290" />
<meta name="Y Resolution" content="96 dots" />
<title></title>
</head>
<body><div class="ocr"> 
IM Queues
</div></body></html>

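In this example, I need to replace the tag:

<img src="embedded:image2.jpg" alt="" />

with the text inside the div tag of class ocr, taken from the embedded document whose head contains this meta tag:

<meta name="resourceName" content="image2.jpg" />

This is the text in that div tag that should replace the img tag: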

<body><div class="ocr"> 

4B tarcher
4 Bi Blob Containers
Bho,
TM Queues
&gt; By Tables
FS 22016103 1050;



</div>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">

In the end I solved the problem; I'm leaving the code here:

from bs4 import BeautifulSoup


def replace_image_for_text(xml):
    """Replace each <img> with the OCR text Tika extracted for that image."""
    soup = BeautifulSoup(xml, 'html.parser')
    headers = soup.find_all("head")
    bodies = soup.find_all("body")
    resource_names = []
    resource_texts = []
    # Tika emits one <html> document per embedded image; pair each head with its body.
    for header, body in zip(headers, bodies):
        resource_name = header.find("meta", {"name": "resourceName"})
        if resource_name:
            resource_names.append(resource_name['content'])
            resource_texts.append(body.get_text())

    for resource_name, resource_text in zip(resource_names, resource_texts):
        img = soup.find("img", {"src": f"embedded:{resource_name}"})
        if img is not None:  # skip OCR results that have no matching <img>
            img.replace_with(resource_text)

    return soup
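
For anyone who wants to try the idea on a small input, here is a minimal sketch of the same approach. It is a variant, not the exact code above: it builds a dictionary from image name to OCR text, and replaces every matching img tag rather than only the first. The SAMPLE string is a made-up, trimmed stand-in for Tika's output, and the names build_ocr_lookup and inline_ocr_text are my own. It assumes each embedded image document carries a resourceName meta tag and a div of class ocr, as in the dump above.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for Tika's concatenated XHTML output.
SAMPLE = """<html><head><title>Doc</title></head>
<body><p>Step 4.</p><p><img src="embedded:image2.jpg" alt="" /></p>
<p>Step 5.</p><p><img src="embedded:image3.jpg" alt="" /></p></body></html>
<html><head><meta name="resourceName" content="image2.jpg" /></head>
<body><div class="ocr">
Blob Containers
</div></body></html>
<html><head><meta name="resourceName" content="image3.jpg" /></head>
<body><div class="ocr">
IM Queues
</div></body></html>"""

PREFIX = "embedded:"


def build_ocr_lookup(soup):
    """Map each embedded image name to the OCR text found for it."""
    lookup = {}
    for doc in soup.find_all("html"):
        meta = doc.find("meta", attrs={"name": "resourceName"})
        ocr = doc.find("div", class_="ocr")
        if meta is not None and ocr is not None:
            lookup[meta["content"]] = ocr.get_text().strip()
    return lookup


def inline_ocr_text(xml):
    """Return the parsed soup with every matching <img> replaced by OCR text."""
    soup = BeautifulSoup(xml, "html.parser")
    lookup = build_ocr_lookup(soup)
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if not src.startswith(PREFIX):
            continue  # not an embedded image reference
        name = src[len(PREFIX):]
        if name in lookup:
            img.replace_with(lookup[name])
    return soup


if __name__ == "__main__":
    result = inline_ocr_text(SAMPLE)
    # The first <body> is the main document; the OCR documents follow it.
    print(result.find("body").get_text())
```

Images with no OCR result are left untouched, so a missing entry does not raise an error. Note that the embedded OCR documents are still present in the returned soup; if only the main document is wanted, taking the first body element, as in the usage lines, is one way to get it.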