Problem extracting XML from a Word document with Python

I am trying to use Python to extract the XML from a Word document. This is the code I tried:

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= str(zip.read("word/document.xml"))
    return xmlString

I created a test document and ran the function getXml on it. This is the result:

 b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00B52719"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:t>Test</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidSect="009C4305"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'

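There are some obvious problems. One is that the XML starts with a "b" and ends with an apostrophe. The other is that there is a "\r" after the first set of angle brackets.

My end goal is to modify the XML in order to create a new Word document (see this question), but the anomalies in the extracted XML prevent me from doing that.

Does anyone know why the extracted XML has these odd features, and how to remove them?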

EDIT: I tried using the lxml module to parse this XML, but I just get different errors.

I created a new function, getXmlTree:

from lxml import etree
def getXmlTree(xmlString):
    return etree.fromstring(xmlString)

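Then I ran etree.tostring(getXmlTree(getXml("test.docx")),pretty_print=True) and got much more reasonable XML back.

The problem appears when I try to create a new Word document. I wrote the following function to turn the XML back into a Word document (shamelessly stolen from here):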

import zipfile
from lxml import etree
import os
import tempfile
import shutil
def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        xmlString = etree.tostring(xmlContent,pretty_print=True)
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

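Before trying to create a genuinely new Word document, I wanted to see whether I could create a copy of the original test document by passing xmlContent = getXmlTree(getXml("test.docx")) as the argument to the function above. However, when I run the code, I get an error: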

f.write(xmlString)
TypeError: must be str, not bytes

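Using f.write(str(xmlString)) instead did not help; it created a new Word document, but Word crashes when I try to open it.

EDIT2: I tried running the code above with f.write(xmlString.decode("utf-8")), but that did not help either; Word still crashed.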

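My guess is that the XML is not being encoded correctly. First, write the document file as binary by opening it with "wb" as the mode. Second, tell etree.tostring() which encoding to use and have it include the XML declaration.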

with open(os.path.join(tmpDir, "word/document.xml"), "wb") as f:
    xmlBytes = etree.tostring(xmlContent, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    f.write(xmlBytes)
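
As for the b'...' wrapper and the escaped "\r\n" in the first output: in Python 3, ZipFile.read() returns bytes, and calling str() on a bytes object produces its repr (the b prefix, the quotes and the escape sequences) rather than decoded text. Keeping the data as bytes from start to finish avoids this, and etree.fromstring() accepts bytes directly. Below is a minimal sketch of that idea using only the standard library. The names read_document_xml, write_docx_copy and DOCUMENT_PART are made up for illustration, and the sketch copies archive members directly instead of going through a temporary directory, which is a different approach from the function in the question.

```python
import zipfile

# Hypothetical constant: path of the main document part inside a .docx archive.
DOCUMENT_PART = "word/document.xml"


def read_document_xml(docx_path):
    """Return the document part as raw bytes, without wrapping it in str()."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read(DOCUMENT_PART)


def write_docx_copy(original_docx, xml_bytes, new_docx):
    """Copy every member of the archive, substituting xml_bytes for the document part."""
    with zipfile.ZipFile(original_docx) as src, \
            zipfile.ZipFile(new_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            # Everything stays bytes, so no text-mode encoding is involved.
            data = xml_bytes if name == DOCUMENT_PART else src.read(name)
            dst.writestr(name, data)
```

With these helpers, the round trip in the question would look roughly like tree = etree.fromstring(read_document_xml("test.docx")), followed by write_docx_copy("test.docx", etree.tostring(tree, encoding="UTF-8", xml_declaration=True), "copy.docx"). If you need the XML as text for display, xml_bytes.decode("utf-8") gives a proper str, whereas str(xml_bytes) gives the repr seen above.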
