Python-docx and ElementTree: how to find the position of a hyperlink in a paragraph

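I am using python-docx and ElementTree to convert a Word document to XML, and it works well except for hyperlinks.

I am able to find which python-docx paragraphs contain a hyperlink, but if the hyperlink sits in the middle of the paragraph text, I don't know where to render it when writing the output XML.

Is there a way to loop over all the elements in a paragraph? If I understand correctly, looping over the runs only considers the <w:r> elements, so I expect my hyperlink element to sit between two runs. How do I know which two?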

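python-docx uses lxml to handle its underlying XML. I expect you'll do better if you stick with that rather than bringing in Python's xml.etree.ElementTree, if that is what you mean by ElementTree.

For a paragraph, you can get the underlying XML as a string like this: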

for paragraph in document.paragraphs:
    print(paragraph._p.xml)

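You can also use all the other lxml.etree._Element methods, along with a .xpath() method that python-docx overrides so that you can write expressions with namespace prefixes instead of full namespace URLs, as in paragraph._p.xpath("w:rPr").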
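To answer the "which two runs" part directly, one option is to walk the direct children of the paragraph element in document order instead of looping over paragraph.runs. The following is a minimal sketch, not the library's own API: the helper name iter_paragraph_pieces and the sample XML are made up for illustration, and the standard-library parser is used only so the sketch runs without a .docx file. Since lxml elements support the same iteration, .tag, .get() and .iter() calls, passing paragraph._p should work the same way.

```python
import xml.etree.ElementTree as ET

# WordprocessingML and relationships namespaces used in document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def iter_paragraph_pieces(p_element):
    """Yield (kind, rel_id, text) for each direct w:r or w:hyperlink child,
    in the order they appear in the paragraph (hypothetical helper)."""
    for child in p_element:
        if child.tag == f"{{{W}}}r":
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            yield ("run", None, text)
        elif child.tag == f"{{{W}}}hyperlink":
            # The visible link text lives in the runs nested inside w:hyperlink
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            yield ("hyperlink", child.get(f"{{{R}}}id"), text)


# Made-up paragraph: a run, a hyperlink, then another run
SAMPLE = f"""<w:p xmlns:w="{W}" xmlns:r="{R}">
  <w:r><w:t xml:space="preserve">Before </w:t></w:r>
  <w:hyperlink r:id="rId9"><w:r><w:t>link</w:t></w:r></w:hyperlink>
  <w:r><w:t xml:space="preserve"> after</w:t></w:r>
</w:p>"""

pieces = list(iter_paragraph_pieces(ET.fromstring(SAMPLE)))
for kind, rel_id, text in pieces:
    print(kind, rel_id, repr(text))
```

Because the pieces come back in document order, the hyperlink's position relative to the surrounding runs is simply its index in the list.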

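I know I'm a bit late, but maybe someone will find this answer useful. Suppose you have a paragraph in an MS Word document that contains a hyperlink, like this:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. https://www.google.com/ The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using "Content here, content here", making it look like readable English.

You can check how it looks in XML either 1) by changing the extension of the .docx file to .docx.zip and opening the archive, or 2) by printing the XML with print(paragraph._p.xml). When you look at the document.xml file, you will see something like this: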

<w:hyperlink w:history="1" r:id="rId9">
  <w:r w:rsidR="000D6596" w:rsidRPr="00302570">
    <w:rPr>
      <w:rStyle w:val="Hipercze"/>
      <w:rFonts w:cs="Arial"/>
      <w:spacing w:val="-4"/>
    </w:rPr>
    <w:t>https/google.com</w:t>
  </w:r>
</w:hyperlink>
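As an alternative to renaming the file, the archive can be read in place, since a .docx file is a ZIP package. This is a small sketch using only the standard library; the helper name read_docx_part is hypothetical, and the in-memory archive below is a stand-in for a real document so that the example runs on its own.

```python
import io
import zipfile


def read_docx_part(docx_file, part_name="word/document.xml"):
    """Return the text of one part inside a .docx package (hypothetical helper).

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(part_name).decode("utf-8")


# Stand-in for a real .docx: a ZIP with a placeholder word/document.xml
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document>placeholder</document>")
buffer.seek(0)

xml_text = read_docx_part(buffer)
print(xml_text)
```

With a real file you would pass its path instead of the buffer and search the returned text for w:hyperlink.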

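Then you can find the relationship IDs of all the hyperlinks in the document (if you have more than one hyperlink, you will probably want to save the rIds in a list):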

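import docx
from docx.oxml.ns import qn

document = docx.Document(path)  # path: location of your .docx file

hyperlink_rel_ids = []
for paragraph in document.paragraphs:
    # w:hyperlink elements that are direct children of this paragraph
    for hyperlink in paragraph._p.xpath("./w:hyperlink"):
        # r:id is the relationship ID that points at the link target
        hyperlink_rel_ids.append(hyperlink.get(qn("r:id")))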

Once you have the rId, you can access the link, delete it, modify it, and so on. Here is another way to get the rId:

import docx
from docx.opc.constants import RELATIONSHIP_TYPE as RT

link_text = 'https://www.google.pl/'
document = docx.Document(path)
rels = document.part.rels
for rel_id, rel in rels.items():
    if rel.reltype == RT.HYPERLINK and rel._target == link_text:
        # the link target is the same as the one you are looking for;
        # rel_id is its rId, so do something with it here
        pass
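The same rId-to-URL lookup can be illustrated without python-docx by reading the relationships part of the package, which normally lives at word/_rels/document.xml.rels. This is only a sketch under that assumption: the helper name hyperlink_targets and the sample relationships XML are made up for illustration, and it uses a different technique from the document.part.rels approach above.

```python
import xml.etree.ElementTree as ET

# Namespace of the .rels file and the relationship type used for hyperlinks
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
HYPERLINK_TYPE = OFFICE_REL + "/hyperlink"


def hyperlink_targets(rels_xml):
    """Map each hyperlink relationship ID to its target URL (hypothetical helper)."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{PKG_REL}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }


# Made-up relationships part with one styles entry and one hyperlink entry
SAMPLE_RELS = f"""<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId1" Type="{OFFICE_REL}/styles" Target="styles.xml"/>
  <Relationship Id="rId9" Type="{HYPERLINK_TYPE}"
                Target="https://www.google.com/" TargetMode="External"/>
</Relationships>"""

targets = hyperlink_targets(SAMPLE_RELS)
print(targets)
```

The resulting dictionary lets you turn the r:id found on a w:hyperlink element into the URL to write into your output XML.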
