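I have a .docx file that contains many paragraphs and tables, for example:
- par1
  - table1
  - table2
  - table3
- par2
  - table1
  - table2
  - 2.1 par21
    - table1
    - table2
I need to iterate over all of these objects and build a dictionary, possibly in JSON format, like this:
{par1: [table1, table2, table3], par2: [table1, table2, {par21: [table1, table2]}]}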
from docx.api import Document

filename = "test.docx"
document = Document(docx=filename)
for table in document.tables:
    print(table)
for para in document.paragraphs:
    print(para.text)
How can I associate each paragraph with the tables that follow it?
Can you suggest an approach?
from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, _Row, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    # Walk the direct XML children and wrap each one in its proxy class.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

document = Document('test.docx')
for block in iter_block_items(document):
    # print(block.text if isinstance(block, Paragraph) else '<table>')
    if isinstance(block, Paragraph):
        print(block.text)
    elif isinstance(block, Table):
        # Print each table row as one tab-separated line of cell text.
        for row in block.rows:
            row_data = []
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    row_data.append(paragraph.text)
            print("\t".join(row_data))
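If the goal is JSON rather than printed output, each table has to become plain data first. A minimal sketch of that step follows; `table_to_rows` is a name made up for this example, and it only assumes the `rows`, `cells`, and `text` attributes that python-docx tables, rows, and cells expose. The stand-in objects exist only so the snippet runs without a .docx file.

```python
from types import SimpleNamespace

def table_to_rows(table):
    """Return the table as a list of rows, each row a list of cell texts."""
    return [[cell.text for cell in row.cells] for row in table.rows]

def _fake_row(*texts):
    # Stand-in for a python-docx row: just an object with a .cells list.
    return SimpleNamespace(cells=[SimpleNamespace(text=t) for t in texts])

# Stand-in for a python-docx Table with two rows and two columns.
fake_table = SimpleNamespace(rows=[_fake_row("a", "b"), _fake_row("c", "d")])
rows = table_to_rows(fake_table)
print(rows)
```

With a real document, the same function can be called on every `Table` that `iter_block_items` yields.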
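Not sure if this helps, but here is my approach:

def printTables(doc):
    # Works for a Document and for a cell, since both expose .tables.
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text)
                # Recurse into any tables nested inside this cell.
                printTables(cell)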
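This function returns the paragraphs and tables of a document, or of part of a document (a table cell can itself contain paragraphs and tables):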

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    # https://github.com/python-openxml/python-docx/issues/40
    # Imported locally so this Document class does not shadow docx.Document.
    from docx.document import Document
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import _Cell, Table
    from docx.text.paragraph import Paragraph

    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    # print('parent_elm: ' + str(type(parent_elm)))
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)  # No recursion, return tables as tables
            # table = Table(child, parent)  # Use recursion to return tables as paragraphs
            # for row in table.rows:
            #     for cell in row.cells:
            #         yield from iter_block_items(cell)

Now, to use it, put the code that fills your dictionary where it says "Do some logic here":

from docx import Document
from docx.text.paragraph import Paragraph

document = Document(filepath)  # filepath: path to your .docx file
for iter_block_item in iter_block_items(document):  # Iterate over paragraphs and tables
    # print('iter_block_item type: ' + str(type(iter_block_item)))
    if isinstance(iter_block_item, Paragraph):
        paragraph = iter_block_item  # Do some logic here
    else:
        table = iter_block_item  # Do some logic here

Note: @Elayaraja Dev's answer cannot be edited, so this answer gives the current form of iter_block_items (updated because the docx internals changed).
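
For the "Do some logic here" part, one possible shape is sketched below. It is only an illustration, not part of python-docx: `group_tables_by_paragraph` and its `convert` parameter are names invented here. It assumes that every non-empty paragraph starts a new group, that tables have a `rows` attribute and paragraphs a `text` attribute, and it produces a flat dictionary with no nesting. The stand-in objects are there only so the snippet runs without a .docx file.

```python
import json
from types import SimpleNamespace

def group_tables_by_paragraph(blocks, convert=lambda table: table):
    """Map each non-empty paragraph text to the tables that follow it."""
    grouped = {}
    current = None  # text of the most recent non-empty paragraph
    for block in blocks:
        if hasattr(block, "rows"):  # duck-typed check for a table
            if current is not None:  # tables before any paragraph are skipped
                grouped[current].append(convert(block))
        else:
            text = block.text.strip()
            if text:  # empty paragraphs do not start a new group
                current = text
                grouped.setdefault(current, [])
    return grouped

# Stand-ins for the paragraphs and tables yielded by iter_block_items.
blocks = [
    SimpleNamespace(text="par1"),
    SimpleNamespace(rows=[], name="table1"),
    SimpleNamespace(rows=[], name="table2"),
    SimpleNamespace(text=""),
    SimpleNamespace(text="par2"),
    SimpleNamespace(rows=[], name="table1"),
]
result = group_tables_by_paragraph(blocks, convert=lambda t: t.name)
print(json.dumps(result))
```

With a real file you would pass `iter_block_items(document)` as `blocks`, and a `convert` function that turns each table into something JSON can serialize, such as a list of rows of cell text. Paragraphs with identical text end up sharing one key.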
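
The python-docx library does not implement such a method yet, but there is a workaround for iterating over all elements of a docx in the order they appear: https://github.com/python-openxml/python-docx/issues/40
You can iterate over all of them, check whether each object is an instance of a table or of a paragraph, and build your logic on that.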
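
To get the nested shape from the question (par21 inside par2), that logic also needs to know how deep each paragraph sits. The sketch below is one way to do it with a stack. It is not something python-docx provides: `build_outline` and `heading_level` are invented names, and it assumes section paragraphs use styles named "Heading 1", "Heading 2", and so on (readable through `paragraph.style.name`), which may not hold for your document.

```python
import json

def heading_level(style_name):
    """Return N for a style named "Heading N", else None (assumed naming)."""
    prefix = "Heading "
    suffix = style_name[len(prefix):]
    if style_name.startswith(prefix) and suffix.isdigit():
        return int(suffix)
    return None

def build_outline(items):
    """Nest items given as ("heading", level, title) or ("table", payload)."""
    root = {}
    stack = []  # (level, children list) for every section still open
    for item in items:
        if item[0] == "heading":
            _, level, title = item
            # Close sections at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            children = []
            if stack:
                stack[-1][1].append({title: children})
            else:
                root[title] = children
            stack.append((level, children))
        elif stack:
            # A table belongs to the innermost open section.
            stack[-1][1].append(item[1])
        # Tables that appear before any heading are skipped.
    return root

items = [
    ("heading", 1, "par1"), ("table", "table1"), ("table", "table2"),
    ("table", "table3"),
    ("heading", 1, "par2"), ("table", "table1"), ("table", "table2"),
    ("heading", 2, "par21"), ("table", "table1"), ("table", "table2"),
]
outline = build_outline(items)
print(json.dumps(outline))
```

The `items` list would be produced while iterating the blocks: emit a "heading" tuple when `heading_level(paragraph.style.name)` returns a number, and a "table" tuple for each table.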