python-docx: getting the tables that belong to a paragraph



I have a .docx file that contains many paragraphs and tables, for example:

  1. par1
    • table1
    • table2
    • table3
  2. par2

    • table1
    • table2

    2.1 par21

    • table1
    • table2

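I need to iterate over all of these objects and build a dictionary, possibly in JSON format, for example:

{par1: [table1, table2, table3], par2: [table1, table2, {par21: [table1, table2]}]}

This is what I have so far:

from docx.api import Document

filename = "test.docx"
document = Document(docx=filename)
for table in document.tables:
    print(table)
for paragraph in document.paragraphs:
    print(paragraph.text)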

How can I associate each paragraph with its tables?

Can you suggest an approach?

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, _Row, Table
from docx.text.paragraph import Paragraph
def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
document = Document('test.docx')
for block in iter_block_items(document):
    #print(block.text if isinstance(block, Paragraph) else '<table>')
    if isinstance(block, Paragraph):
        print(block.text)
    elif isinstance(block, Table):
        for row in block.rows:
            row_data = []
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    row_data.append(paragraph.text)
            print("\t".join(row_data))

Not sure whether this helps, but here is my approach:

def printTables(doc):
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text)
                printTables(cell)
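
If you need the text back as data rather than printed, a variant that returns a list may be easier to build on. This is only a sketch: `collect_table_text` is a hypothetical name, and it relies on nothing more than the `.tables`, `.rows`, `.cells` and `.paragraphs` attributes already used above.

```python
def collect_table_text(container):
    """Return the text of every paragraph found in the tables of *container*.

    *container* is anything with a ``.tables`` attribute, such as a
    python-docx Document or a table cell. Nested tables are visited too.
    """
    texts = []
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # Text that sits directly in this cell.
                texts.extend(paragraph.text for paragraph in cell.paragraphs)
                # A cell may itself contain tables, so recurse into it.
                texts.extend(collect_table_text(cell))
    return texts
```

For each cell, its own paragraphs come first and the text of any nested tables follows. As far as I know, python-docx reports a horizontally merged cell once per column it spans, so merged cells may show up more than once in the result.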

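This function yields the paragraphs and tables of a document or of a part of a document (a table cell can itself contain paragraphs and tables):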

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    # https://github.com/python-openxml/python-docx/issues/40
    from docx.document import Document
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import _Cell, Table
    from docx.text.paragraph import Paragraph
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    # print('parent_elm: '+str(type(parent_elm)))
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)  # No recursion, return tables as tables
        # table = Table(child, parent)  # Use recursion to return tables as paragraphs       
        # for row in table.rows:
        #     for cell in row.cells:
        #         yield from iter_block_items(cell)          

Now, to use it, put the code that fills your dictionary where it says "Do some logic here":

from docx import Document
from docx.text.paragraph import Paragraph

document = Document(filepath)  # filepath: path to your .docx file
for iter_block_item in iter_block_items(document):  # Iterate over paragraphs and tables
    # print('iter_block_item type: '+str(type(iter_block_item)))
    if isinstance(iter_block_item, Paragraph):
        paragraph = iter_block_item  # Do some logic here
    else:
        table = iter_block_item      # Do some logic here

Note: @Elayaraja Dev's answer cannot be edited; this answer gives the current form of iter_block_items (updated because the docx internals changed).

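No such method has been implemented in the python-docx library yet, but there is a workaround for iterating over all elements of a docx in the order they appear: https://github.com/python-openxml/python-docx/issues/40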

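You can try iterating over all of them, checking whether each object is an instance of a table or of a paragraph, and building your logic on that.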
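
As a rough sketch of that logic for the dictionary in the question: walk the blocks in document order, remember the most recent non-empty paragraph, and attach every following table to it. The names `docx_items` and `group_tables_by_paragraph` are hypothetical helpers, not part of python-docx, and the sketch assumes that every non-empty paragraph should become a key.

```python
def docx_items(blocks):
    """Adapt python-docx blocks to ("paragraph", text) / ("table", rows) tuples."""
    # Imported lazily so the grouping function below works without python-docx.
    from docx.text.paragraph import Paragraph

    for block in blocks:
        if isinstance(block, Paragraph):
            yield "paragraph", block.text
        else:  # assumed to be a docx.table.Table
            yield "table", [[cell.text for cell in row.cells] for row in block.rows]


def group_tables_by_paragraph(items):
    """Map each non-empty paragraph text to the tables that follow it."""
    grouped = {}
    current = None
    for kind, value in items:
        if kind == "paragraph":
            title = value.strip()
            if title:  # empty paragraphs between tables do not start a new group
                current = title
                grouped.setdefault(current, [])
        elif kind == "table" and current is not None:
            # Tables that appear before the first paragraph are skipped.
            grouped[current].append(value)
    return grouped


# Usage with a real file, assuming an iter_block_items function like the one
# linked in the GitHub issue:
#   import json
#   grouped = group_tables_by_paragraph(docx_items(iter_block_items(document)))
#   print(json.dumps(grouped, ensure_ascii=False, indent=2))
```

This gives a flat dictionary only. Nesting par21 under par2 would need some notion of heading level, which this sketch does not try to detect, and paragraphs with identical text end up sharing one key.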
