Reading a .docx file with Python

Does anyone know of a Python library for reading .docx files?

I have a Word document that I am trying to read data from.

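There are a couple of packages that let you do this. Check:

python-docx.
docx2txt (note that it does not seem to work with .doc). Reportedly, it extracts more information than python-docx. From the original documentation: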

import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")

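textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, you can access their contents by opening them as zip archives. This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files. In that case, you will likely need to convert .doc to .docx first. antiword is an option.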
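To illustrate the zip-archive point, here is a minimal sketch that uses only the standard library. It assumes the main body text lives in word/document.xml inside the archive and that paragraphs and text runs use the standard WordprocessingML namespace. The helper name read_docx_paragraphs is made up for this example and is not part of any of the packages mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file.

    Hypothetical helper for illustration: `path` may be a filename or a
    binary file-like object, since zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(path) as archive:
        # The main document body is stored as XML inside the zip archive.
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> elements, one per run.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs


if __name__ == "__main__":
    import sys

    # Usage: python read_docx.py file.docx
    if len(sys.argv) > 1:
        for line in read_docx_paragraphs(sys.argv[1]):
            print(line)
```

This only looks at the main body part, so headers, footers, footnotes and images are ignored. For anything beyond plain text, one of the libraries above is probably the better choice.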

python-docx can read as well as write.

import docx

doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)

Now all paragraphs will be in the list allText.

Thanks to "Automate the Boring Stuff with Python" by Al Sweigart.

Check out this library, which allows reading .docx files: https://python-docx.readthedocs.io/en/latest/

You should use the python-docx library available on PyPI. Then you can use the following:

import docx

doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)

A quick search of PyPI turns up the docx package.

import docx
def main():
    try:
        doc = docx.Document('test.docx')  # Creating word reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join once after the loop, separating paragraphs with newlines.
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()

Don't forget to install python-docx first (pip install python-docx).
