Getting text from DOC and DOCX

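I'm using a computer with Windows 7 and Python 3.3. My organization has thousands of documents that are not organized in any way. I want to write a program that opens DOC/DOCX files, searches their text for certain keywords, and then rearranges the documents. I'm looking for a way to search the text of Word files (DOC/DOCX) for certain words. It must run on Windows, and it must handle both DOC and DOCX.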

Any ideas?

A .docx document is a zip archive in OpenXML format, so you first have to uncompress it.

After that, you can run:

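# Import the helpers from the legacy docx module (python-docx before 0.3)
from docx import opendocx, search
# Open the .docx file
document = opendocx('A document.docx')
# search() returns True if the string is found
found = search(document, 'your search string')
print(found)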
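
Because a .docx file is a zip archive, the body text can also be read with nothing but the standard library. The following is a minimal sketch, not the method used by the docx module above: it assumes the main text lives in word/document.xml inside the archive, and the helper names docx_text and contains_keyword are made up for illustration. Text stored in other parts of the archive (headers, footers, footnotes) is not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads only word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def contains_keyword(path, keyword):
    """Case-insensitive check for `keyword` in the document body."""
    return keyword.lower() in docx_text(path).lower()
```

Calling contains_keyword('A document.docx', 'your search string') gives a True/False answer similar to search() above, but with a plain substring match rather than whatever matching the docx module performs.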

You can use the textract library. It handles both "doc" and "docx".

import textract
text = textract.process("path/to/file.extension")
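
To connect this to the keyword search in the question, here is a rough sketch that walks a folder, extracts each file's text with textract.process, and records which keywords appear. The function names find_keywords and scan_folder are hypothetical, and decoding the result as UTF-8 is an assumption about textract's default output.

```python
import os


def find_keywords(text, keywords):
    """Return the keywords that occur in `text`, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def scan_folder(folder, keywords):
    """Map each .doc/.docx path under `folder` to the keywords it contains."""
    # Third-party import kept local so find_keywords works without textract
    import textract

    hits = {}
    for dirpath, _dirs, filenames in os.walk(folder):
        for name in filenames:
            if name.lower().endswith((".doc", ".docx")):
                path = os.path.join(dirpath, name)
                # textract.process returns bytes, so decode before searching
                text = textract.process(path).decode("utf-8", errors="replace")
                hits[path] = find_keywords(text, keywords)
    return hits
```

The returned dictionary could then drive the reorganizing step, for example by moving each file into a folder named after its first matching keyword.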

You can even use 'antiword' (sudo apt-get install antiword), convert the doc to docx first, and then read it with docx2txt.

antiword filename.doc > filename.docx

In the end, textract uses antiword in the backend.
