Python Unicode and .txt file issue

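To keep it short: I'm writing a Python script that asks the user to drag and drop a .docx file, which is then converted to .txt. Python looks for keywords in the .txt file and prints them to the shell. I keep running into UnicodeDecodeError, codec charmap, etc. How can I get past this? Maybe have Python skip the characters it can't decode and keep reading the rest? Here is my code: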

import sys
import os
import codecs
filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []
with open(filename, "rb") as file:
        for line in file:
            for word in line.split():
                word.decode("charmap")
                file_words.append(word)
comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)
def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output
comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)
key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1
Threshold = word_count / key_count
if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")
file.close()

And the output:

Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for

In Python 3, don't open text files in binary mode. By default, the file is decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):

with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)

Or specify the encoding:

with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)

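You *do* need to know the encoding of the file. open has other options as well, including errors='ignore' or errors='replace', but if you know the correct encoding you shouldn't get errors.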
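To make the difference between those options concrete, here is a minimal sketch. The sample text, the temporary file, and the helper name read_words are made up for illustration; only open() and its encoding and errors parameters come from the answer above. It writes a few cp1252 bytes to a file and reads them back four ways:

```python
import os
import tempfile

# Hypothetical sample: "résumé for NGA" encoded as cp1252.
# The byte 0xE9 (é in cp1252) is not valid on its own in UTF-8.
raw = "r\u00e9sum\u00e9 for NGA".encode("cp1252")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    sample_path = tmp.name

def read_words(path, encoding, errors="strict"):
    """Return the whitespace-separated words of a text file."""
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read().split()

# Wrong encoding with the default errors='strict': decoding fails.
try:
    read_words(sample_path, "utf8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Wrong encoding, but undecodable bytes are silently dropped.
ignored = read_words(sample_path, "utf8", errors="ignore")

# Wrong encoding, undecodable bytes become U+FFFD replacement characters.
replaced = read_words(sample_path, "utf8", errors="replace")

# Correct encoding: nothing is lost and no error handler is needed.
correct = read_words(sample_path, "cp1252")

os.remove(sample_path)
```

With errors='ignore' or errors='replace' the script keeps running, but the accented word comes back damaged, which is why knowing the real encoding is the better fix.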

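As others have said, posting a sample text file that reproduces the error, along with the error traceback, would help diagnose your specific problem.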

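In case anyone cares: it's been a long time, but I wanted to clear this up. Back then I didn't even know the difference between binary files and text files. I eventually found a doc/docx module for Python that made things much easier. Sorry for the headache!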

Posting the code that produces the traceback might make this easier to fix.

I'm not sure this is the only problem, but maybe this would work better:

with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
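The reason this change matters can be shown in a few lines. This is only an illustrative sketch with a hard-coded word rather than a real file: decode() returns a new str and leaves the bytes object untouched, and a bytes object never compares equal to a str, which would explain the empty "Keywords found: []" in the question.

```python
keywords = ['NGA', 'DoD']

raw_word = b'NGA'            # what splitting a line from an "rb" file yields
raw_word.decode("charmap")   # the result is discarded; raw_word is still bytes

# bytes never compare equal to str in Python 3, so this lookup fails.
bytes_match = raw_word in keywords

# Keeping the decoded str makes the comparison work.
text_match = raw_word.decode("charmap") in keywords
```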

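OK, I figured it out. Here is my code. However, when I tried a more complex docx file and converted it to .txt, the whole file consisted of special characters. So now I think I should move to the python-docx module, since it handles Word documents as the XML files they are. I added "encoding = 'charmap'":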

with open(filename, encoding = 'charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
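For context on why a .docx turns into special characters when treated as text: it is a ZIP archive of XML parts, not a text file. The sketch below is not python-docx and not what the poster ended up using; it is a rough standard-library illustration. The helper names docx_words and build_sample_docx are hypothetical, and the approach assumes the body text lives in the w:t elements of word/document.xml, ignoring headers, footers, and footnotes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_words(source):
    """Return the words in the body text of a .docx (path or binary file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    words = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> run elements.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        words.extend(text.split())
    return words

def build_sample_docx():
    """Build a tiny in-memory archive mimicking a .docx layout (demo data only)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>Enterprise work for </w:t></w:r>'
        '<w:r><w:t>NGA</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Running</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer

keywords = ['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
found = [word for word in docx_words(build_sample_docx()) if word in keywords]
```

For real resumes a dedicated library such as python-docx is likely the more robust choice, since it deals with the parts of the format this sketch skips.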
