Python Unicode problem with a .txt file



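To keep it short: I'm writing a Python script that asks the user to drag and drop a .docx file, and the file is converted to .txt. Python then looks for keywords in the .txt file and prints them to the shell. I keep running into UnicodeDecodeError, codec charmap, and so on. How can I get past this? Could Python perhaps skip the characters it cannot decode and keep reading the rest? Here is my code: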

import sys
import os
import codecs
filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []
with open(filename, "rb") as file:
        for line in file:
            for word in line.split():
                word.decode("charmap")
                file_words.append(word)
comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)
def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output
comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)
key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1
Threshold = word_count / key_count
if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")
file.close()

And the output:

Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for

In Python 3, don't open text files in binary mode. By default, the file is decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):

with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
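
To see which encoding that default resolves to on a particular machine, a quick check like the one below may help. It is only a minimal sketch: the value printed depends on the operating system and locale settings, and newer Python versions or UTF-8 mode may default to UTF-8 instead.

```python
import locale

# The locale's preferred encoding, which open() falls back to for text files
# when no encoding= argument is given (on Python versions that use the
# locale default rather than UTF-8).
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding:", default_encoding)
```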

Or specify an encoding:

with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)

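You do need to know the file's encoding. open has other options, including errors='ignore' and errors='replace', but if you know the correct encoding you shouldn't get errors at all.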
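
As a rough illustration of how those errors options behave, here is a small sketch. The sample bytes and the temporary file are made up for the demonstration; they stand in for a resume saved as cp1252 but read as UTF-8.

```python
import os
import tempfile

# Hypothetical sample data: cp1252 bytes containing a "smart" apostrophe (0x92),
# which is not valid UTF-8 on its own.
raw = b"Enterprise experience \x92 meets DoD needs\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='ignore' silently drops bytes that cannot be decoded.
    with open(path, encoding="utf8", errors="ignore") as f:
        ignored = f.read()

    # errors='replace' substitutes U+FFFD for each undecodable byte.
    with open(path, encoding="utf8", errors="replace") as f:
        replaced = f.read()

    # With the correct encoding, nothing is lost.
    with open(path, encoding="cp1252") as f:
        correct = f.read()
finally:
    os.remove(path)

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Both errors settings let the read finish, but they hide the fact that the wrong encoding was chosen, which is why knowing the real encoding is the better fix.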

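As others have said, posting a sample of the text file that reproduces the error, together with the error traceback, would help diagnose your specific problem.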

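In case anyone cares: it's been a long time, but I wanted to clear this up. Back then I didn't even know the difference between binary files and text files. I eventually found a doc/docx module for Python, which made things much easier. Sorry for the headache!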

This would probably be easier to fix if you posted the code that produces the traceback.

I'm not sure this is the only problem, but maybe this would work better:

with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
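
One likely reason the original script printed an empty list, sketched here on the assumption that it ran on Python 3: decode() returns a new str rather than changing word in place, so the list ended up holding bytes objects, and a bytes object never compares equal to a str keyword. The sample line below is invented for the demonstration.

```python
keywords = ['NGA', 'DoD', 'Running']
line = b"Worked with NGA and DoD teams"  # hypothetical line read in binary mode

undecoded = []
decoded = []
for word in line.split():
    word.decode("charmap")                  # result is discarded; word is still bytes
    undecoded.append(word)
    decoded.append(word.decode("charmap"))  # keep the returned str instead

# bytes never equal str in Python 3, so nothing matches here.
matches_bytes = [w for w in undecoded if w in keywords]
# After decoding, the words are str and can match the keywords.
matches_str = [w for w in decoded if w in keywords]

print(matches_bytes)
print(matches_str)
```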

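OK, I figured it out. My code is below. However, I then tried a more complex docx file, and after converting it to .txt the whole file consisted of special characters. So now I think I should move to the python-docx module, since it handles Word documents as the XML files they are. I added encoding = 'charmap':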

with open(filename, encoding = 'charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
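
If the .txt file was produced by renaming the .docx rather than exporting it as plain text (an assumption; the post does not say how the conversion was done), the special characters would be explained by the fact that a .docx file is a ZIP archive of XML parts. The sketch below uses only the standard library to pull paragraph text out of word/document.xml. The function name docx_paragraphs and the sample document are made up for illustration, and the python-docx module mentioned above handles far more of the format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs

# Build a tiny in-memory archive holding only the main document part,
# just to exercise the function without a real resume file.
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Supported NGA and </w:t></w:r><w:r><w:t>DoD programs</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Enterprise architecture</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

keywords = ['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
words = {word for para in docx_paragraphs(buffer) for word in para.split()}
found = [k for k in keywords if k in words]
print("Keywords found:", found)
```

Passing a real path such as the dragged-in .docx file to docx_paragraphs should work the same way, since zipfile.ZipFile accepts either a path or a file object.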
