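To keep it short: I'm writing a Python script that asks the user to drag and drop a .docx file that has been converted to .txt. Python looks for keywords in the .txt file and prints them to the shell. I'm running into UnicodeDecodeError, codec charmap, and so on. How do I get past this? Could Python perhaps skip the characters it can't decode and keep reading the rest? Here is my code: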
import sys
import os
import codecs
filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            word.decode("charmap")
            file_words.append(word)
comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)
def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output
comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)
key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1
Threshold = word_count / key_count
if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")
file.close()
And the output:
Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for
In Python 3, don't open text files in binary mode. By default, the file is decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):
with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
Or specify the encoding:
with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
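Putting the pieces together, here is a minimal sketch of the keyword check with the file opened in text mode. The function name find_keywords, the utf-8 default, and the punctuation stripping are my own assumptions for illustration and are not part of the original script.

```python
import string


def find_keywords(path, keywords, encoding="utf-8"):
    """Return the keywords found in the file, each once, in order of first appearance."""
    wanted = set(keywords)
    found = []
    # Text mode: open() decodes the bytes to str using the given encoding.
    with open(path, encoding=encoding) as f:
        for line in f:
            for word in line.split():
                # Assumption: strip surrounding punctuation so "DoD," matches "DoD".
                word = word.strip(string.punctuation)
                if word in wanted and word not in found:
                    found.append(word)
    return found
```

Because every word is already a str here, the comparison against the keyword list works without any explicit decode step.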
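You do need to know the file's encoding. open has other options as well, including errors='ignore' and errors='replace', but if you know the correct encoding you shouldn't get errors in the first place.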
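To make the difference concrete, here is a small sketch that reads the same bytes three ways. The sample bytes and the helper name read_with are made up for the demonstration; 0x92 is the curly apostrophe in cp1252 and is not valid on its own in UTF-8.

```python
import os
import tempfile

# Sample data (an assumption for this demo): cp1252 text with a curly apostrophe.
raw = b"Newton\x92s resume"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(raw)


def read_with(path, encoding, errors):
    """Read the whole file as text with the given encoding and error handler."""
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()


ignored = read_with(path, "utf-8", "ignore")    # the undecodable byte is dropped
replaced = read_with(path, "utf-8", "replace")  # the undecodable byte becomes U+FFFD
correct = read_with(path, "cp1252", "strict")   # right encoding, so nothing to handle
os.remove(path)
```

Note that 'ignore' silently loses data, so it is a workaround for files whose encoding you can't find out, not a substitute for the right encoding.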
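As others have said, posting a sample of the text file that reproduces the error, along with the error traceback, would help diagnose your specific problem.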
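In case anyone cares: it's been a long time, but I wanted to clear this up. Back then I didn't even know the difference between a binary file and a text file. I eventually found a doc/docx module for Python, which made things much easier. Sorry for the headache!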
Posting the code that produces the traceback would probably make this easier to fix.
I'm not sure this is the only problem, but this might work better:
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
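For what it's worth, this also explains the empty "Keywords found: []" in the question: bytes.decode returns a new str and leaves the bytes object untouched, and in Python 3 a bytes object never compares equal to a str. A tiny sketch (the variable names are only for illustration):

```python
keywords = ["NGA", "DoD"]
raw_word = b"NGA"  # what you get when iterating a file opened with "rb"

raw_word.decode("charmap")             # result is thrown away; raw_word is still bytes
match_as_bytes = raw_word in keywords  # False, because b"NGA" != "NGA"

text_word = raw_word.decode("charmap")  # keep the decoded str this time
match_as_text = text_word in keywords   # True
```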
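OK, I figured it out. Here is my code. However, I then tried a DOCX file that seemed to be more complex, and when it was converted to .txt the whole file consisted of special characters. So now I think I should go with the python-docx module, since it handles Word documents as the XML files they are. I added "encoding = 'charmap'":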
with open(filename, encoding = 'charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
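On the point about Word documents being XML: a .docx file is a zip archive whose main text sits in word/document.xml, which is why a renamed .docx looks like special characters. Below is a rough standard-library sketch of pulling the words out directly. This is not python-docx's API; the function name docx_words is hypothetical, and it only covers the main body (no headers, footers, or footnotes).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_words(path):
    """Return the whitespace-separated words of a .docx file's main body."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    words = []
    # Each w:p element is a paragraph; its text is stored in w:t elements.
    for paragraph in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        words.extend(text.split())
    return words
```

The returned list can be fed into the same keyword comparison as file_words, with no encoding guesswork, because the XML parser handles the decoding.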