Removing punctuation from a large text file (task: count words, remove common words, remove punctuation)



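I'm new to programming. I need some help understanding code that removes punctuation from a large text file. I came across a few solutions and tried to code it as follows: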

import string

fname = input("Enter file name: ")
# if len(fname) < 1: fname = "98-0.txt"  # Can Enter without typing, but not working on shell
fh = open(fname)
# 1. Read in each word from the file,
# 1a. Making it lower case
# 1b. Removing punctuation. (Optionally, skip common words).
# 1c. For each remaining word, add the word to the data structure or
#     update your count for the word
counts = dict()
for line in fh:
    line = line.strip()  # 1
    line = line.lower()  # 1a.
    line = line.split()
    # print(string.punctuation)  # Provides all the different punctuations that might exist in a text
    print(line.translate(line.maketrans(" ", " ", string.punctuation)))
    # print(words)

However, I get a traceback:

Traceback (most recent call last):
File "wcloud.py", line 29, in <module>
print(line.translate(line.maketrans(" ", " ", string.punctuation)))
AttributeError: 'list' object has no attribute 'translate'

I tried updating Atom to use the latest Python (I hope I did that correctly... I'm not sure).

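As already pointed out, line = line.split() turns your original string into a list of strings, i.e. it splits the line into words. Because translate and maketrans are string methods, not list methods, you need to loop over the items in the list: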

for word in line:
    print(word.translate(str.maketrans("", "", string.punctuation)))
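Note that translate returns a new string rather than changing the word in place, so the result has to be printed or stored. A minimal sketch of the per-word approach on a made-up sample line (the names table, words and cleaned are illustrative, not from the question):

```python
import string

# Table that maps every ASCII punctuation character to None (i.e. deletes it).
table = str.maketrans("", "", string.punctuation)

line = "Hello, world! It's a test."  # made-up sample line
words = line.lower().split()

# str.translate returns a new string, so collect the results in a new list.
cleaned = [word.translate(table) for word in words]
print(cleaned)  # ['hello', 'world', 'its', 'a', 'test']
```

Building the table once, outside the loop, avoids recreating it for every word.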

Or, better, remove the punctuation before splitting the line:

line = line.translate(str.maketrans("", "", string.punctuation))
line = line.split()

In the latter case you still need to loop over the words in line to add them to counts. Alternatively, take a look at collections.Counter, which can do that work for you :)
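Putting the pieces together, here is one possible sketch of the whole task using collections.Counter. The function name count_words, the COMMON_WORDS set and the sample lines are assumptions made for illustration; swap in your own list of common words.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Hypothetical list of common words to skip; adjust to taste.
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def count_words(lines):
    """Count words in an iterable of text lines (e.g. an open file)."""
    counts = Counter()
    for line in lines:
        # Lower-case, strip punctuation, then split into words.
        words = line.lower().translate(PUNCT_TABLE).split()
        # Counter.update adds one to the count of each word it is given.
        counts.update(w for w in words if w not in COMMON_WORDS)
    return counts

sample = ["The cat, the dog; and the cat!", "A dog's toy."]  # made-up input
counts = count_words(sample)
print(counts.most_common(2))
```

For a real file you could pass the open file object directly, for example with open(fname, encoding="utf-8") as fh: counts = count_words(fh), assuming the file is UTF-8 encoded. Keep in mind that string.punctuation only covers ASCII punctuation, so characters such as curly quotes would survive this step.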
