Getting a line break at the end of each tweet in Python



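Consider my Python code below. The file minemaggi.txt contains tweets, and I am trying to remove the stop words, but in the output file the tweets are not on separate lines. I would also like to remove all links from the text file. How can I do that?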

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import codecs
import nltk
stopset = set(stopwords.words('english'))
writeFile = codecs.open("outputfile.txt", "w", encoding='utf-8')
with codecs.open("minemaggi.txt", "r", encoding='utf-8') as f:
    line = f.read()
    new = '\n'
    tokens = nltk.word_tokenize(line)
    tokens = [w for w in tokens if not w in stopset]
    for token in tokens:
        writeFile.write('{}{}'.format(' ', token))
    writeFile.write('{}'.format(new))

You need to explicitly add a newline character to the string you write to the file, like this:

writeFile.write('{}{}\n'.format(' ', token))
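
As a quick illustration of why the newline has to be explicit: a file object's write() emits exactly the string it is given and never appends a line terminator. Where the newline goes also decides the layout. Written once after each tweet's tokens, it keeps one tweet per line, whereas written with every token it would put each token on its own line. The sketch below is only a demonstration: it uses an in-memory io.StringIO buffer in place of outputfile.txt, and the sample tokens are made up.

```python
import io

# In-memory stand-in for the output file (assumption: not the real outputfile.txt).
buffer = io.StringIO()

# Hypothetical, already-tokenized tweets used only for this demonstration.
tweets = [["love", "maggi"], ["maggi", "banned"]]

for tokens in tweets:
    for token in tokens:
        # write() emits exactly this string; no newline is appended.
        buffer.write(' {}'.format(token))
    # One explicit newline per tweet keeps each tweet on its own line.
    buffer.write('\n')

result = buffer.getvalue()
print(repr(result))
```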

I would use ' '.join() to rejoin the words and then write one line at a time:

with codecs.open("minemaggi.txt", "r", encoding='utf-8') as f:
    # Loop over all lines in the input file (one tweet per line).
    for line in f:
        # As before, remove the stop words ...
        tokens = nltk.word_tokenize(line)
        tokens = [w for w in tokens if w not in stopset]
        # Rejoin the words, separated by one space.
        line = ' '.join(tokens)
        # Write the processed line to the output file.
        writeFile.write('{}\n'.format(line))
# Close the output file so everything is flushed to disk.
writeFile.close()

Hope this helps.
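
The question also asks how to remove links, which the answers above do not cover. One possible approach, sketched here under stated assumptions, is to strip URLs with a regular expression before tokenizing, since a tokenizer may split a URL into several pieces. The pattern below treats anything that starts with http://, https:// or www. and runs up to the next whitespace as a link, which is a simplification. The names URL_PATTERN, STOPSET and clean_tweet are hypothetical, the stop-word set is a tiny stand-in for NLTK's English list, and a plain str.split() replaces nltk.word_tokenize so the example runs without NLTK installed.

```python
import re

# Assumption: a "link" starts with http://, https:// or www. and
# continues up to the next whitespace character.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

# Tiny illustrative stop-word set (hypothetical stand-in for
# set(stopwords.words('english')) from the question).
STOPSET = {'i', 'the', 'is', 'a', 'this', 'at'}


def clean_tweet(line):
    """Remove links and stop words from one tweet and return it as one line."""
    without_links = URL_PATTERN.sub('', line)
    # Plain whitespace split; nltk.word_tokenize could be swapped in here.
    # Unlike the question's code, the comparison is case-insensitive.
    tokens = [w for w in without_links.split() if w.lower() not in STOPSET]
    return ' '.join(tokens)


sample = [
    "I love maggi http://t.co/abc123",
    "Look at this www.example.com/page maggi is back",
]
cleaned = [clean_tweet(line) for line in sample]
for line in cleaned:
    print(line)
```

In the per-line loop from the answer above, the same substitution could be applied to each line just before it is tokenized, so that every cleaned tweet is still written on its own line.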
