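I am cleaning a text file and wrote the code below to remove unwanted characters. My problem is that the final output is displayed as a list of words, when I want it to read as continuous text. I think the problem is in this line, which is meant to remove line breaks by replacing each newline, i.e. "(\n)", with " ":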
Step4 = re.sub(r"(\n)"," ",Step3)
print(Step4)
The full code is below:
f=open("/Applications/Python 3.9/cleaning text.txt",encoding='Latin-1')
raw=f.read()
#print(raw)
import re
import nltk
from nltk import word_tokenize
Data = re.split(r" ",raw)
for D in Data:
    # print(str(raw)+'\n')
    Step1 = re.sub(r"(\.*)","",D)
    # print(Step1)
    Step2 = re.sub(r"(M)","hl",Step1)
    # print(Step2)
    Step3 = re.sub(r"([aa])","[a::]",Step2)
    # print(Step3)
    Step4 = re.sub(r"(\n)"," ",Step3)
    print(Step4)
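I don't think you need to split the whole text word by word into a list. Because the loop calls print() once per word, every word ends up on its own line, which is why the output looks like a list. You can pass the raw data directly to re.sub() instead. If you want to remove whitespace from the beginning or end of the raw data, you can use strip().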
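import re

# Read the whole file into one string; the with-block closes the file afterwards
with open("/Applications/Python 3.9/cleaning text.txt", encoding="Latin-1") as f:
    raw = f.read()

raw = raw.strip()                          # drop leading/trailing whitespace
Step1 = re.sub(r"\.+", "", raw)            # remove periods
Step2 = re.sub(r"M", "hl", Step1)          # replace "M" with "hl"
Step3 = re.sub(r"[aa]", "[a::]", Step2)    # replace "a" with "[a::]"
Step4 = re.sub(r"\n", " ", Step3)          # replace newlines with spaces
print(Step4)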
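To try this without the file, here is a minimal sketch that runs the same substitutions on an in-memory string. The function name `clean_text` and the sample text are made up for illustration, and the sketch assumes the last pattern was meant to match a newline (`\n`), as the question describes.

```python
import re


def clean_text(raw: str) -> str:
    """Apply the substitutions to the whole string at once (hypothetical helper)."""
    text = raw.strip()                       # drop leading/trailing whitespace
    text = re.sub(r"\.+", "", text)          # remove runs of periods
    text = re.sub(r"M", "hl", text)          # replace "M" with "hl"
    text = re.sub(r"[aa]", "[a::]", text)    # same character class as in the question
    text = re.sub(r"\n", " ", text)          # assumed intent: newline -> space
    return text


# Made-up sample input, not taken from the original file
sample = "  Mat sat.\non a mat...\n"
print(clean_text(sample))  # prints: hl[a::]t s[a::]t on [a::] m[a::]t
```

Since print() is called only once here, the result comes out as a single line of text rather than one word per line.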