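I want to remove every single-character token from each document in a corpus, for example stray letters left by typos or non-English characters.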
corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
This is what I tried:
corpus=' '.join( [w for w in corpus.split() if len(w)>1] )
But it did not work. Can anyone help me?
Try the following:
corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)
Output:
['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
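The attempt in the question most likely fails because `corpus` is a list, and a list has no `split()` method, so the call raises an `AttributeError` before any filtering happens. The loop above can also be written as a comprehension. Here is a minimal sketch; the helper name `drop_single_chars` is made up for illustration:

```python
def drop_single_chars(docs):
    """Return a new list in which every whitespace-separated token of length 1 is removed."""
    # Split each document separately, filter its tokens, then rejoin it.
    return [' '.join(w for w in doc.split() if len(w) > 1) for doc in docs]


corpus = ['I like this d hotel room because it was clean.',
          'This hotel is very y close to downtown area.']
cleaned = drop_single_chars(corpus)
print(cleaned)
```

Note that this drops legitimate one-letter words such as "I" and "a" as well.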
This should work for you:
corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
clean_corpus=[]
for sentence in corpus:
    clean_sentence=[]
    parts=sentence.split(" ")
    for part in parts:
        invalid=False
        # Drop single characters unless they are "a", "i" (any case) or a digit
        if (len(part)==1) and (part.lower()!="a") and (part.lower()!="i") and (not part.isdigit()):
            invalid=True
        if not invalid:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)
This removes every single-character word that is not "a", "A", "i", "I", or a digit (1, 2, 3, ...).
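If the set of one-letter words to keep needs to change, the same rule can be wrapped in a function. This is only a sketch: the name `remove_stray_chars` and the `keep` parameter are assumptions for illustration, and it uses `split()` rather than `split(" ")`, so runs of whitespace are collapsed instead of preserved.

```python
def remove_stray_chars(docs, keep=("a", "i")):
    """Drop single-character tokens unless they are in `keep` (case-insensitive) or a digit."""
    keep_set = {k.lower() for k in keep}
    cleaned = []
    for doc in docs:
        tokens = [
            t for t in doc.split()
            # Keep longer tokens, allowed one-letter words, and single digits.
            if len(t) > 1 or t.lower() in keep_set or t.isdigit()
        ]
        cleaned.append(" ".join(tokens))
    return cleaned


corpus = ['I like this d hotel room because it was clean.',
          'This hotel is very y close to downtown area.']
result = remove_stray_chars(corpus)
print(result)
```

Passing `keep=()` makes it drop "I" and "a" too, while single digits are still kept.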
Try it yourself and let me know in the comments whether it works or what could be improved!