How to remove single characters (letters) from a corpus using Python



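I want to remove every single-character token from each document in my corpus. For example, suppose the text contains some typos or non-English letters.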

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']

What I tried is

corpus=' '.join( [w for w in corpus.split() if len(w)>1] )

But it didn't work. Can anyone help me?

Try the following:

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)

Output:

['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
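For context, the attempt in the question fails because `corpus` is a list, and lists have no `split()` method, so the filter has to be applied to each string in turn. A minimal sketch of the same idea as a single list comprehension (the name `cleaned` is only illustrative):

```python
corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]

# Apply the length filter to each document separately,
# then join the surviving tokens back into one string.
cleaned = [' '.join(w for w in doc.split() if len(w) > 1) for doc in corpus]
print(cleaned)
```

Note that this also drops legitimate one-letter words such as "I" and "a".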

This should work for you:

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
clean_corpus=[]
for sentence in corpus:
    clean_sentence=[]
    parts=sentence.split(" ")
    for part in parts:
        invalid=False
        if (len(part)==1) and (part.lower()!="a") and (part.lower()!="i") and (not part.isdigit()):
            invalid=True
        if not invalid:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)

This removes every single-character word that is not "a", "A", "i", "I", or a digit (1, 2, 3, ...).
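The same rule can be sketched more compactly as a reusable function. The function name `remove_single_chars` and the `keep` parameter are hypothetical, introduced here only for illustration. One assumed difference from the loop above: this version uses `split()` with no argument, so runs of whitespace are collapsed to single spaces.

```python
def remove_single_chars(corpus, keep=("a", "i")):
    """Drop one-character tokens unless they are digits or listed in `keep`."""
    cleaned = []
    for sentence in corpus:
        kept = [
            tok for tok in sentence.split()
            # Keep longer tokens, digits, and allowed one-letter words.
            if len(tok) > 1 or tok.isdigit() or tok.lower() in keep
        ]
        cleaned.append(" ".join(kept))
    return cleaned


corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]
result = remove_single_chars(corpus)
print(result)
```

Passing `keep=()` would reproduce the stricter behaviour of the first answer for letters, while still keeping single digits.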

Try it yourself, and let me know in the comments whether it works or what could be improved!
