I want to remove non-English words from a sentence in Python 3.x



I have a bunch of user queries. Some of them also contain garbage tokens, e.g. "I work in Google asdasb asnlkasn", and I only need "I work in Google".

import nltk
import spacy
import truecase

words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(sent):
    # collect the text of every named entity spaCy finds in the sentence
    doc = nlp(sent)
    ner_list = []
    for ent in doc.ents:
        ner_list.append(ent.text)
    return ner_list

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)
# keep a token if it is an English word, is not purely alphabetic
# (numbers, punctuation), or belongs to a recognized named entity
final_sent = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                      if w.lower() in words or not w.isalpha() or w in ner_list)

I tried this, but it does not remove the garbage, because the NER detects "google asdasb asnlkasn" as WORK_OF_ART, or sometimes "asdasb asnlkasn" as PERSON. I have to include NER because the words = set(nltk.corpus.words.words()) corpus does not contain Google, Microsoft, Apple, etc., or any other named-entity values.
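As a quick check (a minimal sketch; it assumes the nltk words corpus has already been downloaded via nltk.download('words')), you can confirm that ordinary words are in the corpus while proper nouns such as Google, per the question, are not, which is why the plain word-list filter drops them:

import nltk
# nltk.download('words')  # uncomment if the corpus is not installed yet

words = set(nltk.corpus.words.words())

# print whether each token's lowercase form appears in the corpus
for w in ["work", "Google", "Microsoft", "asdasb"]:
    print(w, w.lower() in words)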

You can use this to identify your non-words:

import nltk

words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())

Try using that; credit to @DYZ for the answer.

However, since you said you need NER for Google, Apple, etc., and it is causing incorrect identification, what you can do is score those NER predictions with beam parsing. You can then set an acceptable score threshold for the entities and drop anything that falls below it. These nonsense tokens should receive a low-probability classification such as PERSON, and you can drop categories like WORK_OF_ART entirely if you don't need them.

An example of scoring with beam parsing:

import spacy
from collections import defaultdict

# load the same model used in the question
nlp = spacy.load('en_core_web_lg')
text = u'I work in Google asdasb asnlkasn'

# run the pipeline without the regular NER pass
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# beam parse (spaCy v2 API) keeps several candidate analyses with probabilities
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# accumulate a score for every (start, end, label) span across the beams
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

This worked in my test, although the NER failed to recognize this particular case.
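To tie this back to the original filtering, here is a minimal sketch (assuming the entity_scores, doc, text, and threshold variables from the snippet above) that keeps only entity spans whose accumulated score passes the threshold and feeds those tokens into the word-list filter from the question:

import nltk

words = set(nltk.corpus.words.words())

# keep only entity spans whose accumulated beam score passes the threshold
confident_entities = set()
for (start, end, label), score in entity_scores.items():
    if score > threshold:
        for token in doc[start:end]:
            confident_entities.add(token.text)

# same filter as in the question, but using the thresholded entities
final_sent = " ".join(w for w in nltk.wordpunct_tokenize(text)
                      if w.lower() in words or not w.isalpha()
                      or w in confident_entities)
print(final_sent)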
