Handling unknown words when building an NER model



I'm working on a custom named entity recognition model built with the Keras library in Python. I read that I should enumerate all the words that occur so that I end up with vectorized sequences. I've done that:

word2idx = {w: i + 1 for i, w in enumerate(words)}
label2idx = {t: i for i, t in enumerate(labels)}
# CREATING FEATURES(X) AND RESULTS(Y)
max_len = 50 
num_words = len(words)  # number of unique words in the dataset
X = [[word2idx[w[0]] for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]

Here is my final model:

input_word = Input(shape=(max_len,))
model = Embedding(input_dim = num_words, output_dim = 50, input_length = max_len)(input_word)
model = SpatialDropout1D(0.2)(model)
model = Bidirectional(LSTM(units = 5, return_sequences=True, recurrent_dropout = 0.1))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)
model = Model(input_word, out)
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 30, 50)            2187550   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 30, 50)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 30, 10)            2240      
_________________________________________________________________
time_distributed (TimeDistri (None, 30, 11)            121       
=================================================================
Total params: 2,189,911  # LOOK AT THIS NUMBER
Trainable params: 2,189,911
Non-trainable params: 0

My accuracy is 98% and my loss is 0.07. I'm happy with these results, but I run into problems at prediction time because of missing words. For example:

text = "I live in the Ohio and my name is Alex Wright and I work in AvcCC LTD"
text = text.split()
text = [word2idx[w] for w in text]
text = np.array(text)
print(text)
text=text.reshape(1,text.shape[0])
max_len = 50
text = pad_sequences(maxlen=max_len, sequences=text, padding="post", value=num_words-1)
print('PREDICTION')
res = model.predict(text).argmax(axis=-1)[0]
print(res)

Error:

KeyError: 'AvcCC'

The word 'AvcCC' doesn't appear anywhere in my dataset or vocab. How do I handle that?

I want to use this code/model in production. Since my word2idx only contains words from the original data, how do I handle words that aren't in my word2idx vocabulary? For example, my word2idx vocabulary can't possibly contain every first and last name that exists, nor every city/place, every company name, slang, and so on.

My vocabulary had about 40k enumerated words (the number of unique words in my dataset). I then enriched it with 100k+ additional words (I wrote a web crawler that scrapes various kinds of news articles), so my vocabulary now holds about 140k words. Now, instead of enumerating the unique words from the dataset, I load my new word2idx/vocabulary:

word2idx = open('english-vocab.json')
word2idx = json.load(word2idx)
max_len = 50 
num_words = len(word2idx)  # number of words in the loaded vocabulary
X = [[word2idx[w[0]] for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]

The accuracy and loss stayed the same, but the model got slower because of the total parameter count (I can no longer use num_words, since it raises an error; I have to use len(word2idx)):

input_word = Input(shape=(max_len,))
model = Embedding(input_dim = len(word2idx), output_dim = 50, input_length = max_len)(input_word)
model = SpatialDropout1D(0.2)(model)
model = Bidirectional(LSTM(units = 5, return_sequences=True, recurrent_dropout = 0.1))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)
model = Model(input_word, out)
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 30, 50)            5596600   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 30, 50)            0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 10)            2240      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 30, 11)            121       
=================================================================
Total params: 5,598,961 # MUCH BIGGER NUMBER
Trainable params: 5,598,961
Non-trainable params: 0
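The slowdown is explained by the embedding layer alone: its parameter count is `vocab_size * output_dim`, so it grows linearly with the vocabulary. The vocabulary sizes below (~44k and ~112k) are assumptions implied by dividing the embedding parameter counts in the two summaries by `output_dim = 50`:

```python
def embedding_params(vocab_size, output_dim=50):
    # Each vocabulary entry owns one trainable vector of size output_dim.
    return vocab_size * output_dim

print(embedding_params(43751))   # 2187550, matches the first summary
print(embedding_params(111932))  # 5596600, matches the second summary
```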

By building my own word2idx I wanted to handle words missing from the vocabulary, but all I achieved was slowing down the model's training.

How do I deal with this kind of problem? How do I handle missing/non-existent/unknown words?

Posting what Patrick mentioned in the comments as an answer, for the benefit of the community, along with one more way of handling "oov" (out-of-vocabulary) words.

Changing `text = [word2idx[w] for w in text]` to `text = [word2idx.get(w, word2idx["UNKNOWN_WORD"]) for w in text]` avoids the KeyError: every unknown word maps to "UNKNOWN_WORD", which you add to the vocabulary as a new token. Alternatively, `text = [word2idx[w] for w in text if w in word2idx]` simply drops all unrecognized words.
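A minimal sketch of the UNKNOWN_WORD approach on a toy vocabulary (the index layout, with 0 reserved for padding and 1 for unknowns, is an assumption, not part of the original code):

```python
# Toy vocabulary; index 0 is left for padding, 1 for unknown words.
word2idx = {"UNKNOWN_WORD": 1, "i": 2, "live": 3, "in": 4, "ohio": 5}

def encode(tokens, vocab):
    # Unknown tokens fall back to the UNKNOWN_WORD index instead of raising KeyError.
    return [vocab.get(w, vocab["UNKNOWN_WORD"]) for w in tokens]

print(encode(["i", "live", "in", "avccc"], word2idx))  # [2, 3, 4, 1]
```

Because the model then sees a real index for unseen words, the same preprocessing works at training and prediction time.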

Below is how a 300-dimensional embedding matrix is built from loaded pretrained embedding vectors. Any word that is out of vocabulary keeps an all-zero row in the matrix:

embedding_matrix = np.zeros((vocabulary_size, 300))
for word, index in tokenizer.word_index.items():
    if index > vocabulary_size - 1:
        break
    if word in index2word:
        # Copy the pretrained vector; OOV words keep their all-zero row.
        embedding_matrix[index] = pretrained_model[word]
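A self-contained toy version of the loop above; the tiny `pretrained` dict and `word_index` map stand in for a real word2vec/GloVe model and a Keras tokenizer:

```python
import numpy as np

vocabulary_size = 5
embedding_dim = 4
# Hypothetical pretrained vectors; real code would load a word2vec/GloVe model.
pretrained = {"live": np.ones(embedding_dim), "ohio": np.full(embedding_dim, 2.0)}
word_index = {"live": 1, "ohio": 2, "avccc": 3}  # tokenizer-style word -> index map

embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word, index in word_index.items():
    if index > vocabulary_size - 1:
        break
    if word in pretrained:
        embedding_matrix[index] = pretrained[word]

print(embedding_matrix[3])  # OOV word 'avccc' keeps its all-zero row
```

In Keras the resulting matrix is typically handed to the `Embedding` layer via `weights=[embedding_matrix]` with `trainable=False`, so the pretrained vectors stay frozen during training.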
