Implementing word2vec, but I get the error "word 'car_NOUN' not in vocabulary" even though it is in the vocabulary



I wrote the code below to implement word2vec on my data. Now I am testing it by looking up the embedding for w2v_model.wv['car_NOUN'], but I get the error "word 'car_NOUN' not in vocabulary". I am sure the word car_NOUN is in the vocabulary, so what is going wrong? Can anyone help me?

About the code: I use spaCy to restrict the words in the tweets to content words, i.e. nouns, verbs, and adjectives. The words are converted to lowercase and the POS tag is appended with an underscore, for example: love_VERB. Then I tried to train word2vec on the new list, but I ran into that error.

love_VERB old-fashioneds_NOUN

KeyError                                  Traceback (most recent call last)
<ipython-input-145-f6fb9c62175c> in <module>()
----> 1 w2v_model.wv['car_NOUN']
/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
450             return result
451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
453 
454     def get_vector(self, word):
KeyError: "word 'car_NOUN' not in vocabulary"
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')

from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()

import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)  # nrows: max number of rows to read
documents = df.text.values.tolist()
print(documents[:4])

import spacy
nlp = spacy.load('en_core_web_sm')  # you can use other models
# tags to keep (content words)
included_tags = {"NOUN", "VERB", "ADJ"}
#document = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
sentences = documents[:103]  # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ in included_tags:
            new_sentence.append(token.text.lower() + '_' + token.pos_)
    new_sentences.append(" ".join(new_sentence))

def convert(new_sentences):
    return ' '.join(new_sentences).split()

x = convert(new_sentences)

from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

# initialize model
w2v_model = Word2Vec(size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5,
                     min_count=100,
                     workers=-1,
                     hs=0)
w2v_model.build_vocab(x)
w2v_model.train(x,
                total_examples=w2v_model.corpus_count,
                epochs=w2v_model.epochs)

w2v_model.wv['car_NOUN']
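
Incidentally, printing what build_vocab actually indexed makes the failure visible: because x is a flat list of strings, gensim treats each string as a sentence and iterates it character by character, so the vocabulary ends up holding single characters rather than tokens. A small diagnostic sketch, assuming gensim 3.x:

print(len(w2v_model.wv.vocab))            # far fewer entries than expected
print(sorted(w2v_model.wv.vocab)[:20])    # mostly single characters
print('car_NOUN' in w2v_model.wv.vocab)   # False, hence the KeyError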

Your convert function has a bug: you should pass Word2Vec a list of lists, i.e. a list in which each sentence is itself a list of tokens. I have changed that for you. Basically, you want to go from something like this

['prices_NOUN',
'change_VERB',
'want_VERB',
'research_VERB',
'price_NOUN',
'many_ADJ',
'different_ADJ',
'sites_NOUN',
'found_VERB',
'cheaper_ADJ',]

to something like this

[['prices_NOUN',
  'change_VERB',
  'want_VERB'],
 ['research_VERB',
  'price_NOUN',
  'many_ADJ'],
 ['different_ADJ',
  'sites_NOUN',
  'found_VERB',
  'cheaper_ADJ']]
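
A minimal, self-contained illustration of the required shape, using a hypothetical two-sentence corpus (assuming gensim 3.x, where the dimensionality argument is called size):

from gensim.models import Word2Vec
# Word2Vec expects an iterable of sentences, each a list of token strings
corpus = [['prices_NOUN', 'change_VERB', 'want_VERB'],
          ['research_VERB', 'price_NOUN', 'many_ADJ']]
toy_model = Word2Vec(corpus, size=10, window=2, min_count=1)
print(toy_model.wv['prices_NOUN'].shape)  # (10,)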

I also made a few changes to the code that trains the model to get it working for me; you may want to try them.

! pip install wget
from gensim.models.word2vec import FAST_VERSION
from gensim.models import Word2Vec
import spacy
import pandas as pd
from zipfile import ZipFile
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()
# nrows: max number of rows to read
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.values.tolist()
nlp = spacy.load('en_core_web_sm')  # you can use other models
# tags to keep (content words)
included_tags = {"NOUN", "VERB", "ADJ"}
sentences = documents[:103]  # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ in included_tags:
            new_sentence.append(token.text.lower() + '_' + token.pos_)
    new_sentences.append(new_sentence)  # append the token list itself, not a joined string

# initialize and train the model in one call, on the list of token lists
w2v_model = Word2Vec(new_sentences,
                     size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5,
                     min_count=1,  # <-- it seems your min_count was too high
                     workers=-1,   # note: gensim normally expects a positive worker count here
                     hs=0)
w2v_model.wv['car_NOUN']

returns

array([ 3.4433445e-03, -4.6847924e-03, -4.6468928e-04, -4.1419661e-04,
1.6716495e-03, -1.3368594e-03,  2.3602389e-03, -3.5505681e-03,
-2.6509305e-04,  5.3194270e-04,  2.3251947e-03,  2.1161686e-03,
3.8566503e-03, -1.0463649e-03, -3.4403126e-04, -2.3808836e-03,
-1.7489052e-03, -3.6803843e-03, -5.5171514e-04, -4.3218122e-03,
3.2187223e-03, -1.4893038e-04, -4.7250376e-03, -3.9506676e-03,
4.9547744e-03,  6.8341813e-04, -1.7588978e-03,  2.9804371e-03,
1.4809771e-03,  3.8084502e-03,  3.7447066e-05, -2.6706287e-03,
-8.4727036e-04, -4.8435321e-03, -4.4348584e-03, -3.9350889e-03,
4.1925525e-03, -2.7435150e-03,  2.5154117e-03, -4.5825918e-03,
-3.8889556e-03,  4.0331958e-03, -5.7232054e-04,  1.7530264e-03,
3.8368679e-03, -3.4817799e-03,  2.4366400e-03, -3.7075430e-03,
-1.2156683e-03,  4.4666473e-03,  1.7927163e-05, -3.2169635e-03,
1.9718746e-03, -3.0671202e-03, -8.5452310e-04, -2.9490239e-03,
-4.1346985e-04,  8.5071824e-04,  4.4970238e-03, -2.8501134e-03,
4.4103153e-03,  1.4589783e-03,  3.6588225e-03, -1.4809598e-03,
-9.8118311e-05,  2.4781735e-03, -2.4647343e-03,  2.2115968e-03,
3.1630241e-03, -1.5672935e-04,  1.6695650e-03,  3.5689210e-03,
-2.6638571e-03,  3.4224256e-03, -1.5750986e-03,  3.6926002e-03,
3.2584099e-03,  3.8033908e-03,  1.5272110e-04, -2.2282582e-03,
-4.7118403e-04, -2.5838052e-03, -2.8910220e-03, -3.1307489e-03,
-4.0518055e-03, -2.3207215e-03,  1.2772443e-03, -4.4162138e-03,
-1.9835744e-03,  3.0219899e-03,  1.7312685e-03,  3.9408603e-03,
-5.6407665e-04,  3.2022693e-03, -8.9243404e-04,  4.5719477e-03,
4.7199172e-03, -4.9393933e-05,  2.2010114e-03, -3.4861618e-03],
dtype=float32)
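
With the model trained on the list-of-lists input, you can sanity-check the vocabulary and query neighbours; a short sketch using standard gensim 3.x KeyedVectors calls:

print('car_NOUN' in w2v_model.wv.vocab)       # True now that tokens are indexed
print(len(w2v_model.wv.vocab))                # number of distinct tokens kept
print(w2v_model.wv.most_similar('car_NOUN'))  # nearest neighbours by cosine similarity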
