Extracting word embeddings for NLP models from a tokenized string



I am using a huggingface pipeline to extract word embeddings from sentences. As far as I know, a sentence is first converted into a tokenized string, and the length of that tokenized string may not equal the number of words in the original sentence. I need to retrieve the embedding of a specific word in a sentence.

For example, here is my code:
# https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6
from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model_pipeline = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)

def find_wordNo_sentence(word, sentence):
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)
    for i, w in enumerate(splitted_sen):
        if word == w:
            return i  # 0-based index
    print("not found")

def return_xlnet_embedding(word, sentence):
    # replace non-word characters with spaces, then collapse whitespace
    word = re.sub(r'[^\w]', ' ', word)
    word = " ".join(word.split())

    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())

    id_word = find_wordNo_sentence(word, sentence)

    try:
        data = model_pipeline(sentence)

        n_words = len(sentence.split(" "))
        n_embs = len(data[0])
        print(n_embs, n_words)
        print(len(data[0]))

        if n_words != n_embs:
            print("There is an extra tokenized word")

        results = data[0][id_word]
        return np.array(results)
    except:
        return "word not found"

return_xlnet_embedding('your', "what is your name?")

Then the output is:

what is your name
['what', 'is', 'your', 'name']
6 4
6

So the tokenized string passed to the pipeline is two tokens longer than my number of words. How can I find out which of these 6 vectors is the embedding of my word?

As you may know, a huggingface tokenizer splits text into frequent subwords as well as whole words. So if you want to extract the embedding of a word, you should consider that it may consist of more than one vector! In addition, the huggingface pipeline encodes the input sentence as a first step, and this adds special tokens at the beginning and at the actual end of the sentence. For example, with a BERT-based pipeline object:

string = 'This is a test for clarification'
print(pipeline.tokenizer.tokenize(string))
print(pipeline.tokenizer.encode(string))

Output:

['this', 'is', 'a', 'test', 'for', 'cl', '##ari', '##fication']
[101, 2023, 2003, 1037, 3231, 2005, 18856, 8486, 10803, 102]
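Building on this, a fast tokenizer exposes word_ids(), which maps each token position back to the word it came from (None for special tokens). That lets you locate exactly which vectors belong to your word and pool them. Below is a minimal sketch; the model name (bert-base-uncased) and the mean-pooling choice are my assumptions, not part of the original question:

```python
# Sketch: map a word to its token embedding(s) via a fast tokenizer's word_ids().
# Assumption: bert-base-uncased; the question used xlnet-base-cased.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "what is your name"
word = "your"

enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

# Note: this whitespace split matches the tokenizer's word segmentation here,
# but punctuation-heavy input may need a more careful word index.
word_index = sentence.split().index(word)

# word_ids()[i] is the word index of token i, or None for [CLS]/[SEP]
token_positions = [i for i, wid in enumerate(enc.word_ids()) if wid == word_index]

# Mean-pool the subword vectors into a single word embedding
word_emb = hidden[token_positions].mean(dim=0)
print(word_emb.shape)  # torch.Size([768]) for bert-base-uncased
```

If your word is split into several subwords (like "clarification" above), token_positions will contain several indices and the mean gives one fixed-size vector per word; summing or taking the first subword are common alternatives.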
