BERT model: "enable_padding() got an unexpected keyword argument 'max_length'"



I am trying to implement the BERT model architecture using Hugging Face and Keras. I am following a Kaggle notebook (link) and trying to understand it. When I tokenize my data, I run into a problem and get an error message. The error message is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-888a40c0160b> in <module>
----> 1 x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
2 x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
3 x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN )
4 y_train = train1.toxic.values
5 y_valid = valid.toxic.values
<ipython-input-8-de591bf0a0b9> in fast_encode(texts, tokenizer, chunk_size, maxlen)
4     """
5     tokenizer.enable_truncation(max_length=maxlen)
----> 6     tokenizer.enable_padding(max_length=maxlen)
7     all_ids = []
8 
TypeError: enable_padding() got an unexpected keyword argument 'max_length'

The code is:

x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=192)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=192)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=192 )
y_train = train1.toxic.values
y_valid = valid.toxic.values

The fast_encode function is defined here:

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []

    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)

What should I do now?

The tokenizer used here is not the regular tokenizer; it is a fast tokenizer provided by an older version of the Hugging Face tokenizers library.

If you want to create the fast tokenizer with the older version of Hugging Face transformers used in that notebook, you can do it like this:

import transformers
from tokenizers import BertWordPieceTokenizer

# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer
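
As a quick sanity check (a minimal sketch; the sample sentence is arbitrary), you can encode a single string with the reconstructed tokenizer and inspect the result:

# Minimal sanity check of the reconstructed fast tokenizer (the sample text is arbitrary)
enc = fast_tokenizer.encode("Hello world")
print(enc.tokens)  # WordPiece tokens, wrapped in [CLS] ... [SEP]
print(enc.ids)     # the corresponding vocabulary ids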

However, since I wrote that code, working with fast tokenizers has become much simpler. If you look at Hugging Face's Preprocessing data tutorial, you will notice that all you need to do is:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
batch_sentences = [
    "Hello world",
    "Some slightly longer sentence to trigger padding"
]
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")

This is because the fast tokenizers (written in Rust) are now used automatically whenever they are available.
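
With that API, the whole fast_encode helper can be replaced by a single tokenizer call. A minimal sketch, assuming the same data frames and MAX_LEN = 192 as in the question:

import numpy as np
from transformers import AutoTokenizer

MAX_LEN = 192
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased')

def encode(texts, maxlen=MAX_LEN):
    # Pad every sequence to maxlen and truncate anything longer
    enc = tokenizer(list(texts), padding='max_length', truncation=True, max_length=maxlen)
    return np.array(enc['input_ids'])

# x_train = encode(train1.comment_text.astype(str))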

Use length instead of max_length in the enable_padding function.

Current: tokenizer.enable_padding(max_length=maxlen)

Change it to: tokenizer.enable_padding(length=maxlen)

This will work.
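
Put together, a corrected fast_encode (a sketch of the question's function with only that keyword changed, plus the imports it relies on) would look like:

import numpy as np
from tqdm import tqdm

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(length=maxlen)  # newer tokenizers versions take `length`, not `max_length`
    all_ids = []

    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)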
