I am trying to implement the BERT model architecture using Hugging Face and Keras. I am learning from a Kaggle notebook (link) and trying to understand it. When I tokenize my data, I run into a problem and get an error message. The error message is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-888a40c0160b> in <module>
----> 1 x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
2 x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
3 x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN )
4 y_train = train1.toxic.values
5 y_valid = valid.toxic.values
<ipython-input-8-de591bf0a0b9> in fast_encode(texts, tokenizer, chunk_size, maxlen)
4 """
5 tokenizer.enable_truncation(max_length=maxlen)
----> 6 tokenizer.enable_padding(max_length=maxlen)
7 all_ids = []
8
TypeError: enable_padding() got an unexpected keyword argument 'max_length'
The code is:
x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=192)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=192)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=192 )
y_train = train1.toxic.values
y_valid = valid.toxic.values
The function fast_encode is defined here:
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    return np.array(all_ids)
What should I do now?
The tokenizer used here is not a regular tokenizer, but a fast tokenizer provided by an older version of the Hugging Face tokenizers library.
If you want to create a fast tokenizer with the older version of Hugging Face transformers used in that notebook, you can do it like this:
import transformers
from tokenizers import BertWordPieceTokenizer
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer
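As a quick sanity check, a tokenizer created this way accepts the truncation/padding settings that fast_encode needs. Here is a minimal sketch, assuming the tokenizers library is installed; the tiny_vocab.txt built below is a made-up miniature vocabulary just to exercise the API offline (a real run would use the vocab.txt saved above):

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Hypothetical miniature vocabulary: [PAD] must sit at id 0, the default pad_id.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]
Path("tiny_vocab.txt").write_text("\n".join(vocab))

fast_tokenizer = BertWordPieceTokenizer("tiny_vocab.txt", lowercase=False)
fast_tokenizer.enable_truncation(max_length=8)
fast_tokenizer.enable_padding(length=8)  # note: the keyword is 'length', not 'max_length'

enc = fast_tokenizer.encode("hello world")
print(enc.ids)  # [CLS] hello world [SEP] followed by [PAD] ids up to length 8
```

With padding enabled, every encoding comes back at exactly the requested length, which is what lets fast_encode stack the id lists into a rectangular numpy array.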
However, since I wrote this code, using fast tokenizers has become much simpler. If you look at Hugging Face's Preprocessing data tutorial, you will notice that you only need to do:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
batch_sentences = [
"Hello world",
"Some slightly longer sentence to trigger padding"
]
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
This is because a fast tokenizer (written in Rust) is now used automatically whenever one is available.
Use length instead of max_length in the enable_padding call.
Current: tokenizer.enable_padding(max_length=maxlen)
Change to: tokenizer.enable_padding(length=maxlen)
This will work.