This may be a silly question, but I'm just getting started with tf. I have the following code, but the tokenizer won't accept the strings inside the tensor.
import tensorflow as tf
docs = tf.data.Dataset.from_tensor_slices([['hagamos que esto funcione.'], ["por fin funciona!"]])
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np
checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize(review):
    return tokenizer(review)
tokens = docs.map(tokenize)
I get the following output:
ValueError: in user code:
File "<ipython-input-54-3272cedfdcab>", line 13, in tokenize *
return tokenizer(review)
File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2429, in __call__ *
raise ValueError(
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
Whereas the output I expect looks like this:
tokenizer('esto al fin funciona!')
{'input_ids': [4, 1202, 1074, 1346, 4971, 1109, 5], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Do you know how to make this work?
As the error says, you must pass the input to the tokenizer as a `str`, `List[str]`, or `List[List[str]]` — not as a tf.data.Dataset.
Please check the working code below.
from transformers import AutoTokenizer

# a plain Python list of strings, not a tf.data.Dataset
docs = ['hagamos que esto funcione.', "por fin funciona!"]

checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokens = tokenizer(docs)
The output of the above code is:
{'input_ids': [[4, 8700, 1041, 1202, 13460, 1008, 5], [4, 1076, 1346, 4971, 1109, 5]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
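If you do want to keep your texts in a tf.data.Dataset, one option (a minimal sketch, not the only way) is to pull the elements back out as Python strings with `as_numpy_iterator()` before calling the tokenizer. TensorFlow stores strings as bytes, so they need to be decoded first:

```python
import tensorflow as tf

# a dataset of scalar string tensors (note: a flat list, not nested lists)
docs = tf.data.Dataset.from_tensor_slices(
    ['hagamos que esto funcione.', "por fin funciona!"])

# tf yields byte strings, so decode them back into Python str
texts = [t.decode("utf-8") for t in docs.as_numpy_iterator()]
print(texts)  # a plain List[str]
```

The resulting `texts` list is a `List[str]` and can be passed straight to `tokenizer(texts)` as in the working code above.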