I am trying to load a custom dataset and then use it for language modeling. The dataset consists of a text file that has a whole document on each line, meaning that every line exceeds the usual 512-token limit of most tokenizers.
I would like to understand the process of building a text dataset that tokenizes each line. Previously, I had split the documents in the dataset into lines of a "tokenizable" size, the way the old TextDataset class did: you only had to do the following, and you could pass the tokenized dataset to a DataCollator with no text lost:
from transformers import AutoTokenizer, TextDataset

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)
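For reference, here is a minimal sketch of how such a TextDataset was typically paired with a data collator for masked language modeling; the mlm_probability value is just the common default, used here only for illustration:

from transformers import DataCollatorForLanguageModeling

# Pads batches and applies random masking for MLM; mlm_probability=0.15
# is the usual default, shown here purely as an example.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)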
I want to use the datasets library rather than this soon-to-be-deprecated approach. What I have at the moment is below; this, of course, raises an error, because every line is longer than the tokenizer's maximum block size:
import datasets
from transformers import AutoTokenizer

# Load the raw text file as a dataset (one example per line).
dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
So what is the "standard" way to create a dataset the way I did before, but using the datasets library?
Thanks a lot for your help :(
I received an answer to this question from @lhoestq on the HuggingFace Datasets forum:
Hi!
If you want to tokenize line by line, you can use this:
max_seq_length = 512
num_proc = 4

def tokenize_function(examples):
    # Remove empty lines
    examples["text"] = [
        line for line in examples["text"] if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=["text"],
)
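With truncation enabled, anything beyond max_seq_length on each line is discarded. A quick sanity check of the resulting lengths (assuming the default "train" split that load_dataset creates for local text files) might look like:

# Every example is now at most max_seq_length tokens; tokens beyond
# that limit on each line were dropped by truncation.
lengths = [len(ids) for ids in tokenized_dataset["train"]["input_ids"]]
print(max(lengths))  # <= 512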
Note, though, that TextDataset worked by concatenating all the texts and building blocks of size 512. If you need that behavior, you have to apply an additional map function after tokenization:
# Main data processing function that will concatenate all texts from
# our dataset and generate chunks of max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop,
    # you can customize this part to your needs.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# Note that with `batched=True`, this map processes 1,000 texts together,
# so group_texts throws away a remainder for each of those groups of 1,000 texts.
# You can adjust that batch_size here but a higher value might be slower to preprocess.
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)
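To make the chunking concrete, here is a small toy illustration (not from the original answer) that reuses the group_texts defined above, with max_seq_length temporarily lowered to 4:

# Toy demo only: ten tokens are concatenated, the remainder of two
# tokens is dropped, and two chunks of four tokens remain.
max_seq_length = 4  # lowered just for this demo; restore 512 before real use
examples = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
print(group_texts(examples))
# {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}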
This code comes from the data processing in the run_mlm.py example script from transformers.
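To close the loop with the original goal of feeding the tokenized dataset to a DataCollator, a minimal sketch of the downstream training setup could look like this; it assumes the chunked dataset has a "train" split, and the TrainingArguments values are placeholders, not recommendations:

from transformers import (
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Sketch of the downstream MLM training setup; hyperparameters are
# placeholders chosen only to make the example self-contained.
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
)
trainer.train()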