I have a labeled dataset in a pandas DataFrame.
>>> df.dtypes
title          object
headline       object
byline         object
dateline       object
text           object
copyright    category
country      category
industry     category
topic        category
file           object
dtype: object
I am building a model that predicts topic from text. While text is one large string, topic is a list of strings. For example:
>>> df['topic'].head(5)
0 ['ECONOMIC PERFORMANCE', 'ECONOMICS', 'EQUITY ...
1 ['CAPACITY/FACILITIES', 'CORPORATE/INDUSTRIAL']
2 ['PERFORMANCE', 'ACCOUNTS/EARNINGS', 'CORPORAT...
3 ['PERFORMANCE', 'ACCOUNTS/EARNINGS', 'CORPORAT...
4 ['STRATEGY/PLANS', 'NEW PRODUCTS/SERVICES', 'C...
Before I can feed this into a model I have to tokenize the whole DataFrame, but when I run it through the transformers AutoTokenizer I get an error.
import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.model_selection import train_test_split
def preprocess_text(df):
    # Remove punctuation and numbers
    df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)
    # Remove single characters surrounded by whitespace
    df['text'] = df['text'].str.replace(r"\s+[a-zA-Z]\s+", ' ', regex=True)
    # Collapse multiple spaces into one
    df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)
    # Remove NaNs
    df['text'] = df['text'].fillna('')
    df['topic'] = df['topic'].cat.add_categories('').fillna('')
    return df
# Load tokenizer and logger
tf.get_logger().setLevel('ERROR')
tokenizer = AutoTokenizer.from_pretrained('roberta-large')
# Load dataframe with just text and topic columns
# Only loading first 100 rows for testing purposes
df = pd.DataFrame()
for chunk in pd.read_csv(r'Reuterstest.csv', sep='|', chunksize=100,
                         dtype={'topic': 'category', 'country': 'category',
                                'industry': 'category', 'copyright': 'category'}):
    df = chunk
    break
df = preprocess_text(df)
# Split dataset into train, val, test (roughly 72/13/15 with these settings)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15)
# Tokenize datasets
train = tokenizer(train, return_tensors='tf', truncation=True, padding=True, max_length=128)
val = tokenizer(val, return_tensors='tf', truncation=True, padding=True, max_length=128)
test = tokenizer(test, return_tensors='tf', truncation=True, padding=True, max_length=128)
I get this error:
AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
on the line that tokenizes train (the first tokenizer call).
Does this mean I have to turn my df into a list?
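For reference, the three input shapes named in the error message can be mirrored in a small check. The helper below is hypothetical (it is not part of transformers, and the real validation may differ in detail); it is only a sketch of why a DataFrame is rejected while a list of strings is accepted.

```python
def is_valid_tokenizer_input(obj):
    """Hypothetical helper mirroring the shapes listed in the error message."""
    if isinstance(obj, str):
        return True  # single example
    if isinstance(obj, (list, tuple)):
        if all(isinstance(x, str) for x in obj):
            return True  # batch, or a single pretokenized example
        # batch of pretokenized examples: every item is a list of strings
        return all(
            isinstance(x, (list, tuple)) and all(isinstance(t, str) for t in x)
            for x in obj
        )
    return False  # anything else, e.g. a DataFrame or a dict


texts = ["first document", "second document"]
print(is_valid_tokenizer_input(texts))            # a plain list of strings
print(is_valid_tokenizer_input({"text": texts}))  # a dict stands in for a table-like object
```

A DataFrame is neither a str nor a list, so it falls into the last branch; converting the text column to a list of strings gives a shape the tokenizer accepts.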
In short, yes. You also don't want to tokenize the whole DataFrame, only a numpy array of the text column. The missing steps are shown below.
# Create positional indices for each split
train_idx = list(range(len(train)))
test_idx = list(range(len(test)))
val_idx = list(range(len(val)))
# Convert to numpy
x_train = train['text'].values[train_idx]
x_test = test['text'].values[test_idx]
x_val = val['text'].values[val_idx]
# 'topic_encoded' is assumed to be a label column created beforehand
y_train = train['topic_encoded'].values[train_idx]
y_test = test['topic_encoded'].values[test_idx]
y_val = val['topic_encoded'].values[val_idx]
# Tokenize datasets
tr_tok = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tok = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tok = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)
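The code above reads a topic_encoded column that the question never creates. One possible way to build such labels is sketched below, assuming each topic cell is the string form of a Python list (as the df['topic'].head(5) output suggests). The helper names parse_topics and multi_hot are made up for this illustration, the sample rows are shortened stand-ins rather than real data, and plain Python is used only to keep the sketch dependency-free.

```python
import ast


def parse_topics(cell):
    """Turn a cell such as "['A', 'B']" into ['A', 'B']; empty cells give []."""
    if not cell:
        return []
    return list(ast.literal_eval(cell))


def multi_hot(topic_lists):
    """Return (classes, matrix); matrix[i][j] is 1 if row i carries classes[j]."""
    classes = sorted({t for topics in topic_lists for t in topics})
    index = {name: j for j, name in enumerate(classes)}
    matrix = []
    for topics in topic_lists:
        row = [0] * len(classes)
        for t in topics:
            row[index[t]] = 1
        matrix.append(row)
    return classes, matrix


# Illustrative cells only; the last one is the '' that preprocess_text fills in
raw = [
    "['PERFORMANCE', 'ACCOUNTS/EARNINGS']",
    "['CAPACITY/FACILITIES', 'CORPORATE/INDUSTRIAL']",
    "",
]
classes, encoded = multi_hot([parse_topics(c) for c in raw])
print(classes)
print(encoded)
```

With pandas, the parsed lists could come from df['topic'].astype(str).map(parse_topics). scikit-learn's MultiLabelBinarizer performs the same kind of multi-hot encoding and is probably the more convenient choice in a real pipeline.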