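I want to build an sklearn pipeline using the Keras sklearn wrapper. I am working on a sentiment classification task with aclImdb (also known as the Large Movie Review Dataset), which I have converted into a two-column pandas DataFrame: one column for the review (string) and one for the label (integer).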
> df.head(4)
review sentiment
0 "Lifeforce" is a truly bizarre adaptation of t... 1
1 I ordered this movie on the Internet as it is ... 0
2 he was my hero for all time until he went alon... 0
3 This is a 'sleeper'. It defines Nicholas Cage.... 1
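I have a pipeline that tokenizes the reviews with CountVectorizer, applies a tf-idf transform with TfidfTransformer, and then fits a binary classification model with KerasClassifier and the model function below: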
# Imports assumed by this snippet (Keras 2.0.x-era module paths)
import numpy as np
from keras import callbacks, layers, models
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn import pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X_train = df.loc[1:25000, "review"]
y_train = df.loc[1:25000, 'sentiment'].values
X_test = df.loc[25000:, "review"]
y_test = df.loc[25000:, 'sentiment'].values
np.random.seed(123)  # for reproducibility

def model():
    model = models.Sequential([
        layers.Dense(16, input_shape=(10**4,), activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

early_stopping = callbacks.EarlyStopping(monitor='val_loss', patience=1, verbose=0, mode='auto')
pipe = pipeline.Pipeline([
    ('vect', CountVectorizer(max_features=10**4)),
    ('tfidf', TfidfTransformer()),
    ('nn', KerasClassifier(build_fn=model,
                           nb_epoch=10, batch_size=128,
                           validation_split=0.2, callbacks=[early_stopping]))
])
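To make this work I have to specify input_shape for the Keras model, which means I have to fix the value of max_features for CountVectorizer. I would rather not do that.
Is there a way to get the dimensionality of the output of the previous pipeline stage (here, TfidfTransformer) and pass it to KerasClassifier? For example, something like this: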
def model(input_df):
    model = models.Sequential([
        layers.Dense(16, input_shape = input_df.shape, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

early_stopping = callbacks.EarlyStopping(monitor='val_loss', patience=1, verbose=0, mode='auto')
pipe = pipeline.Pipeline([
    # ('vect', CountVectorizer(max_features=10**4)),
    # ('tfidf', TfidfTransformer()),
    ('tfidf', TfidfVectorizer(max_features=10**4)),
    ('nn', KerasClassifier(build_fn=model(input_df=tfidf),
                           nb_epoch=10, batch_size=128,
                           validation_split=0.2, callbacks=[early_stopping]))
])
## train network pipeline
pipe.fit(X_train.values, y_train)
-------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-21be14eb185d> in <module>()
19 # ('tfidf', TfidfTransformer()),
20 ('tfidf', TfidfVectorizer(max_features=10**4)),
---> 21 ('nn', KerasClassifier(build_fn=model(input_df=tfidf),
22 nb_epoch=10, batch_size=128,
23 validation_split=0.2, callbacks=[early_stopping]))
NameError: name 'tfidf' is not defined
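I could split the pipeline into two steps and save the output of the two transformers, where I could easily capture the shape, but I would rather do it all in one go.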
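For reference, the two-step workaround mentioned above might look like the sketch below. It uses a tiny made-up corpus in place of the aclImdb reviews (an assumption for illustration only) and fits the vectorizer on its own, so the feature count can be read from the transformed matrix before any model is built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus (an assumption; the question uses the aclImdb reviews).
reviews = [
    "a truly bizarre adaptation",
    "i ordered this movie on the internet",
    "he was my hero for all time",
    "this is a sleeper",
]

# Step 1: fit the vectorizer alone, without fixing max_features up front.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(reviews)  # sparse matrix, shape (n_docs, n_terms)

# Step 2: the second dimension is the input width a downstream model would need.
n_features = X_tfidf.shape[1]
print(n_features, len(vectorizer.vocabulary_))
```

The drawback, as noted, is that the vectorizer and the model are no longer a single estimator, so they cannot be cross-validated or grid-searched together.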
System info:
print(platform.platform())
print("Python", sys.version)
print("NumPy", np.__version__)
print("SciPy", scipy.__version__)
print("Scikit-Learn", sklearn.__version__)
print("Keras Backend", os.getenv("KERAS_BACKEND")) # doesn't work with tf https://github.com/fchollet/keras/issues/4984
Linux-4.4.0-91-generic-x86_64-with-debian-stretch-sid
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.0
Keras Backend cntk
Thanks!
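To solve this, you have to:
- remove input_shape from the model
- define a custom ArrayTransformer for the sklearn pipeline
- insert this new transformer between the tfidf/count vectorizer and the Keras model
In Keras versions that accept a Sequential model without input_shape, the weights are created on the first call to fit, so the input width is inferred from the data. The ArrayTransformer converts the vectorizer's sparse output into a dense array that Keras can consume.
In your code: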
def model():
    model = models.Sequential([
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

class ArrayTransformer():
    def transform(self, X, **transform_params):
        return X.toarray()

    def fit(self, X, y=None, **fit_params):
        return self

early_stopping = callbacks.EarlyStopping(monitor='val_loss', patience=1, verbose=0,
                                         mode='auto')
pipe = pipeline.Pipeline([
    ('tfidf', TfidfVectorizer(max_features=XXX)),
    ('transformer', ArrayTransformer()),
    ('nn', KerasClassifier(build_fn=model,
                           nb_epoch=10, batch_size=128,
                           validation_split=0.2, callbacks=[early_stopping]))
])
pipe.fit(X_train.values, y_train)
This way you can also combine the tfidf/count vectorizer with GridSearchCV and tune min_df, max_features, ...
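As a rough illustration of that last point, here is a minimal sketch of a grid search over the vectorizer settings with a densifying step in the pipeline. Several things are assumptions and not part of the original answer: the toy corpus, the DenseTransformer name (a variant of ArrayTransformer that inherits from BaseEstimator and TransformerMixin), and LogisticRegression as a lightweight stand-in for the KerasClassifier step so the example runs without Keras. LogisticRegression accepts sparse input, so the dense step is kept only to mirror the Keras setup.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: turn a scipy sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return X.toarray()


# Toy data (an assumption); every document contains "movie" so that
# min_df=2 never prunes the vocabulary down to nothing.
reviews = [
    "great movie loved the acting",
    "terrible movie hated the plot",
    "wonderful movie great cast",
    "boring movie terrible pacing",
    "loved the movie great fun",
    "hated the movie boring script",
    "great fun wonderful movie",
    "terrible boring awful movie",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("to_dense", DenseTransformer()),
    ("clf", LogisticRegression()),  # stand-in for the KerasClassifier step
])

# The feature count changes with each setting; no step hard-codes it.
param_grid = {
    "tfidf__min_df": [1, 2],
    "tfidf__max_features": [5, None],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(reviews, labels)
print(search.best_params_)
```

Because no step fixes the input width, each candidate in the grid can produce a different number of features and the final estimator simply adapts to whatever it receives at fit time.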