TensorFlow Dataset API: padded batching with NumPy arrays



Quick disclaimer: this is the first question I'm actively asking here on Stack Overflow.

Now on to the problem itself. I'm running into some issues with the fairly new Dataset API of TensorFlow 1.4 when reading variable-length inputs from NumPy arrays and batching them with padding.

According to the official documentation (https://www.tensorflow.org/programmers_guide/datasets#consuming_numpy_arrays), using arrays as input is both supported and straightforward. The crux is that the data has to be fed into a TensorFlow placeholder before the Dataset object's padded_batch method can be applied to it. However, the NumPy representation of variable-length inputs is not rectangular and is therefore interpreted as a sequence rather than an array. But isn't handling a series of variable-length inputs the whole point of providing a padded_batch method in the first place? Long story short: has anyone run into a similar situation and found a solution? Any help is greatly appreciated!

Here are some code snippets that may help to understand the problem better.

The input looks like this:

array([array([65,  3, 96, 94], dtype=int32), array([88], dtype=int32),
       array([113,  52, 106,  57,   3,  86], dtype=int32),
       array([88,  3, 23, 91], dtype=int32), ... ])
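To make the failure mode concrete, here is a minimal NumPy-only sketch (using values from the sample input above) of what happens when such a ragged object array is coerced into a rectangular int32 array, which is exactly what feeding an int32 [None, None] placeholder attempts:

```python
import numpy as np

# Rows of different lengths: NumPy stores them as an object array of
# 1-D int32 arrays, not as a rectangular 2-D int32 array.
ragged = np.array([np.array([65, 3, 96, 94], dtype=np.int32),
                   np.array([88], dtype=np.int32)], dtype=object)
print(ragged.dtype)  # object, not int32

# Feeding the placeholder triggers a conversion along the lines of
# np.asarray(value, dtype=np.int32), which fails for ragged input.
try:
    np.asarray(ragged, dtype=np.int32)
except ValueError as err:
    print(err)  # "setting an array element with a sequence. ..."
```

This is the same ValueError that shows up at the bottom of the traceback below.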

The actual code snippet where the padded dataset is defined:


for fold, (train_idx, dev_idx) in enumerate(sss.split(X, y)):
    X_train = X[train_idx]
    y_train = y[train_idx]
    X_dev = X[dev_idx]
    y_dev = y[dev_idx]
    tf.reset_default_graph()
    with tf.Session() as sess:
        features_placeholder = tf.placeholder(tf.int32, [None, None], name='input_x')
        labels_placeholder = tf.placeholder(tf.int32, [None, num_classes], name='input_y')
        dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
        dataset = dataset.shuffle(buffer_size=len(train_idx))
        dataset = dataset.padded_batch(batch_size, padded_shapes=([None], [None]), padding_values=(1, 0))
        iterator = dataset.make_initializable_iterator()
        next_element = iterator.get_next()
        sess.run(iterator.initializer, feed_dict={features_placeholder: np.array(X_train),
                                                  labels_placeholder: np.array(y_train)})

And finally, the corresponding stack trace from a Jupyter notebook:


ValueError                                Traceback (most recent call last)
in ()
----> 1 cnn.train2(X_idx, y_bin, n_splits=5)

in train2(self, X, y, n_splits)
    480
    481         self.session.run(iterator.initializer, feed_dict={features_placeholder: np.array(X_train),
--> 482                                                           labels_placeholder: np.array(y_train)})
    483         # self.session.run(iterator.initializer)
    484

~/.virtualenvs/ravenclaw/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    887     try:
    888       result = self._run(None, fetches, feed_dict, options_ptr,
--> 889                          run_metadata_ptr)
    890       if run_metadata:
    891         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/.virtualenvs/ravenclaw/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1087             feed_handles[subfeed_t] = subfeed_val
   1088           else:
-> 1089             np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
   1090
   1091           if (not is_tensor_handle_feed and

~/.virtualenvs/ravenclaw/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    490
    491     """
--> 492     return array(a, dtype, copy=False, order=order)
    493
    494

ValueError: setting an array element with a sequence.

Thanks again for your support.

I had the same problem until I stumbled across this link from a TensorFlow issue thread. Apparently the workaround for variable-length inputs is to use Dataset.from_generator, as described in the link. The API in question is here. Since I have an input vector as well as a labels vector, I use the zip function to iterate over both inputs and labels:

zipped = list(zip(x_train,y_train))
dataset = tf.data.Dataset.from_generator(lambda: zipped, (tf.int32, tf.int32))
dataset = dataset.padded_batch(batch_size, padded_shapes=([None], [None]), padding_values=(1, 0))
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
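For reference, here is a NumPy-only sketch of the padding that padded_batch performs on each batch under the settings above (features padded to the longest sequence in the batch with padding value 1). The helper name pad_batch is mine, not part of the tf.data API:

```python
import numpy as np

def pad_batch(sequences, labels, pad_value=1):
    # Pad every sequence in the batch to the length of the longest one,
    # mirroring padded_shapes=([None], [None]) with padding_values=(1, 0)
    # for the feature side.
    max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int32)
    for i, s in enumerate(sequences):
        padded[i, :len(s)] = s
    return padded, np.asarray(labels, dtype=np.int32)

x, y = pad_batch([np.array([65, 3, 96, 94]), np.array([88])],
                 [[1, 0], [0, 1]])
print(x)
# [[65  3 96 94]
#  [88  1  1  1]]
```

In other words, once the elements come out of the generator one at a time, padded_batch can assemble them into rectangular batches like this, which is why the from_generator route sidesteps the placeholder coercion error.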

Latest update