Error when using TensorFlow-GPU with Python multiprocessing

I noticed some strange behavior when using TensorFlow-GPU together with Python multiprocessing.

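I have implemented a DCGAN with some customizations and my own dataset. Since I condition the DCGAN on certain features, I have training data as well as test data for evaluation.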

Because of the size of my datasets, I wrote data loaders that run concurrently and preload batches into a queue using Python's multiprocessing.

The structure of the code looks roughly like this:

class ConcurrentLoader:
    def __init__(self, dataset):
        ...
class DCGAN:
    ...
net = DCGAN()
training_data = ConcurrentLoader(path_to_training_data)
test_data = ConcurrentLoader(path_to_test_data)

This code runs fine on TensorFlow-CPU and on TensorFlow-GPU <= 1.3.0 with CUDA 8.0, but when I run the exact same code with TensorFlow-GPU 1.4.1 and CUDA 9 (the latest releases of TF & CUDA as of December 2017), it crashes:

2017-12-20 01:15:39.524761: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.527795: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.529548: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.535341: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-12-20 01:15:39.535383: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-12-20 01:15:39.535397: F tensorflow/core/kernels/conv_ops.cc:667] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
[1]    32299 abort (core dumped)  python dcgan.py --mode train --save_path ~/tf_run_dir/test --epochs 1

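What really confuses me is that the error does not occur if I simply remove test_data. So, for some strange reason, TensorFlow-GPU 1.4.1 & CUDA 9 work with a single ConcurrentLoader but crash when multiple loaders are initialized.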

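Even more interesting is that (after the exception) I have to shut down the Python processes manually, because the GPU's VRAM, the system's RAM, and even the Python processes stay alive after the script crashes.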

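Furthermore, it must have some weird connection to Python's multiprocessing module, because the crash does not occur when I implement the same model in Keras (using the TF backend!). I guess Keras somehow creates a layer of abstraction that keeps TF from crashing.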

Where could I have messed up with the multiprocessing module so that it causes a crash like this?

These are the parts of the code that use multiprocessing inside ConcurrentLoader:

def __init__(self, dataset):
    ...
    self._q = mp.Queue(64)
    self._file_cycler = cycle(img_files)
    self._worker = mp.Process(target=self._worker_func, daemon=True)
    self._worker.start()
def _worker_func(self):
    while True:
        ... # gets next filepaths from self._file_cycler
        buffer = list()
        for im_path in paths:
            ... # uses OpenCV to load each image & puts it into the buffer
        self._q.put(np.array(buffer).astype(np.float32))

...and that's it.
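For anyone who wants to reproduce the loader pattern without the dataset, here is a minimal runnable sketch of the same idea. It is not the original class: `ToyLoader`, `_fake_load`, `next_batch`, `close`, and the batch size are hypothetical names and values, the OpenCV read is replaced by a dummy array, and the "fork" start method is requested explicitly on the assumption that the question was run on Linux, where fork was the default at the time.

```python
import multiprocessing as mp
from itertools import cycle

import numpy as np

# Assumption: Linux with the "fork" start method (POSIX only).
_ctx = mp.get_context("fork")


def _fake_load(path, shape=(4, 4, 3)):
    """Hypothetical stand-in for the OpenCV image read in the question."""
    return np.full(shape, len(path) % 256, dtype=np.uint8)


class ToyLoader:
    """Simplified loader: one daemon worker fills a bounded queue with batches."""

    def __init__(self, img_files, batch_size=2, queue_size=64):
        self._batch_size = batch_size
        self._file_cycler = cycle(img_files)
        self._q = _ctx.Queue(queue_size)
        self._worker = _ctx.Process(target=self._worker_func, daemon=True)
        self._worker.start()

    def _worker_func(self):
        # Runs in the child process; loops until the process is terminated.
        while True:
            paths = [next(self._file_cycler) for _ in range(self._batch_size)]
            buffer = [_fake_load(p) for p in paths]
            self._q.put(np.array(buffer).astype(np.float32))

    def next_batch(self, timeout=10):
        """Block until the worker has produced a batch (or raise queue.Empty)."""
        return self._q.get(timeout=timeout)

    def close(self):
        """Stop the worker explicitly instead of relying on daemon=True."""
        self._worker.terminate()
        self._worker.join()
```

Calling `close()` on every loader before the script exits (for example in a `try`/`finally` around the training loop) makes the shutdown explicit on the normal exit path. Whether this changes the crash described above is not something this sketch can show.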

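Where have I written "unstable" or "non-pythonic" multiprocessing code? I thought daemon=True should ensure that every child process gets killed as soon as the main process dies? Unfortunately, that is not the case for this particular error.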

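Am I misusing the default multiprocessing.Process or multiprocessing.Queue here? I thought that simply writing a class in which I store batches of images in a queue and make them accessible through methods/instance variables would be fine.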
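On the daemon=True point: daemonic children are terminated by an exit handler that multiprocessing runs during normal interpreter shutdown. A hard abort (the log above ends in "abort (core dumped)") bypasses Python's exit handlers, which may be why the workers outlive the crash, although that is not confirmed for this setup. One possible workaround is to let each worker check whether its parent is still there. The sketch below assumes a POSIX system and the "fork" start method; `parent_is_alive`, `worker`, and `start_worker` are hypothetical names, and the batch is a placeholder.

```python
import multiprocessing as mp
import os
import queue

import numpy as np

# Assumption: POSIX system with the "fork" start method.
_ctx = mp.get_context("fork")


def parent_is_alive(parent_pid):
    """True while this process is still a child of parent_pid (POSIX only).

    When the parent dies, the child is re-parented and os.getppid() changes.
    """
    return os.getppid() == parent_pid


def worker(q, parent_pid, poll_seconds=0.5):
    """Produce placeholder batches until the parent process disappears."""
    while parent_is_alive(parent_pid):
        batch = np.zeros((2, 4, 4, 3), dtype=np.float32)  # placeholder batch
        while parent_is_alive(parent_pid):
            try:
                # Time out regularly so a full queue cannot block forever.
                q.put(batch, timeout=poll_seconds)
                break
            except queue.Full:
                continue  # queue still full: re-check the parent and retry
    # Do not wait for buffered items to be flushed when exiting.
    q.cancel_join_thread()


def start_worker(queue_size=4):
    """Start one daemon worker and return its queue and process handle."""
    q = _ctx.Queue(queue_size)
    proc = _ctx.Process(target=worker, args=(q, os.getpid()), daemon=True)
    proc.start()
    return q, proc
```

With this design the worker notices a vanished parent within roughly `poll_seconds`, even when the parent had no chance to clean up. It only addresses the leftover worker processes, not the cuBLAS/cuDNN initialization failure itself.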

I get the same error when trying to use TensorFlow with multiprocessing:

E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

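but in a different environment: TF 1.4, CUDA 8.0, cuDNN 6.0. matrixMulCUBLAS from the sample code works fine. I would also like to know the correct solution! The referenced question "failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED on an AWS p2.xlarge instance" did not work for me.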
