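System specs
- Ubuntu 18.04 server
- Installed GPU: Nvidia P1000
- Cuda version: Cuda 10.1.243
- Tensorflow: tensorflow-gpu==1.15
I've noticed a very strange error: the GPU is only available to Tensorflow in the root process of a Python process tree. If I fork a process with multiprocessing.Process(), the GPU is no longer available in the child.
Sample code: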
import tensorflow as tf
import multiprocessing
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main():
    # First check runs in the parent (root) process.
    logger.info("main(): tf.test.is_gpu_available(): %s", tf.test.is_gpu_available())
    process = multiprocessing.Process(target=run_tensorflow, args=())
    process.daemon = False
    process.start()


def run_tensorflow():
    # Second check runs in the child process (the log label still says "main()").
    logger.info("main(): tf.test.is_gpu_available(): %s", tf.test.is_gpu_available())


if __name__ == '__main__':
    main()
Output:
2020-04-17 05:01:37.834131: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-17 05:01:37.855703: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-04-17 05:01:37.856170: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55bb442b0560 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-17 05:01:37.856184: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-17 05:01:37.857492: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-17 05:01:37.940480: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:01:37.940856: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55bb44337c50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-17 05:01:37.940872: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1660 SUPER, Compute Capability 7.5
2020-04-17 05:01:37.940974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:01:37.941214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.785
pciBusID: 0000:01:00.0
2020-04-17 05:01:37.941410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-17 05:01:37.942234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-17 05:01:37.942998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-17 05:01:37.943193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-17 05:01:37.944143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-17 05:01:37.944915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-17 05:01:37.947293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-17 05:01:37.947399: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:01:37.947708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:01:37.947945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-04-17 05:01:37.947970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-17 05:01:37.948442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-17 05:01:37.948452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-04-17 05:01:37.948457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-04-17 05:01:37.948548: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:01:37.948813: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:01:37.949069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 5450 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:__main__:main(): tf.test.is_gpu_available(): True
2020-04-17 05:01:37.954340: E tensorflow/stream_executor/cuda/cuda_driver.cc:1247] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
2020-04-17 05:01:37.954384: E tensorflow/stream_executor/cuda/cuda_driver.cc:1247] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
INFO:__main__:main(): tf.test.is_gpu_available(): False
The important part (I think) is
INFO:__main__:main(): tf.test.is_gpu_available(): True
followed by
INFO:__main__:run_tensorflow(): tf.test.is_gpu_available(): False
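Why can't I get a handle to the GPU from the child process?
Edit: It may be useful to know that if I wait to import tensorflow until after I have forked the process, then I can see the GPU: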
import multiprocessing
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main():
    #logger.info("main(): tf.test.is_gpu_available(): %s", tf.test.is_gpu_available())
    process = multiprocessing.Process(target=run_tensorflow, args=())
    process.daemon = False
    process.start()


def run_tensorflow():
    import tensorflow as tf
    logger.info("main(): tf.test.is_gpu_available(): %s", tf.test.is_gpu_available())


if __name__ == '__main__':
    main()
2020-04-17 05:08:25.256372: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-17 05:08:25.279630: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-04-17 05:08:25.280028: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5606fe0d0170 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-17 05:08:25.280047: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-17 05:08:25.281970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-17 05:08:25.370354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:08:25.370696: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5606fe157820 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-17 05:08:25.370713: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1660 SUPER, Compute Capability 7.5
2020-04-17 05:08:25.370815: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:08:25.371047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.785
pciBusID: 0000:01:00.0
2020-04-17 05:08:25.371225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-17 05:08:25.372088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-17 05:08:25.372890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-17 05:08:25.373070: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-17 05:08:25.374055: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-17 05:08:25.374872: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-17 05:08:25.377440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-17 05:08:25.377538: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:08:25.377835: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:08:25.378052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-04-17 05:08:25.378082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-17 05:08:25.378552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-17 05:08:25.378564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-04-17 05:08:25.378569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-04-17 05:08:25.378638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:08:25.378883: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-17 05:08:25.379117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 5450 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:__main__:main(): tf.test.is_gpu_available(): True
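By default, Tensorflow is greedy about GPU memory allocation. "Limiting GPU memory growth" describes several ways to restrict GPU allocation, which should allow multiple Tensorflow programs to share one GPU. However, I have no idea how Tensorflow handles fork(), and I find it hard to believe it would work properly, especially when the GPU is already active. Perhaps call fork() before importing Tensorflow (or at least before using it)?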
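If moving the import is awkward, one alternative that may be worth trying is the "spawn" start method, which starts a fresh Python interpreter for the child instead of fork()ing the parent. The sketch below is only a minimal illustration under that assumption; check_gpu, spawn_context and run_in_fresh_process are hypothetical helper names, not part of Tensorflow or of the code in the question, and I have not verified it against this exact Tensorflow/CUDA setup.

```python
import multiprocessing


def check_gpu():
    """Runs in the child: import Tensorflow here so it is set up in this process."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # Tensorflow is not installed in this environment
    return tf.test.is_gpu_available()


def spawn_context():
    # Assumption: "spawn" starts a new interpreter rather than fork()ing,
    # so the child does not inherit GPU state already set up in the parent.
    return multiprocessing.get_context("spawn")


def run_in_fresh_process(target):
    """Run `target` in a spawned child process and return its result."""
    with spawn_context().Pool(processes=1) as pool:
        return pool.apply(target)


if __name__ == "__main__":
    print("GPU available in child:", run_in_fresh_process(check_gpu))
```

The `if __name__ == "__main__":` guard matters here, because a spawned child re-imports the main module, and the target function has to be defined at module level so that it can be pickled.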