I am using TensorFlow version:
0.12.1
The CUDA toolkit version is 8:
lrwxrwxrwx 1 root root 19 May 28 17:27 cuda -> /usr/local/cuda-8.0
As shown below, I have downloaded and installed cuDNN. However, when the following lines in my Python script are executed, I get the error message mentioned in the title:
model.fit_generator(train_generator,
steps_per_epoch= len(train_samples),
validation_data=validation_generator,
validation_steps=len(validation_samples),
epochs=9)
The detailed error message is as follows:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
File "model_new.py", line 82, in <module>
model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_epoch=initial_epoch)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
Any suggestions for resolving this error would be appreciated.
EDIT: The problem is fatal.
uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
sudo lshw -short
[sudo] password for carnd:
H/W path Device Class Description
==========================================
system HVM domU
/0 bus Motherboard
/0/0 memory 96KiB BIOS
/0/401 processor Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402 processor CPU
/0/403 processor CPU
/0/404 processor CPU
/0/405 processor CPU
/0/406 processor CPU
/0/407 processor CPU
/0/408 processor CPU
/0/1000 memory 15GiB System Memory
/0/1000/0 memory 15GiB DIMM RAM
/0/100 bridge 440FX - 82441FX PMC [Natoma]
/0/100/1 bridge 82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1 storage 82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3 bridge 82371AB/EB/MB PIIX4 ACPI
/0/100/2 display GD 5446
/0/100/3 display GK104GL [GRID K520]
/0/100/1f generic Xen Platform Device
/1 eth0 network Ethernet interface
EDIT 2:
This is an EC2 instance in the Amazon cloud, and these are all the files holding the value -1:
:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied
EDIT 3: After updating the numa_node files, the NUMA-related error disappeared. But all the other errors listed above remain, and again I get the fatal error:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
File "model_new.py", line 85, in <module>
model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_epoch=initial_epoch)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
The code that prints the message "successful NUMA node read from SysFS had negative value (-1)" is not a fatal error, just a warning. The actual error is the MemoryError in your File "model_new.py", line 85, in <module>. We need more of the source to check this error. Try making the model smaller, or run it on a server with more RAM.
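One common way to keep memory bounded with fit_generator is to yield small batches from a generator that loops forever (a finite generator that runs out also raises the StopIteration seen in the worker thread). This is only a hedged sketch: the structure of your `train_samples` and the batch size are assumptions, since the original generator code is not shown.

```python
import numpy as np

def batch_generator(samples, batch_size=32):
    """Yield (X, y) batches forever, as Keras' fit_generator expects.

    Looping endlessly avoids the StopIteration raised when a finite
    generator is exhausted, and small batches keep memory use bounded
    (each feed_dict only has to hold one batch, not the whole dataset).
    Assumes each sample is a (features, label) pair; adapt as needed.
    """
    n = len(samples)
    while True:  # loop forever; Keras stops after steps_per_epoch batches
        for offset in range(0, n, batch_size):
            batch = samples[offset:offset + batch_size]
            X = np.array([s[0] for s in batch], dtype=np.float32)
            y = np.array([s[1] for s in batch], dtype=np.float32)
            yield X, y
```

Note also that `steps_per_epoch` should normally be the number of batches per epoch (roughly `len(train_samples) // batch_size`), not the number of samples; passing the raw sample count, as in the question, inflates each epoch and the memory pressure.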
About the NUMA node warning:
https://github.com/tensorflow/tensorflow/blob/e4296aeff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855
// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal)
{...
string filename =
port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
FILE *file = fopen(filename.c_str(), "r");
if (file == nullptr) {
LOG(ERROR) << "could not open file to read NUMA node: " << filename
<< "\nYour kernel may have been built without NUMA support.";
return kUnknownNumaNode;
} ...
if (port::safe_strto32(content, &value)) {
if (value < 0) { // See http://b/18228951 for details on this path.
LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
<< value << "), but there must be at least one NUMA node"
", so returning NUMA node zero";
fclose(file);
return 0;
}
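The logic above can be mirrored in a few lines of Python to inspect what TensorFlow will see for your card (a hedged sketch: the sysfs path pattern comes from the C++ excerpt, the function names are illustrative):

```python
def effective_numa_node(raw: str) -> int:
    """Mimic TensorFlow's TryToReadNumaNode: a negative value read from
    /sys/bus/pci/devices/<pci_id>/numa_node is clamped to node 0."""
    try:
        value = int(raw.strip())
    except ValueError:
        return -1  # unreadable content (kUnknownNumaNode in the C++ code)
    return 0 if value < 0 else value

def read_gpu_numa_node(pci_bus_id: str) -> int:
    """Read the sysfs file for a given PCI bus id, e.g. '0000:00:03.0'."""
    path = "/sys/bus/pci/devices/{}/numa_node".format(pci_bus_id)
    try:
        with open(path) as f:
            return effective_numa_node(f.read())
    except OSError:
        return -1  # file missing: kernel may lack NUMA support
```

Running `read_gpu_numa_node("0000:00:03.0")` on the machine from the question would reproduce the -1-to-0 clamping that the warning describes.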
TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the ID of the GPU PCI card (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket: it has only a single CPU socket with an 8-core Xeon E5-2670 installed, so this ID should be '0' (a single NUMA node is numbered 0 in Linux), but the error message says there was a -1 value in that file!
So we know: sysfs is mounted into /sys, there is a numa_node special file, and NUMA is enabled in your Linux kernel config (check with zgrep NUMA /boot/config* /proc/config*). In fact it is enabled: CONFIG_NUMA=y, in the deb of your x86_64 4.4.0-78-generic kernel.
The special file numa_node is documented at https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your PC wrong?):
What: /sys/bus/pci/devices/.../numa_node
Date: Oct 2014
Contact: Prarit Bhargava <prarit@redhat.com>
Description:
This file contains the NUMA node to which the PCI device is
attached, or -1 if the node is unknown. The initial value
comes from an ACPI _PXM method or a similar firmware
source. If that is missing or incorrect, this file can be
written to override the node. In that case, please report
a firmware bug to the system vendor. Writing to this file
taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
reduces the supportability of your system.
There is a quick (kludge) workaround for this error: find the numa_node of your GPU and, as root, run this command after every boot, where NNNNN is the PCI ID of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):
echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node
Or just echo into every such file; it should be reasonably safe:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
Your lshw output also shows that it is not a PC, but a Xen virtual guest. There is something wrong between the Xen platform (ACPI) and the Linux PCI bus NUMA-support code.
This amends the accepted answer:
Annoyingly, the numa_node setting is reset (to value -1) every time the system reboots. To fix this more permanently, you can create a crontab entry (as root).
The following steps worked for me:
# 1) Identify the PCI-ID (with domain) of your GPU
# For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Add a crontab for root
sudo crontab -e
# Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")
This ensures that the NUMA affinity of the GPU device is set to 0 on every reboot.
Again, keep in mind that this is only a "shallow" fix, since the NVIDIA driver is not aware of it:
nvidia-smi topo -m
# GPU0 CPU Affinity NUMA Affinity
# GPU0 X 0-127 N/A
Wow! Thanks so much for this info, @normanius. This was the only solution that worked on my system (I was getting a "read-only file system" error with the other solutions). Here is the script I used (not as a cron job, but as an '/etc/local.d/numa_node.start' bash script, for use with OpenRC on sane non-systemd Linux OSes):
#!/bin/bash
for pcidev in $(lspci -D | grep 'VGA compatible controller: NVIDIA' | sed -e 's/[[:space:]].*//'); do echo 0 > /sys/bus/pci/devices/${pcidev}/numa_node; done
A numa_node.stop script is not needed because... it resets after a reboot anyway.
More about sed can be learned by running something like 'man sed' from any good Linux bash prompt, or by reading an online manpage resource (e.g. https://linux.die.net/man/1/sed).