Installing Tensorflow 2.4 with GPU support on Ubuntu 20.04, without sudo

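I have access to a virtual machine with an Ubuntu 20.04 setup and GPUs. The sysadmin has already installed the latest Cuda drivers, but unfortunately that is not enough to use the GPU in Tensorflow, because each version of TF can be very picky about the specific Cuda Toolkit + CuDNN versions. I don't have sudo rights, so I need to install everything locally.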

nvidia-smi

returns Driver Version: 465.19.01 CUDA Version: 11.3
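
Note that the "CUDA Version" printed by nvidia-smi is the highest CUDA version the installed driver supports, not a toolkit that is installed on the machine. A minimal sketch for pulling both numbers out of the output, assuming it keeps the `Driver Version:` and `CUDA Version:` labels shown above (the function name is made up for illustration):

```python
import re


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version); a field that is not found is None."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


if __name__ == "__main__":
    # Sample text taken from the output quoted above.
    print(parse_nvidia_smi_header("Driver Version: 465.19.01 CUDA Version: 11.3"))
```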

python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU');"

2021-05-11 10:56:26.7737279: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:26.7737338: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-11 10:56:28.313896: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 10:56:28.324232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.324707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.35293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:06.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325663: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325006: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326028: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326117: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326230: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

which shows that the GPU will not be used in TF applications.
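
The same check can be done without importing Tensorflow at all. Here is a rough sketch that tries to load each library named in the warnings through `ctypes`, which goes through the same dynamic loader. The list is copied from the log above; the function name is my own.

```python
import ctypes

# Library names copied from the dso_loader warnings in the log above.
TF24_CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def missing_libraries(names):
    """Return the subset of names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_libraries(TF24_CUDA_LIBS):
        print("cannot load:", name)
```

An empty output means every library was found, which is the state the steps below try to reach.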

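I had to spend some time setting up the VM, so I'll post my solution step by step below.

Instructions for installing Tensorflow 2.4.x (tested with 2.4.1) in an Ubuntu 20.04 environment without admin rights. It is assumed that the sysadmin has already installed the latest Cuda drivers. The procedure consists of installing the Cuda 11.0 toolkit + CuDNN 8.2.0.

The following instructions install CUDA 11.0 under the directory /home/pherath/cuda_toolkits/cuda-11.0 (tested to work with Tensorflow 2.4.1), without sudo rights.

Step 1. Download CUDA 11.0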

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
chmod +x cuda_11.0.2_450.51.05_linux.run

Step 2, Option 1: for a quick, automated install, use the following

./cuda_11.0.2_450.51.05_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0
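
If you script this for several users or toolkit versions, it may be easier to build the command from your own home directory instead of hard-coding the path. A small sketch that uses only the installer flags shown above; the helper name and the use of `Path.home()` are my own choices, not something the installer requires:

```python
from pathlib import Path


def build_toolkit_install_cmd(runfile, toolkit_path, tmpdir="."):
    """Build the silent, toolkit-only install command as an argument list."""
    return [
        f"./{runfile}",
        "--silent",
        f"--tmpdir={tmpdir}",
        "--toolkit",
        f"--toolkitpath={toolkit_path}",
    ]


if __name__ == "__main__":
    # Install location under the current user's home directory.
    toolkit_path = Path.home() / "cuda_toolkits" / "cuda-11.0"
    cmd = build_toolkit_install_cmd("cuda_11.0.2_450.51.05_linux.run", toolkit_path)
    print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the directory that contains the .run file.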

Step 2, Option 2: here is the interactive, step-by-step route

./cuda_11.0.2_450.51.05_linux.run

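Continue, then accept the EULA.

Check only CUDA Toolkit and uncheck everything else. Then go to "Options".

Enter Toolkit Options.

Uncheck all options, then go to "Change Toolkit Install Path" and replace it with /home/pherath/cuda_toolkits/cuda-11.0. Once this is done, proceed with the installation.

Step 3. Download the CUDA 11.0 patch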

wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
chmod +x cuda_11.0.3_450.51.06_linux.run

Step 4, Option 1: quick silent mode

./cuda_11.0.3_450.51.06_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

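Step 4, Option 2: GUI mode. Repeat the exact steps of Step 2, Option 2.

The installation may report an error. When I checked the logs, the error I saw suggested a bug in the installation script itself. The only problematic item is a symbolic link for one file.

[ERROR]: boost::filesystem::create_symlink: File exists: "libcuinj64.so.11.0", "/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib/libcuinj64.so"

I have run into several other single errors of this kind on various distributions (e.g. on Ubuntu 16.04):
libcuinj64.so.11.0, libaccinj64.so.11.0, libnvrtc-builtins.so.11.0

This error can be corrected with the following 2 lines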

cd /home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib # move to the dir of the offending line
ln -s libaccinj64.so.11.0 libaccinj64.so #reorder such that symbolic link and target are in correct order (we need libaccinj64.so -> libaccinj64.so.11.0)
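
Since the offending file differs between attempts, the same repair can be written so that it is safe to run repeatedly. This is only a sketch under the assumption that the unversioned name should be a relative symlink to the versioned file in the same directory. `ensure_symlink` is a name I made up.

```python
import os


def ensure_symlink(lib_dir, target_name, link_name):
    """Make lib_dir/link_name a relative symlink to target_name.

    Returns True if a link was created or replaced, False if it was already correct.
    """
    link_path = os.path.join(lib_dir, link_name)
    if os.path.islink(link_path):
        if os.readlink(link_path) == target_name:
            return False  # already points at the right file
        os.remove(link_path)  # drop the wrong link before recreating it
    elif os.path.exists(link_path):
        # A regular file with that name is unexpected; do not overwrite it.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target_name, link_path)
    return True
```

For the case above this would be called as `ensure_symlink("/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib", "libaccinj64.so.11.0", "libaccinj64.so")`.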

Step 5. Download CuDNN 8.2.0

cd /home/pherath/cuda_toolkits # move back to the parent of previous dir

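You need to download the CuDNN .tgz file from the CuDNN archive; I used v8.2.0. This step requires you to create an NVIDIA account and download the file through the web interface. If the machine where you are setting up tensorflow has no GUI, I recommend using a "link redirect trace" browser add-on to capture the exact link of the file download (one is available as a Google Chrome add-on). You can trace the link on a local machine that has a GUI, then download the traced link on the VM with wget. Note that the traced link has a relatively short lifetime.

After downloading, the file name will still be obfuscated; rename it back to .tgz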

mv $some_ambiguous_name cudnn-11.3-linux-x64-v8.2.0.53.tgz

Now we extract it in the parent directory of the cuda installation directory

tar -xvzf cudnn-11.3-linux-x64-v8.2.0.53.tgz # this will extract things under a dir called 'cuda'

Now we need to copy all of the lib64 and include files into the corresponding directories under the cuda toolkit installation

cp -fv cuda/lib64/*.* cuda-11.0/lib64/.
cp -fv cuda/include/*.* cuda-11.0/include/.
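
To confirm that the headers landed in the right place, you can read the version back from the copied include directory. A minimal sketch, assuming the cuDNN 8 layout where `cudnn_version.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`. The function name is hypothetical.

```python
import re
from pathlib import Path


def read_cudnn_version(include_dir):
    """Return (major, minor, patch) from cudnn_version.h, or None if it cannot be read."""
    header = Path(include_dir) / "cudnn_version.h"
    if not header.is_file():
        return None
    text = header.read_text()
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"#define\s+{key}\s+(\d+)", text)
        if match is None:
            return None  # header does not look like the assumed layout
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Run from the parent directory used in the cp commands above.
    print(read_cudnn_version("cuda-11.0/include"))
```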

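Step 6. Create/append/prepend the PATH and LD_LIBRARY_PATH environment variables

Add the following lines to the end of ~/.bashrc (otherwise, make sure to extend the corresponding environment variables in every bash session from which you run TF scripts).

export CUDA11=/home/pherath/cuda_toolkits/cuda-11.0
export PATH=$CUDA11/bin:$PATH
export LD_LIBRARY_PATH=$CUDA11/lib64:$CUDA11/extras/CUPTI/lib64:$LD_LIBRARY_PATH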

Start a new terminal, or run

source ~/.bashrc

in each existing terminal.
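
A quick way to see whether the variables took effect is to look up one of the previously missing libraries on LD_LIBRARY_PATH. The sketch below only walks the directories in that variable; the real dynamic loader also consults other locations, so treat a miss here as a hint rather than proof. The function name is my own.

```python
import os


def find_on_library_path(lib_name, library_path=None):
    """Return the first directory in library_path that contains lib_name, else None."""
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in library_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing or doubled separators.
        if directory and os.path.exists(os.path.join(directory, lib_name)):
            return directory
    return None


if __name__ == "__main__":
    for name in ("libcudart.so.11.0", "libcudnn.so.8"):
        print(name, "->", find_on_library_path(name))
```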

Checking that the installation works

You can run the following lines to test whether TF 2.4.1 + the profiler work:

conda create -n tf python==3.7 -y  # create a python environment
conda activate tf #activate the virtual environment (here conda)
pip install tensorflow==2.4.1 # install tf 2.4.1
python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU'); tf.profiler.experimental.start('.'); tf.profiler.experimental.stop()" # test to see if TF with GPU works
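
If the one-liner still skips the GPUs, it can help to print which CUDA/cuDNN versions the installed wheel was built against and compare them with what you installed. A rough sketch, assuming `tf.sysconfig.get_build_info()` is available in your TF version and exposes `cuda_version` and `cudnn_version` keys (missing keys simply come back as None). `gpu_report` is a made-up helper name.

```python
def gpu_report():
    """Collect basic facts about the TF build and the GPUs it can see."""
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow": None}  # TF is not installed in this environment
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

An empty `gpus` list together with dso_loader warnings in the log points back to a missing library or a wrong LD_LIBRARY_PATH.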

#########################################################################

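If you want to install Cuda Toolkit 10.2 on Ubuntu 20.04 LTS, the one-line install command changes accordingly (a library path needs to be added, and the complaint about the gcc version mismatch needs to be overridden).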

./cuda_10.2.89_440.33.01_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-10.2 --librarypath=/home/pherath/cuda_toolkits/cuda-10.2 --override

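Remember that you also need to repeat this process for the patches of cuda toolkit 10.2. Afterwards, download the corresponding cuDNN and copy lib64 & include into the cuda toolkit's directory (same as in the instructions above).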

#########################################################################

If you still get errors, chances are that you don't have the right cuda/nvidia drivers installed. To fix this, you need sudo rights!

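1. First, purge all cuda/nvidia stuff (I can't add the reference due to limited reputation...); basically, run the line below with sudo rights. apt clean; apt update; apt purge cuda; apt purge nvidia-*; apt autoremove; apt install cuda

2. Follow Google's instructions at https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#ubuntu-driver-steps

3. Reboot the machine.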
