conda environment not activated in a bash script job on a compute cluster



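I am trying to submit a SLURM job on a compute cluster running CentOS 7. The job is a Python file (cifar100-vgg16.py) that requires tensorflow-gpu 2.8.1, which I have installed in a conda environment (tf_gpu). The bash script I submit to SLURM (our job scheduler) is shown below. The SLURM output file shows that the environment actually used is the base conda environment Python/3.6.4-foss-2018a (with tensorflow 1.10.1), not tf_gpu. Please advise on how to resolve this.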

Bash script:

#!/bin/bash --login
########## SBATCH Lines for Resource Request ##########

#SBATCH --time=00:10:00             # limit of wall clock time - how long the job will run (same as -t)
#SBATCH --nodes=1                   # the number of nodes requested
#SBATCH --ntasks=1                  # the number of tasks to run
#SBATCH --cpus-per-task=1           # the number of CPUs (or cores) per task (same as -c)
#SBATCH --mem-per-cpu=2G            # memory required per allocated CPU (or core) - amount of memory (in bytes)
#SBATCH --job-name test2            # you can give your job a name for easier identification (same as -J)

########## Command Lines to Run ##########
conda activate tf_gpu
python cifar100-vgg16.py

SLURM output file:

> /opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Tensorflow version 1.10.1
Keras version 2.1.6-tf
Scikit learn version 0.20.0
Traceback (most recent call last):
File "cifar100-vgg16.py", line 39, in <module>
print("Number of GPUs Available:", len(tensorflow.config.experimental.list_physical_devices('GPU')))
AttributeError: module 'tensorflow' has no attribute 'config'

There is an error in the job script. Replace conda activate tf_gpu with source activate tf_gpu.

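Also, I would guess you need to load a module before you can use conda. That would be something like module load anaconda. Check module avail for the list of available modules.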

However, it looks like your HPC does not need a module load, since it recognizes conda without one.
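One way to confirm which interpreter a job actually picks up is to check where python resolves on PATH after the activation step. The function below is a minimal sketch, not part of the original answer: the name python_in_env is hypothetical, and it assumes the environment lives in conda's default envs/<name> directory layout.

```shell
# Hypothetical helper: report which python is first on PATH and whether it
# belongs to the named conda environment.
# Assumption: the environment sits under .../envs/<name>/bin, conda's default layout.
python_in_env() {
    local env_name="$1"
    local py_path
    py_path="$(command -v python)" || { echo "no python on PATH" >&2; return 2; }
    echo "python resolves to: ${py_path}"
    case "${py_path}" in
        */envs/"${env_name}"/bin/python) return 0 ;;   # interpreter is inside the env
        *) return 1 ;;                                 # some other interpreter won
    esac
}
```

Calling python_in_env tf_gpu in the job script right before the python line would print the resolved path to the SLURM output file and return non-zero if an interpreter such as /opt/software/Python/3.6.4-foss-2018a is still the one being used.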

Edit: FlytingTeller said that source activate was replaced by conda activate in 2017. I am aware of that.

I do not know whether that holds on HPC systems. To support my point, here is the output from Swansea SUNBIRD when I use conda activate.

(base) hell@Dell-Precision-T7910:~$ ssh sunbird 
Last login: Wed Aug 10 15:30:29 2022 from en003013.swan.ac.uk
====================== Supercomputing Wales - Sunbird ========================
This system is for authorised users, if you do not have authorised access
please disconnect immediately, and contact Technical Support.
-----------------------------------------------------------------------------
For user guides, documentation and technical support:
Web: http://portal.supercomputing.wales
-------------------------- Message of the Day -------------------------------
SUNBIRD has returned to service unchanged.  Further information on 
the maintenance outage and future work will be distributed soon.
===============================================================================
[s.1915438@sl2 ~]$ module load anaconda/3
[s.1915438@sl2 ~]$ conda activate base
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with
$ echo ". /apps/languages/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
or, for all users, enable conda with
$ sudo ln -s /apps/languages/anaconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH.  To do so, run
$ conda activate
in your terminal, or to put the base environment on PATH permanently, run
$ echo "conda activate" >> ~/.bashrc
Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file.  You should manually remove the line that looks like
export PATH="/apps/languages/anaconda3/bin:$PATH"
^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

[s.1915438@sl2 ~]$ source activate base
(base) [s.1915438@sl2 ~]$ 

Here is the output from Cardiff HAWK when I use conda activate.

(base) hell@Dell-Precision-T7910:~$ ssh cardiff 
Last login: Tue Aug  2 09:32:44 2022 from en003013.swan.ac.uk
======================== Supercomputing Wales - Hawk ==========================
This system is for authorised users, if you do not have authorised access
please disconnect immediately, and contact Technical Support.
-----------------------------------------------------------------------------
For user guides, documentation and technical support:
Web: http://portal.supercomputing.wales
-------------------------- Message of the Day -------------------------------
- WGP Gluster mounts are now RO on main login nodes.
- WGP RW access is via Ser Cymru system or dedicated access VM.
===============================================================================
[s.1915438@cl1 ~]$ module load anaconda/
anaconda/2019.03  anaconda/2020.02  anaconda/3        
anaconda/2019.07  anaconda/2021.11  
[s.1915438@cl1 ~]$ module load anaconda/2021.11 
INFO: To setup environment run:
eval "$(/apps/languages/anaconda/2021.11/bin/conda shell.bash hook)"
or just:
source activate
[s.1915438@cl1 ~]$ conda activate
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.

[s.1915438@cl1 ~]$ conda activate base
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.

[s.1915438@cl1 ~]$ source activate base
(2021.11)[s.1915438@cl1 ~]$ 

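The conda versions here are definitely from after 2020, not 2017. Since the question is about HPC clusters, that is why I said to replace conda activate with source activate to activate the first conda environment.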

Can anyone explain this?

Edit 2: I think I have an explanation.

[s.1915438@sl2 ~]$ cat ~/.bashrc
# .bashrc
# Dynamically generated by: genconfig  (Do not edit!)
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Load saved modules
module load null
# Personal settings file
if [ -f $HOME/.myenv ]
then
source $HOME/.myenv
fi

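~/.bashrc does not source conda.sh, so the conda shell function that conda activate relies on is never defined. I think this is true of many HPC systems.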
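If editing ~/.bashrc is not an option (on SUNBIRD it is generated automatically), the job script itself can source conda.sh before activating. The following is a minimal sketch under stated assumptions: the function name activate_env is hypothetical, conda is assumed to already be on PATH (for example after a module load), and the installation is assumed to ship etc/profile.d/conda.sh, which conda 4.4 and later do.

```shell
# Hypothetical helper: make `conda activate` usable in a non-interactive
# batch shell, then activate the named environment.
# Assumption: the `conda` executable is already on PATH (e.g. via module load).
activate_env() {
    local env_name="$1"
    local conda_base
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; load the site's anaconda module first" >&2
        return 1
    fi
    # Ask conda where its base installation lives.
    conda_base="$(conda info --base)" || return 1
    # conda.sh defines the `conda` shell function that `conda activate` needs.
    source "${conda_base}/etc/profile.d/conda.sh" || return 1
    conda activate "${env_name}"
}
```

In the job script from the question, activate_env tf_gpu would take the place of the bare conda activate tf_gpu line, followed by python cifar100-vgg16.py as before. Whether a module load is needed first depends on the cluster.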
