Azure Machine Learning service job submission fails with: CondaHTTPError: HTTP 000 CONNECTION FAILED



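I am trying to deploy an AML service with AKS using the repo https://github.com/microsoft/MLAKSDeployAML/.

I set it up on an NC6_v2 DSVM machine. After struggling to get conda working, I finally completed the environment setup and started running the notebooks.

I submit the experiment and then wait on run.wait_for_completion(show_output=True), and it bombs out with an HTTP error. The full control log is attached below.

Is this related to the machine being a GPU machine, or is something else going on with the service?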

Streaming log file azureml-logs/60_control_log.txt
Starting the daemon thread to refresh tokens in background for process with pid = 13317
nvidia-docker is installed on the target. Using nvidia-docker for docker operations.
Running: ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_checker.sh']
Materialized image not found on target: azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3

Logging experiment preparation status in history service.
Running: ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_builder.sh']
Running: ['nvidia-docker', 'build', '-f', 'azureml-environment-setup/Dockerfile', '-t', 'azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3', '.']
Sending build context to Docker daemon  410.1kB
Step 1/15 : FROM continuumio/miniconda3@sha256:54eb3dd4003f11f6a651b55fc2074a0ed6d9eeaa642f1c4c9a7cf8b148a30ceb
---> 4a51de2367be
Step 2/15 : USER root
---> Using cache
---> 42491a367cef
Step 3/15 : RUN mkdir -p $HOME/.cache
---> Using cache
---> 0771da9ffb76
Step 4/15 : WORKDIR /
---> Using cache
---> a8db57273ffb
Step 5/15 : COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
---> Using cache
---> b2a669b740ca
Step 6/15 : RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
---> Using cache
---> 1e430aeb68b0
Step 7/15 : COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
---> Using cache
---> 0c6a9fafa84b
Step 8/15 : RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_6303d702d8163bbfc0017533e979d4a3 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
---> Running in a579672607b3
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/repodata.json>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='conda.anaconda.org', port=443): Max retries exceeded with url: /conda-forge/linux-64/repodata.json (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbb8c38cda0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))"))

The command '/bin/sh -c ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_6303d702d8163bbfc0017533e979d4a3 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig' returned a non-zero code: 1

CalledProcessError(1, ['nvidia-docker', 'build', '-f', 'azureml-environment-setup/Dockerfile', '-t', 'azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3', '.'])
Building docker image failed with exit code: 1

Logging error in history service: Failed to run ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_builder.sh'] 
Exit code 1 
Details can be found in azureml-logs/60_control_log.txt log file.
Uploading control log...
Sending final run history status...
Logging experiment failed status in history service.
Control script execution completed

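This is a transient network issue: the log shows that conda could not resolve conda.anaconda.org during the Docker image build ("Temporary failure in name resolution"). Please retry.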
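If the failure keeps recurring, retrying by hand gets tedious. The following is a minimal sketch, not part of the MLAKSDeployAML repo, of a generic retry helper plus a quick DNS check. The names `can_resolve` and `retry` are hypothetical helpers introduced here for illustration, and the sketch assumes a failed run surfaces as a Python exception. Note that `can_resolve` runs on the host, while the error in the log occurred inside the Docker build container, so a passing check on the host does not rule out a DNS problem inside Docker.

```python
import socket
import time


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address.

    `resolver` is injectable so the check can be exercised without a network.
    """
    try:
        return len(resolver(host, port)) > 0
    except socket.gaierror:
        # Same class of failure as "[Errno -3] Temporary failure in name resolution".
        return False


def retry(action, attempts=3, delay_seconds=5.0, retry_on=(Exception,), sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` calls have failed.

    Returns the result of the first successful call and re-raises the last
    exception once the attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay_seconds}s")
            sleep(delay_seconds)


# Hypothetical usage: `experiment` and `run_config` stand in for your own objects.
#
# def submit_and_wait():
#     run = experiment.submit(run_config)
#     run.wait_for_completion(show_output=True)
#     return run
#
# if can_resolve("conda.anaconda.org"):
#     run = retry(submit_and_wait, attempts=3, delay_seconds=30)
```

A fixed delay keeps the sketch short. For a longer outage, an increasing delay between attempts may be a better fit.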
