Launching a Ray cluster on AWS with yaml — AttributeError: 'Worker' object has no attribute 'worker_id'



I don't know where this is coming from, or why this error is happening:

The cluster launches fine with the yaml, but when I look at the logs this error shows up.

Does it still work despite the error? And how can I check the printed output from inside the docker image?

Ray doesn't seem to do any "work" in the example below. I'm trying to launch the simplest possible version of an aws-docker cluster as a proof of principle.

ray exec /home/user/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Fetched IP: xxxxxxxxx
Warning: Permanently added 'xxxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
redis_password=args.redis_password)
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
self.load_metrics)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
self.reset(errors_fatal=True)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
raise e
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
==> /tmp/ray/session_latest/logs/monitor.log <==
==> /tmp/ray/session_latest/logs/monitor.out <==
Shared connection to 18.130.107.42 closed.
Error: Command failed:
ssh -tt -i /home/joe/.ssh/aws_ubuntu_test.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ff32489f9/8dbdda48fb/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@xxxxxxxx bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  my_simple_docker_container /bin/bash -c '"'"'"'"'"'"'"'"'bash --login -c -i '"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"' )'"'"''
(base) xxxxx:~/RAY_AWS_DOCKER/3xxxxx/aws_docker_simple$  ray exec /home/xxxxxxxxx/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xxxxxx
Warning: Permanently added 'xxxxxxxx' (ECDSA) to the list of known hosts.

==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
redis_password=args.redis_password)
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
self.load_metrics)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
self.reset(errors_fatal=True)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
raise e
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'

Dockerfile:

FROM continuumio/miniconda3:4.7.10
CMD ["mkdir", "hello_folder"]
CMD ["echo", "Hello StackOverflow!"]

yaml:

cluster_name: simple
min_workers: 0
max_workers: 2

docker:
    image: "xxxxxx/simple"
    container_name: "my_simple_docker_container"
    pull_before_run: True

idle_timeout_minutes: 5

initialization_commands:
    # - curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
    # - bash anaconda.sh
    # - conda install python=3.8
    - sudo apt-get update
    - sudo apt-get upgrade
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install discord
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f

provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a

file_mounts_sync_continuously: False

auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    InstanceMarketOptions:
        MarketType: spot

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

setup_commands:
    - conda install python=3.7
    - conda create --name ray
    - conda activate ray
    - conda install --name ray pip
    - pip install --upgrade pip
    - pip install discord
    - pip install ray

head_setup_commands:
    - pip install boto3==1.4.8

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

This is caused by a problem with the ray version. For example, if you do pip install ray==1.0, it works.
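One way to apply that fix (a sketch; ray==1.0 is just the working version mentioned above, and the surrounding keys mirror the setup_commands section of the question's yaml) is to pin the version in the cluster config, so every node installs the same ray instead of whatever is latest:

```yaml
setup_commands:
    - pip install --upgrade pip
    - pip install ray==1.0   # pin to a known-working version instead of the latest
```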

A better solution is to make sure the ray version on the cluster is the same as your local ray version.

You can check this by running:

ray --version

locally, and also on the cluster, which you can reach with:

ray attach config.yaml
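Once you have both outputs, comparing them is straightforward. A minimal sketch, assuming `ray --version` prints a string of the form "ray, version X.Y.Z" (the two hard-coded strings below are placeholders for the real output from your workstation and from inside the attached cluster session):

```shell
#!/bin/sh
# Placeholders: substitute the output of `ray --version` run locally
# and inside the session opened by `ray attach config.yaml`.
local_out="ray, version 1.0.0"
cluster_out="ray, version 1.0.0"

# Strip the "ray, version " prefix to get the bare version numbers
local_ver="${local_out#ray, version }"
cluster_ver="${cluster_out#ray, version }"

if [ "$local_ver" = "$cluster_ver" ]; then
    echo "versions match: $local_ver"
else
    echo "version mismatch: local=$local_ver cluster=$cluster_ver"
fi
```

If the two versions differ, reinstall one side (e.g. `pip install ray==<version>`) so they agree before launching the cluster again.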
