I don't know where this comes from or why this error occurs:
The cluster starts up fine from the YAML, but when I look at the logs I see this error.
Does it still work despite the error? And how can I check printed output from the Docker image?
Ray doesn't seem to "work" in the example below. I'm trying to launch the simplest possible version of an AWS Docker cluster as a proof of concept.
ray exec /home/user/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Fetched IP: xxxxxxxxx
Warning: Permanently added 'xxxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
redis_password=args.redis_password)
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
self.load_metrics)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
self.reset(errors_fatal=True)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
raise e
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
==> /tmp/ray/session_latest/logs/monitor.log <==
==> /tmp/ray/session_latest/logs/monitor.out <==
Shared connection to 18.130.107.42 closed.
Error: Command failed:
ssh -tt -i /home/joe/.ssh/aws_ubuntu_test.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ff32489f9/8dbdda48fb/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@xxxxxxxx bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it my_simple_docker_container /bin/bash -c '"'"'"'"'"'"'"'"'bash --login -c -i '"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"' )'"'"''
(base) xxxxx:~/RAY_AWS_DOCKER/3xxxxx/aws_docker_simple$ ray exec /home/xxxxxxxxx/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xxxxxx
Warning: Permanently added 'xxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
redis_password=args.redis_password)
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
self.load_metrics)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
self.reset(errors_fatal=True)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
raise e
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
Dockerfile:
FROM continuumio/miniconda3:4.7.10
CMD ["mkdir", "hello_folder"]
CMD ["echo", "Hello StackOverflow!"]
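A side note on the Dockerfile above: Docker only honors the last CMD instruction in an image, so the `mkdir` line is effectively ignored. If the intent is to create the folder at build time and print the message at run time, a minimal sketch would be:

```dockerfile
FROM continuumio/miniconda3:4.7.10
# RUN executes at build time and its result is baked into the image
RUN mkdir hello_folder
# Only the last CMD survives; it runs when the container starts
CMD ["echo", "Hello StackOverflow!"]
```

The output of the container's CMD can then be inspected with `docker logs my_simple_docker_container`.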
yaml:
cluster_name: simple
min_workers: 0
max_workers: 2
docker:
    image: "xxxxxx/simple"
    container_name: "my_simple_docker_container"
    pull_before_run: True
idle_timeout_minutes: 5
initialization_commands:
    # - curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
    # - bash anaconda.sh
    # - conda install python=3.8
    - sudo apt-get update
    - sudo apt-get upgrade -y
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install discord
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
file_mounts_sync_continuously: False
auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200
worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    InstanceMarketOptions:
        MarketType: spot
file_mounts: {
    # "/path1/on/remote/machine": "/path1/on/local/machine",
    # "/path2/on/remote/machine": "/path2/on/local/machine",
}
setup_commands:
    - conda install python=3.7
    - conda create --name ray
    - conda activate ray
    - conda install --name ray pip
    - pip install --upgrade pip
    - pip install discord
    - pip install ray
head_setup_commands:
    - pip install boto3==1.4.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
This is caused by a Ray version mismatch. For example, it works if you run pip install ray==1.0.
The better solution is to make sure the Ray version on the cluster's head node is the same as your local Ray version.
You can check the version locally with:
ray --version
and on the cluster after connecting with:
ray attach config.yaml
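One way to keep the two in sync is to pin the Ray version in the cluster config's setup_commands so every node installs exactly what you run locally. A sketch (ray==1.0 here is a placeholder; substitute whatever your local `ray --version` reports):

```yaml
setup_commands:
    - pip install --upgrade pip
    # Pin Ray to the exact local version so the autoscaler config
    # generated by `ray up` matches what the head node's monitor expects
    - pip install ray==1.0
```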