Following the instructions, I'm trying to deploy my PySpark application on the Azure AKS free tier with spark.executor.instances=5:
spark-submit \
--master k8s://https://xxxxxxx-xxxxxxx.hcp.westeurope.azmk8s.io:443 \
--deploy-mode cluster \
--name sparkbasics \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=aosb06.azurecr.io/sparkbasics:v300 \
local:///opt/spark/work-dir/main.py
Everything works fine (including the application itself), except that I don't see any executor pods, only the driver pod:
kubectl get pods
NAME READY STATUS RESTARTS AGE
sparkbasics-f374377b3c78ac68-driver 0/1 Completed 0 52m
The Dockerfile is the one from the Spark distribution.
What could be the problem? Is it a resource allocation issue? The driver logs don't show any errors:
kubectl logs <driver-pod>
2021-08-12 22:25:54,332 INFO spark.SparkContext: Running Spark version 3.1.2
2021-08-12 22:25:54,378 INFO resource.ResourceUtils: ==============================================================
2021-08-12 22:25:54,378 INFO resource.ResourceUtils: No custom resources configured for spark.driver.
2021-08-12 22:25:54,379 INFO resource.ResourceUtils: ==============================================================
2021-08-12 22:25:54,379 INFO spark.SparkContext: Submitted application: SimpleApp
2021-08-12 22:25:54,403 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2021-08-12 22:25:54,422 INFO resource.ResourceProfile: Limiting resource is cpu
2021-08-12 22:25:54,422 INFO resource.ResourceProfileManager: Added ResourceProfile id: 0
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing view acls to: 185,aovsyannikov
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing modify acls to: 185,aovsyannikov
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing view acls groups to:
2021-08-12 22:25:54,475 INFO spark.SecurityManager: Changing modify acls groups to:
2021-08-12 22:25:54,475 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, aovsyannikov); groups with view permissions: Set(); users with modify permissions: Set(185, aovsyannikov); groups with modify permissions: Set()
2021-08-12 22:25:54,717 INFO util.Utils: Successfully started service 'sparkDriver' on port 7078.
2021-08-12 22:25:54,781 INFO spark.SparkEnv: Registering MapOutputTracker
2021-08-12 22:25:54,818 INFO spark.SparkEnv: Registering BlockManagerMaster
2021-08-12 22:25:54,843 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-08-12 22:25:54,844 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
2021-08-12 22:25:54,848 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2021-08-12 22:25:54,862 INFO storage.DiskBlockManager: Created local directory at /var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404/blockmgr-c51b9095-5426-4a00-b17a-461de2b80357
2021-08-12 22:25:54,892 INFO memory.MemoryStore: MemoryStore started with capacity 413.9 MiB
2021-08-12 22:25:54,909 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2021-08-12 22:25:55,023 INFO util.log: Logging initialized @3324ms to org.sparkproject.jetty.util.log.Slf4jLog
2021-08-12 22:25:55,114 INFO server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_275-b01
2021-08-12 22:25:55,139 INFO server.Server: Started @3442ms
2021-08-12 22:25:55,184 INFO server.AbstractConnector: Started ServerConnector@59b3b32{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-08-12 22:25:55,184 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
kubectl describe pod <driver-pod>
Name: sparkbasics-f374377b3c78ac68-driver
Namespace: default
Priority: 0
Node: aks-default-31057657-vmss000000/10.240.0.4
Start Time: Fri, 13 Aug 2021 01:25:47 +0300
Labels: spark-app-selector=spark-256cc7f64af9451b89e0098397980974
spark-role=driver
Annotations: <none>
Status: Succeeded
IP: 10.244.0.28
IPs:
IP: 10.244.0.28
Containers:
spark-kubernetes-driver:
Container ID: containerd://b572a4056014cd4b0520b808d64d766254d30c44ba12fc98717aee3b4814f17d
Image: aosb06.azurecr.io/sparkbasics:v300
Image ID: aosb06.azurecr.io/sparkbasics@sha256:965393784488025fffc7513edcb4a62333ba59a5ee3076346fd8d335e1715213
Ports: 7078/TCP, 7079/TCP, 4040/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
driver
--properties-file
/opt/spark/conf/spark.properties
--class
org.apache.spark.deploy.PythonRunner
local:///opt/spark/work-dir/main.py
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 13 Aug 2021 01:25:51 +0300
Finished: Fri, 13 Aug 2021 01:56:40 +0300
Ready: False
Restart Count: 0
Limits:
memory: 1433Mi
Requests:
cpu: 1
memory: 1433Mi
Environment:
SPARK_USER: aovsyannikov
SPARK_APPLICATION_ID: spark-256cc7f64af9451b89e0098397980974
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SB_KEY_STORAGE: <set to the key 'STORAGE' in secret 'sparkbasics'> Optional: false
SB_KEY_OPENCAGE: <set to the key 'OPENCAGE' in secret 'sparkbasics'> Optional: false
SB_KEY_STORAGEOUT: <set to the key 'STORAGEOUT' in secret 'sparkbasics'> Optional: false
SPARK_LOCAL_DIRS: /var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404
SPARK_CONF_DIR: /opt/spark/conf
Mounts:
/opt/spark/conf from spark-conf-volume-driver (rw)
/var/data/spark-1e9aa64b-e0a1-44ae-a097-ebb3c2f32404 from spark-local-dir-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-wlqjt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
spark-local-dir-1:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-conf-volume-driver:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: spark-drv-6f83b17b3c78af1f-conf-map
Optional: false
default-token-wlqjt:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-wlqjt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
It turned out that the PySpark application itself had a bug:
...
SparkSession.builder.master("local")
...
There should be no master set here:
...
SparkSession.builder
...
As simple as that :(
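For reference, a minimal sketch of the corrected session creation (the appName here is illustrative, not taken from my actual code). Spark gives properties set in code higher precedence than spark-submit flags, so the hard-coded .master("local") silently overrode the --master k8s://... argument: the driver ran everything locally inside its own pod and never requested executor pods from Kubernetes.

from pyspark.sql import SparkSession

# Do not call .master(...) here; leave it to spark-submit so the
# k8s://... master and spark.executor.instances take effect.
spark = (
    SparkSession.builder
    .appName("sparkbasics")  # illustrative name
    .getOrCreate()
)

With the master removed, the same spark-submit command should bring up the five requested executor pods.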