I'm trying to run Livy in batch mode by submitting a Python file, but it doesn't work. I tried two approaches:
- running the py file from the local filesystem, and
- copying it to HDFS... but that doesn't work either.
Please help.
hduser@tarun-ubuntu:/home/tarun/spark/examples/src/main/python$ curl -X POST -H "Content-Type: application/json" tarun-ubuntu:8998/batches --data '{"file": "file:///home/tarun/spark/examples/src/main/python/pi.py", "name": "pipy", "executorCores":1, "executorMemory":"512m", "driverCores":1, "driverMemory":"512m", "queue":"default", "args":["10"]}'
"requirement failed: Local path /home/tarun/spark/examples/src/main/python/pi.py cannot be added to user sessions."
So I moved pi.py to HDFS, at which point Livy at least accepted the curl call:
hduser@tarun-ubuntu:/home/tarun/spark/examples/src/main/python$ curl -X POST -H "Content-Type: application/json" tarun-ubuntu:8998/batches --data '{"file": "/pi.py", "name": "pipy", "executorCores":1, "executorMemory":"512m", "driverCores":1, "driverMemory":"512m", "queue":"default", "args":["10"]}'
{"id":20,"state":"running","appId":null,"appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
But when I check the log:
$ curl tarun-ubuntu:8998/batches/20/log | python -m json.tool
{
"from": 0,
"id": 20,
"log": [
"Error: Only local python files are supported: Parsed arguments:",
" master local",
" deployMode client",
" executorMemory 512m",
" executorCores 1",
" totalExecutorCores null",
" propertiesFile /home/tarun/spark/conf/spark-defaults.conf",
" driverMemory 512m",
" driverCores 1",
" driverExtraClassPath null",
" driverExtraLibraryPath null",
" driverExtraJavaOptions null",
" supervise false",
" queue default",
" numExecutors null",
" files null",
" pyFiles null",
" archives null",
" mainClass null",
" primaryResource hdfs://localhost:54310/pi.py",
" name pipy",
" childArgs [10]",
" jars null",
" packages null",
" packagesExclusions null",
" repositories null",
" verbose false",
"",
"Spark properties used, including those specified through",
" --conf and those from the properties file /home/tarun/spark/conf/spark-defaults.conf:",
" spark.driver.memory -> 512m",
" spark.executor.memory -> 512m",
" spark.driver.cores -> 1",
" spark.master -> local",
" spark.executor.cores -> 1",
"",
" .primaryResource",
"Run with --help for usage help or --verbose for debug output"
],
"total": 38
}
$ curl tarun-ubuntu:8998/batches/20 | python -m json.tool
{
"appId": null,
"appInfo": {
"driverLogUrl": null,
"sparkUiUrl": null
},
"id": 20,
"log": [
"Spark properties used, including those specified through",
" --conf and those from the properties file /home/tarun/spark/conf/spark-defaults.conf:",
" spark.driver.memory -> 512m",
" spark.executor.memory -> 512m",
" spark.driver.cores -> 1",
" spark.master -> local",
" spark.executor.cores -> 1",
"",
" .primaryResource",
"Run with --help for usage help or --verbose for debug output"
],
"state": "dead"
}
The error "Only local python files are supported" is most likely thrown by Spark, because Livy prepends the HDFS prefix to your file path by default; you can see this in the log above, where primaryResource resolved to hdfs://localhost:54310/pi.py.
You should try two things (a combined sketch follows this list):
- Add the directory the py file will be read from to the livy.file.local-dir-whitelist setting in livy.conf. According to the comments in the config file, applications "can only reference remote URIs when starting a session" by default, which is most likely why Livy falls back to HDFS when submitting the job.
- When you pass the file parameter to the REST API, use a single slash after file:. For example: {"file": "file:/home/tarun/spark/examples/src/main/python/pi.py"}. I believe that is the correct syntax.
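A minimal sketch of both fixes together, assuming livy.conf sits in your Livy installation's conf/ directory (the whitelist key livy.file.local-dir-whitelist is real Livy config; the paths and JSON body are reused from the question):

# In livy.conf: whitelist the directory holding pi.py, then restart the Livy server
livy.file.local-dir-whitelist = /home/tarun/spark/examples/src/main/python

# Resubmit with a single slash after file:
curl -X POST -H "Content-Type: application/json" tarun-ubuntu:8998/batches --data '{"file": "file:/home/tarun/spark/examples/src/main/python/pi.py", "name": "pipy", "executorCores":1, "executorMemory":"512m", "driverCores":1, "driverMemory":"512m", "queue":"default", "args":["10"]}'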
One thing to be aware of when running in cluster mode:
Note that this URL should be reachable by the Spark driver process. If the driver runs in cluster mode, it may reside on a different host, meaning the "file:" URL must exist on that node (and not on the client machine).
In other words, you may need a copy of the py file on every node in the cluster to make sure the driver can read it.
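For instance, a minimal sketch that copies the file to the same path on each worker (node1 and node2 are hypothetical hostnames; substitute your own):

# Hypothetical hostnames: replace node1/node2 with your cluster's worker nodes
for host in node1 node2; do
  scp /home/tarun/spark/examples/src/main/python/pi.py "$host":/home/tarun/spark/examples/src/main/python/
done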
Hope this helps.