When I try to create a simple dataset and print it, I get the following error message.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = (SparkSession
    .builder
    .appName("SparkSessionExample")
    .master("local[4]")
    .config("spark.sql.warehouse.dir", "target/spark-warehouse")
    .config("spark.driver.bindAddress", "localhost")
    .getOrCreate())
# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
df.show()
File "/Users/USERNAME/server/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 267, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.6 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I have tried several ways to reset the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON variables, but none of them worked. I hope someone who has solved this problem can help me fix it!
PYSPARK_PYTHON is the Python environment located on the executors; the executors use it to run your Spark code, for example:
df = spark.createDataFrame(vals, columns)
df.show()
PYSPARK_DRIVER_PYTHON is the Python environment located on the driver; the driver uses it to run your main Python process, for example:
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
Your PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON versions do not match. I would make sure both Python installations exist on the driver and the executors, and that they are the same version.
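One straightforward way to keep them in sync is to point both variables at the interpreter that is running your script, before the session is created. A minimal sketch (pinning to sys.executable assumes the driver-side interpreter is also the one the workers should use, which holds in local mode):

import os
import sys

# Pin both the driver and the workers to the interpreter running this
# script, so their minor versions cannot diverge (works in local[...] mode).
# This must happen before the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("SparkSessionExample")
    .master("local[4]")
    .getOrCreate())

Equivalently, you can export PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in your shell (or in conf/spark-env.sh) before launching the application.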
If your executors do not have any Python at all, you can ship the PYSPARK_PYTHON dependencies to the executors.
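For pure-Python dependencies, one way to do this from inside the script is SparkContext.addPyFile; the archive name below is hypothetical. Shipping a complete interpreter is a different matter and is typically handled by packing an environment and distributing it with spark-submit --archives (on YARN) rather than from inside the script.

# A minimal sketch: distribute extra Python modules to the executors.
# "deps.zip" is a hypothetical archive containing your packages.
spark.sparkContext.addPyFile("deps.zip")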