I am trying to access GCS from PySpark running inside Docker, and for that I have the service-account JSON key file copied into the Docker container.
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
# This is required if you are using service account and set true,
spark._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
Then I set GOOGLE_APPLICATION_CREDENTIALS like below:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS']=r"abc.json"
Now, when I try to access a GCS object, it throws this error:
df= spark.read.csv("gs://integration-o9/california_housing_train.csv")
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "**111111-compute@developer.gserviceaccount.com** does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).",
"reason" : "forbidden"
} ],
"message" : "**111111-compute@developer.gserviceaccount.com** does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)."
}
That is not the service account from the JSON key file. However, if I set it like this instead:
export GOOGLE_APPLICATION_CREDENTIALS=abc.json
it works fine. Any suggestions on what to look at? I need to make this work through the os.environ property.
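(A minimal sketch of why the export route can behave differently: if the variable is set in Python before the SparkSession, and therefore the JVM, is created, the launched Java process inherits it. The absolute path below is an assumption, not the original relative "abc.json".)

import os
# Set before the SparkSession/JVM is created so the child process inherits it;
# absolute path assumed so resolution does not depend on the working directory.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/opt/keys/abc.json"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("gcs-test").getOrCreate()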
Posting the answer here: the following worked (the .p12 key needs to be set in the config section).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("test")
    .config("google.cloud.auth.service.account.enable", "true")
    .config("google.cloud.auth.service.account.email", "o9int-759@integration-3-344806.iam.gserviceaccount.com")
    .config("google.cloud.auth.service.account.keyfile", "/path/file.p12")
    .getOrCreate()
)
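For completeness, a similar builder-level sketch that keeps the original JSON key instead of a .p12; the spark.hadoop.* prefix (which forwards settings into the Hadoop configuration the GCS connector reads) and the json.keyfile property name are assumptions here, not something verified in this environment:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("test")
    # spark.hadoop.* entries are copied into the Hadoop Configuration used by the GCS connector
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    # Assumed JSON-keyfile property; path is wherever abc.json was copied inside the container
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/opt/keys/abc.json")
    .getOrCreate()
)

df = spark.read.csv("gs://integration-o9/california_housing_train.csv")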