Authentication error when accessing an Azure Blob table with PySpark



We are trying to access an Azure table to pull data for analysis. However, even though we are passing a valid SAS token, we get the following authentication error:

Access method:

import traceback
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.conf.set('fs.azure.account.key.jimtestdiag924.blob.core.windows.net', '<SAS Token>')
gps = spark.sql(f"select * from <schema>.<table_name>")

Error:

Py4JJavaError: An error occurred while calling o145.load.
: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2152)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatusInternal(NativeAzureFileSystem.java:2660)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:2605)
at org.apache.hudi.common.util.TablePathUtils.getTablePath(TablePathUtils.java:50)
at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:75)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:84)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:63)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87)
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:315)
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:185)
at com.microsoft.azure.storage.blob.CloudBlob.exists(CloudBlob.java:1994)
at com.microsoft.azure.storage.blob.CloudBlob.exists(CloudBlob.java:1981)
at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.exists(StorageInterfaceImpl.java:333)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2087)
... 23 more

However, if I load the parquet files directly before executing the query above, it works fine.

Parquet file load/query method:

spark.sql("select * from parquet.`wasb://oemdpv3prd-v1@oemdpv3prd.blob.core.windows.net/data/pipelines/<schema_name>/<folder_name>`")

Here the folder name is the same as the table name.

Please help me understand why PySpark on Azure behaves this way.

We have also met with the Azure support team, but they could not find any problem either.

If you are using a SAS token, you need to use the Spark configuration property fs.azure.sas.&lt;container-name&gt;.&lt;storage-account-name&gt;.blob.core.windows.net, not fs.azure.account.key.&lt;storage-account-name&gt;.blob.core.windows.net. The fs.azure.account.key property expects the storage account key, not a SAS token.
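To make the naming difference concrete, here is a small illustrative helper (not part of PySpark or hadoop-azure; the function name and the container name "mycontainer" are made up for this sketch) that builds the Hadoop property name for each credential type. SAS tokens are scoped to a container, so the container name appears in the property; account keys apply to the whole storage account, so it does not:

```python
from typing import Optional


def wasb_auth_property(account: str, container: Optional[str] = None) -> str:
    """Build the Spark/Hadoop property name for WASB credentials.

    Pass a container name to get the SAS-token property; omit it to get
    the account-key property.
    """
    suffix = f"{account}.blob.core.windows.net"
    if container is not None:
        # SAS token: container-scoped, so the container is part of the key.
        return f"fs.azure.sas.{container}.{suffix}"
    # Account key: account-scoped, no container component.
    return f"fs.azure.account.key.{suffix}"


# SAS-token property for the account from the question:
sas_prop = wasb_auth_property("jimtestdiag924", container="mycontainer")
# → "fs.azure.sas.mycontainer.jimtestdiag924.blob.core.windows.net"

# Account-key property (what the question's code was actually setting):
key_prop = wasb_auth_property("jimtestdiag924")
# → "fs.azure.account.key.jimtestdiag924.blob.core.windows.net"
```

You would then set the matching value on the Spark session, e.g. `spark.conf.set(sas_prop, '<SAS token>')` for a SAS token, or `spark.conf.set(key_prop, '<storage account key>')` for an account key. Mixing them up, as in the question, produces exactly the "Server failed to authenticate the request" error, because the server tries to validate the SAS token as an account-key signature.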
