将数据从 ADLS Gen 2 加载到 Azure Synapse



我正在尝试使用polybase外部表功能将Parquet文件从ADLS Gen2加载到Synapse。

下面是代码,但在运行创建外部表命令时,查询永远不会完成。在取消查询执行时,我看到此错误 -

由于内部错误,外部文件访问失败:"访问 HDFS 时出错:调用HdfsBridge_IsDirExist时引发 Java 异常。Java 异常消息: HdfsBridge::isDirExists - 检查目录是否存在时遇到意外错误:未知主机异常:".azuredatalakestore.dfs.core.windows.net">

SQL查询

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
DROP CREDENTIAL ADLSCredential
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
IDENTITY = 'user',
SECRET = '<secret-key>'
;
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net',
CREDENTIAL = ADLSCredential
);
-- Create an external file format for PARQUET files.  
CREATE EXTERNAL FILE FORMAT parquet  
WITH (  
FORMAT_TYPE = PARQUET,  
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
); 
CREATE EXTERNAL FILE FORMAT uncompressedcsv
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '',
DATE_FORMAT = '',
USE_TYPE_DEFAULT = False
)
);
CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
[AMOUNT_APPLIED] [float] NOT NULL,
[TRX_NUMBER] [nvarchar](50) NULL,
[SHORT_NAME] [nvarchar](50) NOT NULL,
[NAME] [nvarchar](1) NULL,
[CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
[CURRENCY_CODE] [nvarchar](50) NULL,
[FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
[CASH_RCPT_AMOUNT] [float] NULL,
[CASH_HISTORY_AMOUNT] [float] NULL,
[FUNC_AMT_HISTORY] [float] NULL,
[STATUS] [nvarchar](50) NULL,
[ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
[CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
[GL_DATE] [datetime2](7) NULL,
[GL_PERIOD] [datetime2](7) NOT NULL,
[BATCH_GL_DATE] [nvarchar](1) NULL,
[EXCHANGE_RATE] [nvarchar](50) NULL,
[RECEIPT_NUMBER] [nvarchar](50) NULL,
[DEPOSIT_DATE] [datetime2](7) NULL,
[RECEIPT_DATE] [datetime2](7) NULL,
[ISSUE_DATE] [nvarchar](1) NULL,
[TYPE] [nvarchar](50) NULL,
[GL_POSTED_DATE] [datetime2](7) NULL,
[AMOUNT] [float] NULL
)
WITH
(
LOCATION='parquetfiles'
,   DATA_SOURCE = AzureDataLakeStorage
,   FILE_FORMAT = parquet
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

"创建外部数据源"命令具有属性位置

LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net'

它应该是

LOCATION = 'abfss://<container>@<storage-account>.dfs.core.windows.net'

我把它与 ADLS 第 1 代位置属性混淆了。我的坏。感谢大家抽出时间研究这个问题。将其标记为已关闭。我最终使用了 AAD 应用注册令牌而不是存储密钥。

根据错误消息,错误是由位置"镶木地板文件"引起的。

请尝试以下CREATE EXTERNAL DATA SOURCE命令:

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH
( LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net' ,
CREDENTIAL = AzureStorageCredential ,
TYPE = BLOB_STORAGE
) ;

CREATE EXTERNAL TABLE时,请使用文件或文件夹名称:

CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
[AMOUNT_APPLIED] [float] NOT NULL,
[TRX_NUMBER] [nvarchar](50) NULL,
[SHORT_NAME] [nvarchar](50) NOT NULL,
[NAME] [nvarchar](1) NULL,
[CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
[CURRENCY_CODE] [nvarchar](50) NULL,
[FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
[CASH_RCPT_AMOUNT] [float] NULL,
[CASH_HISTORY_AMOUNT] [float] NULL,
[FUNC_AMT_HISTORY] [float] NULL,
[STATUS] [nvarchar](50) NULL,
[ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
[CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
[GL_DATE] [datetime2](7) NULL,
[GL_PERIOD] [datetime2](7) NOT NULL,
[BATCH_GL_DATE] [nvarchar](1) NULL,
[EXCHANGE_RATE] [nvarchar](50) NULL,
[RECEIPT_NUMBER] [nvarchar](50) NULL,
[DEPOSIT_DATE] [datetime2](7) NULL,
[RECEIPT_DATE] [datetime2](7) NULL,
[ISSUE_DATE] [nvarchar](1) NULL,
[TYPE] [nvarchar](50) NULL,
[GL_POSTED_DATE] [datetime2](7) NULL,
[AMOUNT] [float] NULL
)
WITH
(
LOCATION='[filename]'
,   DATA_SOURCE = AzureDataLakeStorage
,   FILE_FORMAT = parquet
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

裁判:

  1. 创建外部数据源以引用 Azure Blob 存储
  2. 创建外部表: 参数:

LOCATION = 'folder_or_filepath' 指定 Hadoop 或 Azure Blob 存储中实际数据的文件夹或文件路径和文件名。

如果将 LOCATION 指定为文件夹,则从外部表中进行选择的 PolyBase 查询将从该文件夹及其所有子文件夹中检索文件。就像Hadoop一样,PolyBase不返回隐藏文件夹。它也不会返回文件名以下划线 (_( 或句点 (.( 开头的文件。

在此示例中,如果 LOCATION='/webdata/',则 PolyBase 查询将返回来自 mydata.txt 和 mydata2.txt 的行。它不会返回 mydata3.txt因为它是隐藏文件夹中的文件。它不会返回_hidden.txt,因为它是一个隐藏文件。

请注意:一个表一个文件,我们可以加载多个文件来创建外部表!

希望这有帮助。

最新更新