hadoop distcp cannot copy from HDFS to S3 (Snowball)

We have a Snowball configured on an internal staging node with the endpoint http://10.91.16.213:8080. This all works; I can even list the files on the Snowball with the s3 CLI:

aws s3 ls my-bucket/data/ --endpoint-url=http://10.91.16.213:8080

Now I am trying to copy data from HDFS to the S3 Snowball using the hadoop distcp command. First, I tested hadoop distcp by copying some files into a real S3 test bucket in my AWS account:

hadoop distcp \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.access.key=AKIAUPWDYDZTSGWUWJWN \
  -Dfs.s3a.secret.key=<my-secret> \
  hdfs://path/to/data/ \
  s3a://test-bucket-anum/

The command above runs fine and starts the copy job on the Hadoop cluster. Now, to copy to my internal Snowball, all I should have to do is change the endpoint. This is what I tried:

hadoop distcp \
  -Dfs.s3a.endpoint=http://10.91.16.213:8080 \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.access.key=AKIACEMGMYDQNJXGQ2DEOBXG42SQCFR2ZJFTDED3HX3KLVTLOIN6AH3FSDHUF \
  -Dfs.s3a.secret.key=<snowball-secret> \
  hdfs://path/to/data/ \
  s3a://my-bucket/

The above command fails with the following error:

20/09/02 19:20:22 INFO s3a.S3AFileSystem: Caught an AmazonClientException, which means the client encountered a serious internal problem while trying to communicate with S3, such as not being able to access the network.
20/09/02 19:20:22 INFO s3a.S3AFileSystem: Error Message: {}com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:217)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 12 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field: 
true
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 18 more
20/09/02 19:20:22 ERROR tools.DistCp: Invalid arguments: 
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
[stack trace identical to the one above]
Invalid arguments: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                Reuse existing data in target files and append new
                        data to them if possible
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -diff <arg>            Use snapshot diff report to identify the
                        difference between source and target
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are
                        saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with
                        hftps://
 -overwrite             Choose to overwrite target files unconditionally,
                        even if they exist.
 -p <arg>               preserve status (rbugpcaxt)(replication,
                        block-size, user, group, permission,
                        checksum-type, ACL, XATTR, timestamps). If -p is
                        specified with no <arg>, then preserves
                        replication, block size, user, group, permission,
                        checksum type and timestamps. raw.* xattrs are
                        preserved when both the source and destination
                        paths are in the /.reserved/raw hierarchy (HDFS
                        only). raw.* xattrpreservation is independent of
                        the -p flag. Refer to the DistCp documentation for
                        more details.
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                        bytes
 -skipcrccheck          Whether to skip CRC checks between source and
                        target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work
                        based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic
                        commit
 -update                Update target, copying only missingfiles or
                        directories
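The nested exception points at the actual failure: the Snowball's ListObjects response apparently carries whitespace inside the IsTruncated element ("\ntrue"), and the SDK's SAX handler rejects anything that is not exactly "true" or "false". A minimal sketch of that strict check, as a hypothetical reproduction (the function and response below are illustrative, not the SDK's actual code):

```python
import xml.etree.ElementTree as ET

# Illustrative ListObjects response fragment with whitespace inside
# <IsTruncated>, mimicking what the error message suggests the
# Snowball returned.
RESPONSE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<ListBucketResult>"
    "<Name>my-bucket</Name>"
    "<IsTruncated>\ntrue</IsTruncated>"
    "</ListBucketResult>"
)

def parse_is_truncated(text):
    # Strict comparison, as the stack trace implies the SDK does.
    if text == "true":
        return True
    if text == "false":
        return False
    raise RuntimeError("Invalid value for IsTruncated field: " + text)

raw = ET.fromstring(RESPONSE).find("IsTruncated").text

try:
    parse_is_truncated(raw)              # fails, like the distcp job
except RuntimeError as err:
    print(err)

print(parse_is_truncated(raw.strip()))  # a lenient parser would accept it
```

This mirrors why the job dies even though the HTTP response code is 200: the transfer succeeds, but the payload does not survive strict parsing.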

Here are some other Hadoop configurations I also tried, without success:

-Dfs.s3a.connection.ssl.enabled=false (since my endpoint is HTTP)

-Dfs.s3a.region=eu-west-1

Am I missing something?

Update:

Since the error message also includes "Invalid arguments:", I thought I might be passing some invalid characters in the arguments, so I tried putting these options in /etc/hadoop/conf/core-site.xml instead, like this:

<property>
  <name>fs.s3a.endpoint</name>
  <value>http://10.91.16.213:8080</value>
</property>
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>AKIACEMGMYDQNJXGQ2DEOBXG42SQCFR2ZJFTDED3HX3KLVTLOIN6AH3FSDHUF</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value><snowball-secret></value>
</property>

But I got the same error message. :(
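As an aside, one s3a property that is often needed with non-AWS S3 endpoints (which typically cannot serve virtual-hosted-style bucket URLs) is path-style access. This is a hedged suggestion only and does not address the XML parsing failure above:

```xml
<!-- Assumption: many third-party S3 endpoints require path-style requests -->
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
```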

Update 2:

After reading this post, it looks like an S3 XML parsing problem during ListObjects. The AWS Java client has the option .withEncodingType("url"); for this, but I could not find anything similar for hadoop distcp.
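For context, encoding-type=url asks S3 to percent-encode object keys in the listing, so that keys containing characters that are awkward in XML (control characters, for example) survive the round trip; the client decodes the keys again after parsing. A small illustration with Python's standard library (the key name is made up):

```python
from urllib.parse import quote, unquote

# A key containing a control character (0x01) and a space, both of which
# can confuse naive XML handling when returned verbatim in a listing.
key = "data/report\x01 2020.csv"

# What the server would send back with encoding-type=url.
encoded = quote(key, safe="/")
print(encoded)  # data/report%01%202020.csv

# The client URL-decodes the keys after parsing the XML.
assert unquote(encoded) == key
```

This is only the client-side half of the mechanism; the point is that the whitespace-in-IsTruncated failure above is a different problem, which encoding-type would not fix.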

The s3a connector does not support the S3 Snowball device, and it will not until somebody sits down and implements all of HADOOP-14710.

As of February 2022, nobody has done that. If you, or anyone else reading this page, want the feature:

  1. Check the JIRA to see whether it has been completed; if it has, use a Hadoop version that includes the feature.
  2. If it has not been fixed at all, this is your opportunity to make a contribution to the community.
