Hadoop S3 driver fails with 403 after several successful requests



I am using the AWS S3 driver with Apache Nutch to upload files from an EC2 instance to an S3 bucket. The EC2 instance has an IAM policy attached that allows access to the bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::storage"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::storage/*"
      ]
    }
  ]
}
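As a side note, the split between the two statements matters: `s3:ListBucket` must be granted on the bucket ARN itself, while the object-level actions must be granted on keys under it (`/*`). A small stdlib-only sketch that checks this structure (the policy text is the one shown above):

```python
import json

policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["s3:ListBucket"],
     "Resource": ["arn:aws:s3:::storage"]},
    {"Effect": "Allow",
     "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:GetObjectAcl"],
     "Resource": ["arn:aws:s3:::storage/*"]}
  ]
}
""")

bucket_stmt, object_stmt = policy["Statement"]
# ListBucket is granted on the bucket itself...
assert bucket_stmt["Resource"] == ["arn:aws:s3:::storage"]
# ...while object operations are granted on the keys under it.
assert all(r.endswith("/*") for r in object_stmt["Resource"])
print("policy structure OK")
```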

It works fine at first: Nutch parses segments and writes them to the S3 bucket, but after a few segments it fails with the following error:

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: ...
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[ERROR] org.apache.nutch.crawl.CrawlDb: CrawlDb update job did not succeed, job status:FAILED, reason: NA
Exception in thread "main" java.lang.RuntimeException: CrawlDb update job did not succeed, job status:FAILED, reason: NA
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:142)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:83)

I assume the IAM policy is fine, since Nutch manages to upload several segments before failing.

My AWS-related Hadoop configuration is:

com.amazonaws.services.s3.enableV4=true
fs.s3a.endpoint=s3.us-east-2.amazonaws.com

Why am I getting this error, and how do I fix it?


Update: I am running Nutch programmatically (not from the CLI) on a single EC2 machine (not a Hadoop cluster), using the s3a filesystem to access S3 (the output path is s3a://mybucket/data). The Hadoop version is 2.7.3 and the Nutch version is 1.15.

The error above occurs when running in local mode as a side effect of S3's inconsistency.

Since S3 provides only eventual consistency for read-after-write, there is no guarantee that a file will be present in the S3 bucket when you list files or try to rename them, even if it was written just before.
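The failure mode can be pictured as a visibility race: a writer creates an object, but an immediate listing may not show it yet. A stdlib-only Python sketch of the defensive pattern (poll the listing with exponential backoff until it catches up; the flaky listing here is simulated, not a real S3 call):

```python
import time

def wait_until_visible(list_keys, key, attempts=5, delay=0.01):
    """Poll a listing function until `key` appears, backing off between tries.

    Returns True once the key is visible, False after exhausting attempts.
    """
    for i in range(attempts):
        if key in list_keys():
            return True
        time.sleep(delay * (2 ** i))  # exponential backoff
    return False

# Simulate an eventually consistent listing: the key only shows up
# on the third call.
calls = {"n": 0}
def flaky_listing():
    calls["n"] += 1
    return ["segment-001"] if calls["n"] >= 3 else []

print(wait_until_visible(flaky_listing, "segment-001"))  # True
```

This is only an illustration of why a "write, then immediately list/rename" sequence can fail in local mode; it is not something Nutch or the s3a driver does for you.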

The Hadoop team also provides a troubleshooting guide: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

If your use case requires running in local mode, I suggest the following workaround:

  1. Write the files to a local folder
  2. Sync the folder to S3 with `aws s3 sync local-folder s3://bucket-name --region region-name --delete`
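The two steps can be sketched as follows (stdlib-only; the bucket name, region, and paths are placeholders, and the `aws` CLI invocation is only assembled here rather than executed, so no credentials are needed to run the sketch):

```python
import shlex

# Step 1: point Nutch's output at a local folder instead of s3a://
local_dir = "/tmp/nutch-out"

# Step 2: mirror the finished output to S3 in one pass; --delete removes
# remote files that no longer exist locally.
bucket, region = "mybucket", "us-east-2"
cmd = ["aws", "s3", "sync", local_dir, f"s3://{bucket}/data",
       "--region", region, "--delete"]

print(shlex.join(cmd))
# aws s3 sync /tmp/nutch-out s3://mybucket/data --region us-east-2 --delete
```

Because the sync happens after Nutch has finished writing, the list/rename operations all run against the local filesystem, which sidesteps the consistency issue entirely.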
