As part of a data migration I am copying millions of S3 files and want to run a large number of copies in parallel. I am using the Java SDK v2 API. Whenever I try to copy a large batch of S3 files at once, I keep hitting rare, sporadic exceptions.
The exception I see most often is:
Unable to execute HTTP request: Server failed to send complete response. The channel was closed. This may have been done by the client (e.g. because the request was aborted), by the service (e.g. because there was a handshake error, the request took too long, or the client tried to write on a read-only socket), or by an intermediary party (e.g. because the channel was idle for too long).
I also get:
Unable to execute HTTP request: The channel was closed. This may have been done by the client (e.g. because the request was aborted), by the service (e.g. because there was a handshake error, the request took too long, or the client tried to write on a read-only socket), or by an intermediary party (e.g. because the channel was idle for too long).
We encountered an internal error. Please try again. (Service: S3, Status Code: 500 ...
Unable to execute HTTP request: Channel was closed before it could be written to.
Here is sample code that, for me, seems to trigger the problem most reliably. (The problem likely depends on hitting a bad/busy S3 server node, network traffic, throttling, or a race condition, so it is not easy to reproduce 100% reliably.)
NettyNioAsyncHttpClient.Builder asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
        .maxConcurrency(300)
        .maxPendingConnectionAcquires(500)
        .connectionTimeout(Duration.ofSeconds(10))
        .connectionAcquisitionTimeout(Duration.ofSeconds(60))
        .readTimeout(Duration.ofSeconds(60));

S3AsyncClient s3Client = S3AsyncClient.builder()
        .credentialsProvider(awsCredentialsProvider)
        .region(Region.US_EAST_1)
        // You get the same failures even with retries -- it just takes longer
        .overrideConfiguration(config -> config.retryPolicy(RetryPolicy.none()))
        .httpClientBuilder(asyncHttpClientBuilder)
        .build();
List<CompletableFuture<GetObjectResponse>> futures = new ArrayList<>();
for (int i = 1; i <= 500; i++) {
    String key = "zerotestfile";
    Path outFile = Paths.get("/tmp/experiment/").resolve(key + "-" + i);
    outFile.getParent().toFile().mkdirs();
    if (outFile.toFile().exists()) {
        outFile.toFile().delete();
    }
    log.info("Downloading: {} ({})", key, i);
    GetObjectRequest request = GetObjectRequest.builder()
            .bucket("my-test-bucket")
            .key(key)
            .build();
    CompletableFuture<GetObjectResponse> future = s3Client.getObject(request, AsyncResponseTransformer.toFile(outFile))
            .exceptionally(exception -> {
                log.error("Error", exception);
                return null;
            });
    futures.add(future);
}
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
The 500 KB test file was generated with: dd if=/dev/zero of=zerotestfile bs=1024 count=500
Using the retry condition (the default behaviour) does in fact fix the example above. But in the real migration, copying millions of files, I used retries, which helped, yet I still hit exactly the exceptions this example produces.
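(For the sample above, re-enabling retries just means swapping RetryPolicy.none() for the SDK default; a minimal sketch of that one change:)
// Sketch: same client as above, but with the SDK's default retry policy instead of RetryPolicy.none()
S3AsyncClient s3ClientWithRetries = S3AsyncClient.builder()
        .credentialsProvider(awsCredentialsProvider)
        .region(Region.US_EAST_1)
        .overrideConfiguration(config -> config.retryPolicy(RetryPolicy.defaultRetryPolicy()))
        .httpClientBuilder(asyncHttpClientBuilder)
        .build();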
Other details: my real migration logic uses cross-region CopyObject calls. To simplify the question, I switched the example to single-region GetObject requests. I can get that to produce errors similar to the code above, but I have to run 2500 copies with maxConcurrency 2000.
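For context, the cross-region copies look roughly like the sketch below (bucket names and keys are placeholders; this assumes a reasonably recent SDK v2 version where sourceBucket/sourceKey are available, and the async client is built for the destination bucket's region):
// Rough sketch of the cross-region copy calls (placeholder bucket names/keys)
String key = "some/key";
CopyObjectRequest copyRequest = CopyObjectRequest.builder()
        .sourceBucket("my-source-bucket")      // bucket in the source region
        .sourceKey(key)
        .destinationBucket("my-dest-bucket")   // bucket in the client's (destination) region
        .destinationKey(key)
        .build();
CompletableFuture<CopyObjectResponse> copyFuture = s3Client.copyObject(copyRequest)
        .exceptionally(exception -> {
            log.error("Copy failed for key {}", key, exception);
            return null;
        });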
I stripped my S3 configuration down to only what is needed to keep the example above from dying. I fixed the following errors with the corresponding configuration changes:
Error: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time.
Fixed by adding: .connectionAcquisitionTimeout(Duration.ofSeconds(60))
Error: Unable to execute HTTP request: connection timed out
Fixed by adding: .connectionTimeout(Duration.ofSeconds(10))
Error: ReadTimeoutException: null
Fixed by adding: .readTimeout(Duration.ofSeconds(60))
Error: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate. Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
Fixed by adding: .maxPendingConnectionAcquires(500)
Source: my example was originally based on (though heavily modified from) the code snippet in this bug report against the aws java SDK, which has apparently been fixed: https://github.com/aws/aws-sdk-java-v2/issues/1122
Note that various other related problems (usually also with the AWS Java SDK v2) show similar exceptions. I would welcome answers/comments here covering related problems. Where they are caused by AWS SDK bugs, people should open github bug reports; see https://github.com/aws/aws-sdk-java-v2/issues
To make the retry logic work properly, I had to add:
.retryCapacityCondition(null)
(see the method documentation, which says you should pass in null to disable it)
The default behaviour is that if the s3 client as a whole hits too many errors, retries are disabled globally. The problem is that I am doing a huge number of copies and errors occur regularly, yet I still want the individual requests to retry.
The solution seems obvious now, but it took me more than 2 days to figure out, mostly because the errors are so hard to reproduce reliably. 99.99% of the time it works, yet my migration would always fail, skipping roughly a hundred files out of a million. I had written my own manual retry logic (because the s3 retries were not solving the problem), and that worked, but I kept digging and found this better solution.
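For completeness, the manual retry wrapper I mean was along these lines. This is only an illustrative sketch with a hypothetical helper method (not the code I kept), and it assumes Java 12+ for exceptionallyCompose/failedFuture:
// Illustrative sketch of a hand-rolled retry wrapper (hypothetical helper, Java 12+)
private CompletableFuture<GetObjectResponse> getWithManualRetry(
        S3AsyncClient s3Client, GetObjectRequest request, Path outFile, int attemptsLeft) {
    return s3Client.getObject(request, AsyncResponseTransformer.toFile(outFile))
            .exceptionallyCompose(exception -> {
                if (attemptsLeft <= 0) {
                    return CompletableFuture.failedFuture(exception);
                }
                log.warn("S3 get failed, retrying {} ({} attempts left)", request.key(), attemptsLeft, exception);
                // AsyncResponseTransformer.toFile(Path) refuses to overwrite an existing file,
                // so remove any partial download before retrying
                outFile.toFile().delete();
                return getWithManualRetry(s3Client, request, outFile, attemptsLeft - 1);
            });
}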
I found it helpful to use a custom retry condition class that logs what it is doing, so I could see clearly that it was not working the way I thought. Once I added it, I could see that in the problem cases it was not performing the 120 retries (one every 30 seconds) at all. That is when I found retryCapacityCondition. The custom logging retry condition:
.overrideConfiguration(config ->
        config.retryPolicy(
                RetryPolicy.builder()
                        .retryCondition(new AlwaysRetryCondition())
                        .retryCapacityCondition(null)
                        .build()
        ).build()
)
private static class AlwaysRetryCondition implements RetryCondition {
    private final RetryCondition defaultRetryCondition;

    public AlwaysRetryCondition() {
        defaultRetryCondition = RetryCondition.defaultRetryCondition();
    }

    @Override
    public boolean shouldRetry(RetryPolicyContext context) {
        String exceptionMessage = context.exception().getMessage();
        Throwable cause = context.exception().getCause();
        log.debug(
                "S3 retry: shouldRetry retryCount=" + context.retriesAttempted()
                + " defaultRetryCondition=" + defaultRetryCondition.shouldRetry(context)
                + " httpstatus=" + context.httpStatusCode()
                + " " + context.exception().getClass().getSimpleName()
                + (cause != null ? " cause=" + cause.getClass().getSimpleName() : "")
                + " message=" + exceptionMessage
        );
        return true;
    }

    @Override
    public void requestWillNotBeRetried(RetryPolicyContext context) {
        log.debug("S3 retry: requestWillNotBeRetried retryCount=" + context.retriesAttempted());
    }

    @Override
    public void requestSucceeded(RetryPolicyContext context) {
        if (context.retriesAttempted() > 0) {
            log.debug("S3 retry: requestSucceeded retryCount=" + context.retriesAttempted());
        }
    }
}
For reference, this is the configuration I ended up using:
NettyNioAsyncHttpClient.Builder asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
        .maxConcurrency(500)
        .maxPendingConnectionAcquires(10000)
        .connectionMaxIdleTime(Duration.ofSeconds(600))
        .connectionTimeout(Duration.ofSeconds(20))
        .connectionAcquisitionTimeout(Duration.ofSeconds(60))
        .readTimeout(Duration.ofSeconds(120));

// Add retry behaviour
final long CLIENT_TIMEOUT_MILLIS = 600000;
final int NUMBER_RETRIES = 60;
final long RETRY_BACKOFF_MILLIS = 30000;

ClientOverrideConfiguration overrideConfiguration = ClientOverrideConfiguration.builder()
        .apiCallTimeout(Duration.ofMillis(CLIENT_TIMEOUT_MILLIS))
        .apiCallAttemptTimeout(Duration.ofMillis(CLIENT_TIMEOUT_MILLIS))
        .retryPolicy(RetryPolicy.builder()
                .numRetries(NUMBER_RETRIES)
                .backoffStrategy(
                        FixedDelayBackoffStrategy.create(Duration.of(RETRY_BACKOFF_MILLIS, ChronoUnit.MILLIS))
                )
                .throttlingBackoffStrategy(BackoffStrategy.none())
                .retryCondition(new AlwaysRetryCondition())
                // retryCapacityCondition(null) fixes the rare s3-copy-errors
                // this global max-retries was kicking in and preventing individual copy-requests from retrying
                .retryCapacityCondition(null)
                .build()
        ).build();

S3AsyncClient s3Client = S3AsyncClient.builder()
        .credentialsProvider(awsCredentialsProvider)
        .region(Region.US_EAST_1)
        .httpClientBuilder(asyncHttpClientBuilder)
        .overrideConfiguration(overrideConfiguration)
        .build();