使用Apache Beam DynamoDBIO从Dynamo读取特定记录



我有一个从DynamoDB读取数据的Apache Beam管道。为了读取数据,我使用Apache Beam DynamoDBIO SDK。我需要在我的用例中读取特定的/过滤数据,这意味着我必须在DynamoDBIO中使用filterExpression。我当前的代码如下,

Map<String, AttributeValue> expressionAttributeValues = new HashMap<>();
expressionAttributeValues.put(":message", AttributeValue.builder().s("Ping").build());
pipeline
.apply(DynamoDBIO.<List<Map<String, AttributeValue>>>read()
.withClientConfiguration(DynamoDBConfig.CLIENT_CONFIGURATION)
.withScanRequestFn(input -> ScanRequest.builder().tableName("SiteProductCache").totalSegments(1)
.filterExpression("KafkaEventMessage = :message")
.expressionAttributeValues(expressionAttributeValues)
.projectionExpression("key, KafkaEventMessage")
.build())
.withScanResponseMapperFn(new ResponseMapper())
.withCoder(ListCoder.of(MapCoder.of(StringUtf8Coder.of(), AttributeValueCoder.of())))
)
.apply(...)
----
static final class ResponseMapper implements SerializableFunction<ScanResponse, List<Map<String, AttributeValue>>> {
@Override
public List<Map<String, AttributeValue>> apply(ScanResponse input) {
if (input == null) {
return Collections.emptyList();
}
return input.items();
}
}

当执行代码时,我得到以下错误,

Exception in thread "main" java.lang.IllegalArgumentException: Forbidden IOException when writing to OutputStream
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:89)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:70)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:55)
at org.apache.beam.sdk.transforms.Create$Values$CreateSource.fromIterable(Create.java:413)
at org.apache.beam.sdk.transforms.Create$Values.expand(Create.java:370)
at org.apache.beam.sdk.transforms.Create$Values.expand(Create.java:277)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:499)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:56)
at org.apache.beam.sdk.io.aws2.dynamodb.DynamoDBIO$Read.expand(DynamoDBIO.java:301)
at org.apache.beam.sdk.io.aws2.dynamodb.DynamoDBIO$Read.expand(DynamoDBIO.java:172)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:482)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:177)
at some_package.beam_state_storage.dynamodb.DynamoDBPipelineDefinition.run(DynamoDBPipelineDefinition.java:40)
at some_package.beam_state_storage.dynamodb.DynamoDBPipelineDefinition.main(DynamoDBPipelineDefinition.java:28)
Caused by: java.io.NotSerializableException: software.amazon.awssdk.core.util.DefaultSdkAutoConstructList
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1197)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1582)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1539)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1448)
Caused by: java.io.NotSerializableException: software.amazon.awssdk.core.util.DefaultSdkAutoConstructList

有人知道如何解决这个问题,或者知道读取和过滤数据的正确方法吗?我对Apache Beam的东西有点陌生,并感谢任何指导。

我认为这里的问题是,您试图在lambda内部使用外部成员,为此,需要序列化父实例,但有些成员没有实现Serializable(类似于Apache Beam:由于PipelineOptions不可序列化,无法序列化DoFnWithExecutionInformation(。

也许expressionAttributeValues本身就是问题的原因,我不确定你的帖子中提到的DefaultSdkAutoConstructList是什么。

尝试用作用域良好的静态类替换lambda,或者如果可能,在lambda本身内部初始化expressionAttributeValues,而不必执行DoFn。

本文档将有助于理解此处的根本问题:https://beam.apache.org/documentation/programming-guide/#user-代码的可串行性。

相关内容

  • 没有找到相关文章

最新更新