I am new to Kafka and am currently looking into Kafka Streams, in particular joining two streams.
The examples I came across deal with fairly simple messages/text messages, so I built another small sample that is closer to a classic ETL scenario. Suppose we have two "datasets": contracts (= Vertrag) and cash flows, with a cardinality of 1 to n.
In my sample I created one topic for each of them and sent objects (Vertrag, Cashflow) to them.
I managed a first join of the two:
KStream<String, String> joined = srcVertrag.leftJoin(srcCashflow,
        (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue, /* ValueJoiner */
        JoinWindows.of(5000),
        Joined.with(
                Serdes.String(), /* key */
                Serdes.String(), /* left value */
                Serdes.String()) /* right value */
);
The result looks like this:
left={"name":"Vertrag123","vertragId":"123"}, right={"buchungstag":1560715764709,"betrag":12.0,"vertragId":"123"}
Now my questions:
- Is this the right approach?
- Should I create objects or just work with strings?
Following your hints and some further research, I came up with the test below:
- I created POJOs for "Vertrag" and "Cashflow"
- I created a serde for each of them
- I stream them as objects
- Finally, I try to join them into a wrapper class (this is where I am stuck; see the sketch below)
I could not find a sample for anything like this. Is it really that exotic?
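The wrapper class itself is just a plain value holder. It is not shown in the code below, so here is a minimal sketch of its assumed shape, derived from how it is constructed in the join:

package tki.bigdata.kafkaetl;

import tki.bigdata.domain.Cashflow;
import tki.bigdata.domain.Vertrag;

// Plain value holder for the join result. Sketch only: the shape is an
// assumption, derived from the constructor call in the join below.
public class MyValueContainer {

    private final Vertrag vertrag;
    private final Cashflow cashflow; // may be null for a left join without a match

    public MyValueContainer(Vertrag vertrag, Cashflow cashflow) {
        this.vertrag = vertrag;
        this.cashflow = cashflow;
    }

    public Vertrag getVertrag() { return vertrag; }
    public Cashflow getCashflow() { return cashflow; }
}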
package tki.bigdata.kafkaetl;
import java.time.Duration;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.ValueJoiner;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import tki.bigdata.domain.Cashflow;
import tki.bigdata.domain.Vertrag;
import tki.bigdata.serde.JsonPOJODeserializer;
import tki.bigdata.serde.JsonPOJOSerializer;
@ComponentScan(basePackages = { "tki.bigdata.domain", "tki.bigdata.config", "tki.bigdata.app" }, basePackageClasses = App.class)
@SpringBootApplication
@EnableScheduling
public class App implements CommandLineRunner {

    private static String bootstrapServers = "tobi0179.westeurope.cloudapp.azure.com:9092";

    @Autowired
    private KafkaTemplate<String, Object> template;

    // @Autowired
    // ExcelReader excelReader;

    public static void main(String[] args) {
        SpringApplication.run(App.class, args).close();
    }

    private void populateSampleData() {
        Vertrag v = new Vertrag();
        v.setVertragId("123");
        v.setName("Vertrag123");
        template.send("Vertrag", "123", v);
        // template.send("Vertrag", "124", "124;Vertrag12");

        Cashflow c = new Cashflow();
        c.setVertragId("123");
        c.setBetrag(12);
        c.setBuchungstag(new Date());
        template.send("Cashflow", "123", c);
    }

    @Override
    public void run(String... args) throws Exception {
        // populate the topics with demo data
        populateSampleData();

        Properties streamsConfiguration = new Properties();
        streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
        streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        // TODO: the following can be removed with a serialization factory
        Map<String, Object> serdeProps = new HashMap<>();

        // prepare Serde for Vertrag
        final Serializer<Vertrag> vertragSerializer = new JsonPOJOSerializer<Vertrag>();
        serdeProps.put("JsonPOJOClass", Vertrag.class);
        vertragSerializer.configure(serdeProps, false);
        final Deserializer<Vertrag> vertragDeserializer = new JsonPOJODeserializer<Vertrag>();
        serdeProps.put("JsonPOJOClass", Vertrag.class);
        vertragDeserializer.configure(serdeProps, false);
        final Serde<Vertrag> vertragSerde = Serdes.serdeFrom(vertragSerializer, vertragDeserializer);

        // prepare Serde for Cashflow
        final Serializer<Cashflow> cashflowSerializer = new JsonPOJOSerializer<Cashflow>();
        serdeProps.put("JsonPOJOClass", Cashflow.class);
        cashflowSerializer.configure(serdeProps, false);
        final Deserializer<Cashflow> cashflowDeserializer = new JsonPOJODeserializer<Cashflow>();
        serdeProps.put("JsonPOJOClass", Cashflow.class);
        cashflowDeserializer.configure(serdeProps, false);
        final Serde<Cashflow> cashflowSerde = Serdes.serdeFrom(cashflowSerializer, cashflowDeserializer);

        // streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG,
        //         TestUtils.tempDir().getAbsolutePath());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Vertrag> srcVertrag = builder.stream("Vertrag");
        KStream<String, Cashflow> srcCashflow = builder.stream("Cashflow");

        // print to sysout
        // srcVertrag.print(Printed.toSysOut());

        KStream<String, MyValueContainer> joined = srcVertrag.leftJoin(srcCashflow,
                (leftValue, rightValue) -> new MyValueContainer(leftValue, rightValue), /* ValueJoiner */
                JoinWindows.of(600),
                Joined.with(
                        Serdes.String(), /* key */
                        vertragSerde,    /* left value */
                        cashflowSerde)   /* right value */
        );
        joined.to("Output");

        final Topology topology = builder.build();
        System.out.println(topology.describe());
        final KafkaStreams streams = new KafkaStreams(topology, streamsConfiguration);
        final CountDownLatch latch = new CountDownLatch(1);

        // attach shutdown handler to catch control-c
        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Throwable e) {
            System.exit(1);
        }
        System.exit(0);
    }
}
When executed, it produces this error:
2019-06-17 22:18:31.892 ERROR 1599 --- [-StreamThread-1] o.a.k.s.p.i.AssignedStreamsTasks : stream-thread [streams-pipe-0638d359-94df-43bd-9ef7-eb6769ed8a1c-StreamThread-1] Failed to process stream task 0_0 due to the following error:
java.lang.ClassCastException: java.lang.String cannot be cast to tki.bigdata.domain.Vertrag
at org.apache.kafka.streams.kstream.internals.KStreamKStreamJoin$KStreamKStreamJoinProcessor.process(KStreamKStreamJoin.java:98) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:126) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:63) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:129) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:302) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:409) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:964) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:832) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736) [kafka-streams-2.0.1.jar:na]
Is this the right approach?

Yes.

Should I create objects or just work with strings?

Yes, work with objects rather than plain strings. Take Avro as a good example of a data format for serializing/deserializing POJOs. What you are looking for is an Avro "serde" (serializer/deserializer). Confluent, for example, provides such an Avro serde for Kafka Streams (note that this serde requires the use of Confluent Schema Registry).
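For illustration, here is a minimal sketch of what that could look like, assuming Vertrag and Cashflow are Avro-generated classes (i.e. they implement SpecificRecord), the kafka-streams-avro-serde artifact is on the classpath, and a Schema Registry runs at the hypothetical URL below. The class name AvroSerdeSketch is made up to keep the sketch self-contained:

import java.util.Collections;
import java.util.Map;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;

public class AvroSerdeSketch {

    static void buildSources(StreamsBuilder builder) {
        // Point the serdes at the Schema Registry (the URL is an assumption).
        final Map<String, Object> serdeConfig = Collections.singletonMap(
                AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

        final SpecificAvroSerde<Vertrag> vertragSerde = new SpecificAvroSerde<>();
        vertragSerde.configure(serdeConfig, false); // false = serde for record values, not keys

        final SpecificAvroSerde<Cashflow> cashflowSerde = new SpecificAvroSerde<>();
        cashflowSerde.configure(serdeConfig, false);

        // Consume the topics with the Avro serdes; the join itself then
        // looks the same as in your code.
        KStream<String, Vertrag> srcVertrag =
                builder.stream("Vertrag", Consumed.with(Serdes.String(), vertragSerde));
        KStream<String, Cashflow> srcCashflow =
                builder.stream("Cashflow", Consumed.with(Serdes.String(), cashflowSerde));
    }
}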
What should I do with the result above?

It is not clear to me what your question is here.
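Regarding the ClassCastException from your update: builder.stream("Vertrag") is called without explicit serdes, so Kafka Streams falls back to the default serdes from your StreamsConfig, which you set to Serdes.String(). The join therefore receives String values and fails when casting them to Vertrag. The usual fix is to pass the matching serdes when creating the source streams; a minimal sketch, reusing the vertragSerde and cashflowSerde you already build:

import org.apache.kafka.streams.kstream.Consumed;

// Read each topic with its matching serdes so that values are deserialized
// into POJOs right away, instead of falling back to the default String
// serdes from the streams configuration.
KStream<String, Vertrag> srcVertrag =
        builder.stream("Vertrag", Consumed.with(Serdes.String(), vertragSerde));
KStream<String, Cashflow> srcCashflow =
        builder.stream("Cashflow", Consumed.with(Serdes.String(), cashflowSerde));

Note that joined.to("Output") has the same problem in the other direction: you also need a serde for MyValueContainer when writing, e.g. via Produced.with(Serdes.String(), myValueContainerSerde), where myValueContainerSerde is a placeholder name for a serde you would build the same way as your other JSON serdes.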