Kafka KSQL - Unable to create streams due to timeouts



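I'm running 3 KSQL servers on Google's K8s, connecting to Kafka (over SSL) and Zookeeper, both hosted on Google Cloud VMs. I can easily create 5 streams and they work fine, but with more than 5 I start getting all kinds of timeouts.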

KSQL Kubernetes configuration (KSQL env vars section):

- name: KSQL_CONFIG_DIR
  value: "/etc/ksqldb"
- name: KSQL_LOG4J_OPTS
  value: "-Dlog4j.configuration=file:/etc/ksqldb/log4j.properties"
- name: KSQL_BOOTSTRAP_SERVERS
  value: ***:9092,***:9092,***:9092
- name: KSQL_KSQL_INTERNAL_TOPIC_REPLICAS
  value: "3"
- name: KSQL_KSQL_SCHEMA_REGISTRY_URL
  value: "http://***"
- name: KSQL_HOST_NAME
  value: prod-ksqldb-server
- name: KSQL_KSQL_SERVICE_ID
  value: "prod-ksqldb-server"
- name: KSQL_LISTENERS
  value: "http://0.0.0.0:8088"
- name: KSQL_CACHE_MAX_BYTES_BUFFERING
  value: "0"
- name: KSQL_SECURITY_PROTOCOL
  value: SSL
- name: KSQL_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM
  value: ""
- name: KSQL_SSL_TRUSTSTORE_LOCATION
  value: /truststore/kafka.truststore.jks
- name: KSQL_SSL_TRUSTSTORE_PASSWORD
  valueFrom:
    secretKeyRef:
      name: kafkassl
      key: truststore_password
- name: KSQL_SSL_KEYSTORE_LOCATION
  value: /keystore/kafkaconnect.keystore.jks
- name: KSQL_SSL_KEYSTORE_PASSWORD
  valueFrom:
    secretKeyRef:
      name: kafkassl
      key: keystore_password
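To double-check which server property each of these variables is meant to set, here is a minimal Python sketch. It assumes the convention described in the ksqlDB Docker documentation: drop the `KSQL_` prefix, lowercase the rest, and replace underscores with dots. The helper name `env_to_property` is hypothetical, and the mapping is only an approximation: variables such as `KSQL_LOG4J_OPTS` appear to be consumed by the startup scripts rather than turned into properties.

```python
def env_to_property(env_name: str, prefix: str = "KSQL_") -> str:
    """Return the property name an env var is assumed to map to.

    Assumption: the container drops the prefix, lowercases the rest,
    and replaces underscores with dots.
    """
    if not env_name.startswith(prefix):
        raise ValueError(f"{env_name!r} does not start with {prefix!r}")
    return env_name[len(prefix):].lower().replace("_", ".")


if __name__ == "__main__":
    # Variable names taken from the configuration above.
    for name in (
        "KSQL_BOOTSTRAP_SERVERS",
        "KSQL_KSQL_SERVICE_ID",
        "KSQL_KSQL_INTERNAL_TOPIC_REPLICAS",
        "KSQL_SECURITY_PROTOCOL",
        "KSQL_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM",
    ):
        print(f"{name} -> {env_to_property(name)}")
```

Under that assumption, `KSQL_KSQL_SERVICE_ID` corresponds to `ksql.service.id` and `KSQL_BOOTSTRAP_SERVERS` to `bootstrap.servers`, which makes it easier to compare the pod configuration against the properties named in the error messages.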

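I get the following errors (some of them occur randomly when I try to create a stream, drop a stream, or run DESCRIBE ... EXTENDED on a stream):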

Timeout while initializing transaction to the KSQL command topic.
If you're running a single Kafka broker, ensure that the following configs are set to 1 on the broker:
- transaction.state.log.replication.factor
- transaction.state.log.min.isr
- offsets.topic.replication.factor
Caused by: Timeout expired after 60000 milliseconds while awaiting
InitProducerId
Failed to guarantee existence of topic ABC
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting
for a node assignment. Call: createTopics
Caused by: Timed out waiting for a node assignment. Call: createTopics
Failed to Describe Kafka Topic(s): [source_topic]
Caused by: Timed out waiting to send the call. Call: describeTopics
ksql> describe ABC extended;
[2021-08-27 11:04:05,458] ERROR Failed to list Kafka consumer groups offsets
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting
for a node assignment. Call: listConsumerGroupOffsets
Caused by: Timed out waiting for a node assignment. Call:
listConsumerGroupOffsets (io.confluent.ksql.cli.console.Console:344)

This is what I found in the KSQL logs:

2021-08-27 13:53:34.055 CEST
[2021-08-27 11:53:34,054] INFO Retrying request. Retry no: 0 Cause: 'org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: findCoordinator' (io.confluent.ksql.util.ExecutorUtil:95) 
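The timeouts above hit several different calls (createTopics, describeTopics, listConsumerGroupOffsets, findCoordinator). To see how they are distributed across a longer log, here is a rough Python sketch that counts timed-out calls by name. The function name `count_timed_out_calls` and the regular expression are illustrative assumptions based only on the message formats quoted in this post.

```python
import re
from collections import Counter

# Matches e.g. "Timed out waiting for a node assignment. Call: createTopics"
# and "Timed out waiting to send the call. Call: describeTopics".
CALL_PATTERN = re.compile(r"Timed out waiting[^.]*\. Call: (\w+)")


def count_timed_out_calls(log_text: str) -> Counter:
    """Count timeout messages per Kafka client call name."""
    # Collapse line wraps so a "Call:" split across two lines still matches.
    flattened = " ".join(log_text.split())
    return Counter(CALL_PATTERN.findall(flattened))


if __name__ == "__main__":
    sample = """\
Caused by: Timed out waiting for a node assignment. Call: createTopics
Caused by: Timed out waiting to send the call. Call: describeTopics
Caused by: Timed out waiting for a node assignment. Call:
listConsumerGroupOffsets (io.confluent.ksql.cli.console.Console:344)
"""
    for call, hits in count_timed_out_calls(sample).most_common():
        print(call, hits)
```

If the timeouts are spread over many unrelated calls, that points more towards the connection between ksqlDB and the brokers than towards one specific operation.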

We checked the Kafka logs and the Zookeeper logs (I suspected something was wrong there because of the getTopics problem) and restarted both. No idea what's wrong.

As I said, the Kafka brokers themselves work fine, and Kafka Connect is working too, with no problems at all...

Github issue: https://github.com/confluentinc/ksql/issues/7953

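Edit:

  1. Checked the network; there seem to be no problems.

    ping 10.164.0.30
    64 bytes from 10.164.0.30: seq=811 ttl=63 time=0.354 ms
    64 bytes from 10.164.0.30: seq=812 ttl=63 time=0.277 ms
    ^C
    --- 10.164.0.30 ping statistics ---
    813 packets transmitted, 813 packets received, 0% packet loss
    round-trip min/avg/max = 0.157/0.275/1.549 ms

  2. The Java heap is 3 GB; according to the GKE dashboard, memory usage is 1.5 GB.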

Also, here is today's situation: I tried to create a stream and got the response "Statement written to command topic". When I then ran SHOW STREAMS, I got:

[2021-08-30 09:49:04,117] ERROR Timed out while waiting for a previous command to execute. command sequence number: 10 (io.confluent.ksql.cli.console.Console:344)
Error: command not executed since the server timed out while waiting for prior commands to finish executing.
If you wish to execute new commands without waiting for prior commands to finish, run the command 'request-pipelining ON'.
Timed out while waiting for a previous command to execute. command sequence number: 10

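This happens after every attempt. When I close the ksql-cli session and run SHOW STREAMS again, the error is gone, but the stream I created does not exist (maybe it is still being created in the background?).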

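Timeouts can have different causes. The first is latency between Kafka and ksql. Network latency or a CPU bottleneck on the Kafka side can make the ksql producer thread wait more than 60000 ms for a response, or delay the consumer heartbeat responses and trigger a consumer rebalance. The second is authorization. If you are using Zookeeper ACLs, make sure the ksql user has CREATE and DESCRIBE_CONFIGS permissions on the cluster, ALL permissions on the _confluent-ksql topics and groups, and WRITE and READ permissions on the topics used in your development.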
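For the latency cause, an ICMP ping does not exercise the broker port itself, so one extra check is to time TCP connections to each broker from inside a ksqlDB pod. The following is a minimal sketch using only the Python standard library. The function names `tcp_connect_ms` and `probe` are hypothetical, and the check covers only the TCP connect, not the TLS handshake or any Kafka request, so a clean result does not rule out a broker-side bottleneck.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int, timeout_s: float = 5.0) -> float:
    """Return the time in milliseconds needed to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s):
        pass  # Connection established; close it straight away.
    return (time.perf_counter() - start) * 1000.0


def probe(brokers, attempts: int = 3, timeout_s: float = 5.0):
    """Time several connects per broker; a failed attempt is recorded as None."""
    results = {}
    for host, port in brokers:
        samples = []
        for _ in range(attempts):
            try:
                samples.append(tcp_connect_ms(host, port, timeout_s))
            except OSError:
                # Covers refused connections, timeouts and DNS failures.
                samples.append(None)
        results[(host, port)] = samples
    return results
```

A call such as `probe([("10.164.0.30", 9092)])`, using the address from the ping test and the port from the bootstrap servers, returns a list of timings per broker. Repeated `None` entries or timings far above the ping round-trip would support the latency explanation.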

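I found a workaround by using ksql.queries.file and running the server in headless mode. It works for me because I don't really need the interactive version, but it does not solve the problem with the command topic.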
