Spark Structured Streaming's checkpoint directory creates four subdirectories. What is each of them used for?
/warehouse/test_topic/checkpointdir1/commits
/warehouse/test_topic/checkpointdir1/metadata
/warehouse/test_topic/checkpointdir1/offsets
/warehouse/test_topic/checkpointdir1/sources
From the StreamExecution class documentation:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
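To make the relationship between the two logs concrete, here is a minimal sketch (not Spark's own code) that compares the highest batch id in `offsets` with the highest one in `commits`. It assumes the checkpoint is on the local file system and that each log entry is a file named after its batch id (`0`, `1`, ...); `CheckpointInspector`, `latestBatchId` and `restartPlan` are hypothetical names made up for this illustration.

```scala
import java.io.File

object CheckpointInspector {

  /** Highest batch id in a log directory, assuming files are named by batch id. */
  def latestBatchId(dir: File): Option[Long] = {
    // listFiles() returns null when the directory does not exist
    val names = Option(dir.listFiles()).getOrElse(Array.empty[File]).map(_.getName)
    // Skip anything that is not purely numeric (e.g. hidden .crc or temp files)
    val ids = names.filter(n => n.nonEmpty && n.forall(_.isDigit)).map(_.toLong)
    if (ids.isEmpty) None else Some(ids.max)
  }

  /** Rough description of what a restart would do, based on the two logs. */
  def restartPlan(checkpoint: File): String = {
    val planned   = latestBatchId(new File(checkpoint, "offsets"))
    val committed = latestBatchId(new File(checkpoint, "commits"))
    (planned, committed) match {
      case (None, _) =>
        "no batch planned yet; start from batch 0"
      case (Some(p), Some(c)) if c >= p =>
        s"batch $c committed; next batch is ${c + 1}"
      case (Some(p), _) =>
        s"batch $p planned but not committed; re-run batch $p with the same offsets"
    }
  }
}
```

For example, `CheckpointInspector.restartPlan(new File("/warehouse/test_topic/checkpointdir1"))` would report a re-run when `offsets` contains a batch id that `commits` does not. For a checkpoint on HDFS or S3 you would have to list the directories through the Hadoop `FileSystem` API instead of `java.io.File`.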
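The metadata log holds information related to the query. For example, in KafkaSource it is used to write the query's starting offsets (one offset per partition).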
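The sources folder contains the initial Kafka offset for each partition. For example, if your Kafka topic has 3 partitions 1, 2, 3 and each one starts at offset 0, it will contain a value like {1:0,2:0,3:0}.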