How to convert a streaming DataFrame into a regular batch DataFrame



I want to convert a streaming DataFrame into a regular DataFrame so I can run many operations on it: count, distinct, and complex queries for real-time analytics.

If you have any ideas on how to convert a streaming DataFrame into a regular DataFrame in Spark, please suggest them.

I think the best option is to write a custom streaming Sink, which gives you access to each batch's DataFrame in addBatch.

Sink is fairly short, so I'm quoting the scaladoc and code here.

/**
 * An interface for systems that can collect the results of a streaming query. In order to preserve
 * exactly once semantics a sink must be idempotent in the face of multiple attempts to add the same
 * batch.
 */
trait Sink {
  /**
   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if
   * this method is called more than once with the same batchId (which will happen in the case of
   * failures), then `data` should only be added once.
   *
   * Note 1: You cannot apply any operators on `data` except consuming it (e.g., `collect/foreach`).
   * Otherwise, you may get a wrong result.
   *
   * Note 2: The method is supposed to be executed synchronously, i.e. the method should only return
   * after data is consumed by sink successfully.
   */
  def addBatch(batchId: Long, data: DataFrame): Unit
}

Also have a look at StreamSinkProvider.
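
Below is a minimal sketch of such a sink, assuming the names BatchAccessSink and BatchAccessSinkProvider (both hypothetical). Note that Sink and StreamSinkProvider are developer/internal APIs (org.apache.spark.sql.execution.streaming and org.apache.spark.sql.sources), so details may vary between Spark versions. Since the scaladoc above forbids applying operators to `data` other than consuming it, the sketch collects the rows and rebuilds a plain batch DataFrame from them before querying:

import scala.collection.JavaConverters._

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical sink that exposes each micro-batch as a regular DataFrame.
class BatchAccessSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Per the scaladoc above, `data` may only be consumed (collect/foreach),
    // so first materialize the rows, then rebuild an ordinary DataFrame on
    // which arbitrary batch operators (distinct, joins, ...) are safe.
    val spark = data.sparkSession
    val batchDf = spark.createDataFrame(data.collect().toSeq.asJava, data.schema)
    println(s"batch $batchId distinct rows: ${batchDf.distinct().count()}")
  }
}

// Provider that Structured Streaming instantiates to create the sink.
class BatchAccessSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def shortName(): String = "batch-access" // hypothetical format name
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new BatchAccessSink
}

You would then start the query with .writeStream.format("batch-access") (the short name only resolves if the provider is registered via META-INF/services; otherwise pass the fully qualified class name of BatchAccessSinkProvider).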

Save the streaming DataFrame to local storage or Kafka, then read it back from there in batch mode.
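
Here is a minimal sketch of this approach using the built-in file sink, assuming a streaming DataFrame named streamingDf and hypothetical paths under /tmp (the parquet file sink requires append output mode):

// Continuously persist the stream as parquet files.
val query = streamingDf.writeStream
  .format("parquet")
  .option("path", "/tmp/stream-out")
  .option("checkpointLocation", "/tmp/stream-out-checkpoint")
  .outputMode("append")
  .start()

// Later, from any batch job: an ordinary DataFrame that supports
// count, distinct, joins, and other complex queries.
val batchDf = spark.read.parquet("/tmp/stream-out")
batchDf.distinct().count()

The same pattern should work with Kafka: write with format("kafka") and read back with spark.read.format("kafka"), which likewise yields a regular batch DataFrame.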
