重新启动火花流应用程序的最佳方法是什么

>我基本上想在我的驱动程序中编写一个事件回调，它将在该事件到达时重新启动 Spark 流应用程序。我的驱动程序正在通过从文件中读取配置来设置流和执行逻辑。每当文件更改（添加新配置）时，驱动程序都必须按顺序执行以下步骤，

重新启动
读取配置文件（作为主方法的一部分）并
设置流

实现这一目标的最佳方法是什么？

在某些情况下，您可能希望动态地重新加载流上下文（例如重新加载流操作）。在这种情况下，您可以（Scala 示例）：

val sparkContext = new SparkContext()
val stopEvent = false
var streamingContext = Option.empty[StreamingContext]
val shouldReload = false
val processThread = new Thread {
  override def run(): Unit = {
    while (!stopEvent){
      if (streamingContext.isEmpty) {
        // new context
        streamingContext = Option(new StreamingContext(sparkContext, Seconds(1)))
        // create DStreams
          val lines = streamingContext.socketTextStream(...)
        // your transformations and actions
        // and decision to reload streaming context
        // ...
        streamingContext.get.start()
      } else {
        if (shouldReload) {
          streamingContext.get.stop(stopSparkContext = false, stopGracefully = true)
          streamingContext.get.awaitTermination()
          streamingContext = Option.empty[StreamingContext]
        } else {
          Thread.sleep(1000)
        }
      }
    }
    streamingContext.get.stop(stopSparkContext =true, stopGracefully = true)
    streamingContext.get.awaitTermination()
  }
}
// and start it  in separate thread
processThread.start()
processThread.join()

或者在 Python 中：

spark_context = SparkContext()
stop_event = Event()
spark_streaming_context = None
should_reload = False
def process(self):
    while not stop_event.is_set():
        if spark_streaming_context is None:
            # new context
            spark_streaming_context = StreamingContext(spark_context, 0.5)
            # create DStreams
            lines = spark_streaming_context.socketTextStream(...)  
            # your transformations and actions
            # and decision to reload streaming context
            # ...
            self.spark_streaming_context.start()
        else:
            # TODO move to config
            if should_reload:
                spark_streaming_context.stop(stopSparkContext=False, stopGraceFully=True)
                spark_streaming_context.awaitTermination()
                spark_streaming_context = None
            else:
                time.sleep(1)
    else:
        self.spark_streaming_context.stop(stopGraceFully=True)
        self.spark_streaming_context.awaitTermination()

# and start it  in separate thread
process_thread = threading.Thread(target=process)
process_thread.start()
process_thread.join()

如果要防止代码崩溃并从最后一个位置重新启动流式处理上下文，请使用检查点机制。它允许您在失败后还原作业状态。

重新启动Spark的最佳方法实际上是根据您的环境。但始终建议使用spark-submit控制台。

您可以像任何其他linux过程一样将spark-submit过程置于后台，方法是将其放入shell的背景中。在您的情况下，spark-submit 作业实际上随后在 YARN 上运行驱动程序，因此，它是照看已经通过 YARN 在另一台计算机上异步运行的进程。

云时代博客

我们最近探索的一种方式（在这里的Spark聚会中）是通过在Tandem and Spark中使用Zookeeper来实现这一点。简而言之，这使用 Apache Curator 来监视 Zookeeper 上的变化（ZK 配置的更改可以由您的外部事件触发），然后导致侦听器重新启动。

引用的代码库在这里，您会发现配置中的更改会导致观察者（一个火花流应用程序）在正常关闭并重新加载更改后重新启动。希望这是一个指针！

我目前正在按如下方式解决此问题，

通过订阅 MQTT 主题监听外部事件
在 MQTT 回调中，停止流上下文ssc.stop(true,true)这将正常关闭流和底层火花配置
通过创建 Spark 会议和通过读取配置文件设置流

// Contents of startSparkApplication() method
sparkConf = new SparkConf().setAppName("SparkAppName")
ssc = new StreamingContext(sparkConf, Seconds(1))
val myStream = MQTTUtils.createStream(ssc,...)   //provide other options
myStream.print()
ssc.start()

该应用程序构建为 Spring 启动应用程序

在 Scala 中，停止sparkStreamingContext可能涉及停止SparkContext。我发现当接收器挂起时，最好重新启动SparkCintext和SparkStreamingContext。

我相信下面的代码可以写得更优雅，但它允许以编程方式重新启动SparkContext和SparkStreamingContext。完成此操作后，您也可以以编程方式重新启动接收器。

    package coname.utilobjects
import com.typesafe.config.ConfigFactory
import grizzled.slf4j.Logging
import coname.conameMLException
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object SparkConfProviderWithStreaming extends Logging
{
  val sparkVariables: mutable.HashMap[String, Any] = new mutable.HashMap
}

trait SparkConfProviderWithStreaming extends Logging{



  private val keySSC = "SSC"
  private val keyConf = "conf"
  private val keySparkSession = "spark"

  lazy val   packagesversion=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.packagesversion")
  lazy val   sparkcassandraconnectionhost=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkcassandraconnectionhost")
  lazy val   sparkdrivermaxResultSize=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkdrivermaxResultSize")
  lazy val   sparknetworktimeout=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparknetworktimeout")

  @throws(classOf[conameMLException])
  def intitializeSpark(): Unit =
  {
    getSparkConf()
    getSparkStreamingContext()
    getSparkSession()
  }
  @throws(classOf[conameMLException])
  def getSparkConf(): SparkConf = {
    try {
      if (!SparkConfProviderWithStreaming.sparkVariables.get(keyConf).isDefined) {
        logger.info("nnLoading new confnn")
        val conf = new SparkConf().setMaster("local[4]").setAppName("MLPCURLModelGenerationDataStream")
        conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
        conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
        conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
        conf.set("spark.network.timeout", sparknetworktimeout)

        SparkConfProviderWithStreaming.sparkVariables.put(keyConf, conf)
        logger.info("Loaded new conf")
        getSparkConf()
      }
      else {
        logger.info("Returning initialized conf")
        SparkConfProviderWithStreaming.sparkVariables.get(keyConf).get.asInstanceOf[SparkConf]
      }
    }
    catch {
      case e: Exception =>
        logger.error(e.getMessage, e)
        throw new conameMLException(e.getMessage)
    }
  }
  @throws(classOf[conameMLException])
def killSparkStreamingContext
  {
    try
    {
      if(SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined)
        {
          SparkConfProviderWithStreaming.sparkVariables -= keySSC
          SparkConfProviderWithStreaming.sparkVariables -= keyConf
        }
      SparkSession.clearActiveSession()
      SparkSession.clearDefaultSession()
    }
    catch {
      case e: Exception =>
        logger.error(e.getMessage, e)
        throw new conameMLException(e.getMessage)
    }
  }
  @throws(classOf[conameMLException])
  def getSparkStreamingContext(): StreamingContext = {
    try {
      if (!SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined) {
        logger.info("nnLoading new streamingnn")
        SparkConfProviderWithStreaming.sparkVariables.put(keySSC, new StreamingContext(getSparkConf(), Seconds(6)))
        logger.info("Loaded streaming")
        getSparkStreamingContext()
      }
      else {
        SparkConfProviderWithStreaming.sparkVariables.get(keySSC).get.asInstanceOf[StreamingContext]
      }
    }
    catch {
      case e: Exception =>
        logger.error(e.getMessage, e)
        throw new conameMLException(e.getMessage)
    }
  }
  def getSparkSession():SparkSession=
  {
    if(!SparkSession.getActiveSession.isDefined)
    {
      SparkSession.builder.config(getSparkConf()).getOrCreate()
    }
    else
      {
        SparkSession.getActiveSession.get
      }
  }
}

相关内容

最新更新

热门标签：