Spark Streaming - DataFrame collect performance problem



I'm trying to improve a Spark Streaming application for better performance. In each streaming cycle I generate a new dataframe for every record in the topic, and I need to collect a value list from this dataframe for use in the analytical-model stage.

Here are my application steps:

1- Read from kafka
For Loop
    2- Generate a new dataframe by joining static dataframe with new topic dataframe (Columns: key, value)
    3- Collect value list from dataframe (CollectDF function)
    4- Call pmml model
    ...
    2- Generate a new dataframe by joining static dataframe with new topic dataframe (Columns: key, value)
    3- Collect value list from dataframe (CollectDF function)
    4- Call pmml model
    ...

If there are 10 records in the topic, this cycle runs 10 times. At first the CollectDF step takes 1-2 seconds, but after a few cycles of the loop it takes 8-10 seconds.
I really don't understand how this is possible. How can I keep the processing time stable?

kafkaStream.foreachRDD(rdd => {
  stream_df.collect().foreach { row =>
    ...
    // Build the feature list for this record and score it with the pmml model
    val model_feature_list = CollectDF(df_model)
    val predictions = model.predict(model_feature_list)
  }
})

import scala.collection.immutable.SortedMap

def CollectDF(df_modelparam: DataFrame): Array[Int] = {
  // Collect the (key, value) rows to the driver, sort them by key,
  // and return only the values as the feature array for the model
  val x: SortedMap[String, Int] = SortedMap(
    df_modelparam.collect.map { r =>
      val key = r(0).toString
      val value = r(1).toString.toInt
      key -> value
    }: _*
  )
  x.values.toArray
}

Thanks in advance

May I know the reason for collecting the data to the driver?

Ideally, you should avoid using the collect() function in Spark Streaming use cases, since it is a costly operation that can slow things down.

Instead of collecting the data to the driver, perhaps you can try something like the following on the streaming dataframe itself.

streamingDF.mapPartitions { rowIterator =>
  rowIterator.map { row =>
    val key = row(0).toString
    val value = row(1).toString.toInt
    // analytical use case on the (key, value) pair being created
    key -> value
  }
}
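
To make that concrete, here is a minimal runnable sketch of the same mapPartitions pattern with the scoring step moved onto the executors. The PartitionModel class and its predict signature are placeholders standing in for however the pmml evaluator is actually loaded in your application; only the structure matters (load the model once per partition, score each row there, and never collect() to the driver).

import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder for the real pmml evaluator: created once per partition so the
// (potentially expensive) model setup is not repeated for every record.
class PartitionModel extends Serializable {
  def predict(features: Array[Int]): Double = features.sum.toDouble // dummy scoring logic
}

def scoreOnExecutors(spark: SparkSession, joinedDF: DataFrame): DataFrame = {
  import spark.implicits._
  joinedDF.mapPartitions { rows =>
    val model = new PartitionModel()        // one model instance per partition
    rows.map { row =>
      val key   = row(0).toString
      val value = row(1).toString.toInt     // same parsing as in CollectDF
      (key, model.predict(Array(value)))    // score on the executor, no collect()
    }
  }.toDF("key", "prediction")
}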
