What is the best way to create Spark DataFrames with the Spark Testing Base library?

I am writing a unit test for a Spark method that takes several DataFrames as input parameters and returns one DataFrame. The code of the Spark method is as follows:

class Processor {
    def process(df1: DataFrame, df2: DataFrame): DataFrame = {
      // process and return resulting data frame
    }
}
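The actual processing logic is not shown in the question. Purely as a hypothetical stand-in, so that the tests below have something concrete to exercise, one could imagine a join on the shared integer column that keeps only rows whose last column is "spark" (this logic is an assumption, not the real method):

// Hypothetical stand-in for the elided logic -- an assumption for illustration only.
// Assumes tuple-based frames with default column names _1, _2, ... as in the tests below.
class Processor {
  def process(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, df1("_2") === df2("_2"))
       .filter(df2("_4") === "spark")
}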

The existing code for the corresponding unit test is as follows:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.{FlatSpec, Matchers}

class TestProcess extends FlatSpec with DataFrameSuiteBase with Matchers {
  val p: Processor = new Processor

  "process()" should "return only one row" in {
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()
    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1DF, df2DF)
  }

  it should "return spark row" in {
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()
    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1DF, df2DF)
  }
}

This code runs fine, but there is a problem: the code that creates the RDDs and DataFrames is repeated in every test method. When I try to create the RDDs outside the test methods, or in the beforeAndAfterAll() method, I get errors about sc. It seems the Spark Testing Base library only makes the sc and spark variables available inside the test methods.

I would like to know whether there is any way to avoid writing this duplicated code?
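One workaround that may avoid the duplication (a sketch, not part of the original post) is to declare the fixtures as lazy vals: they are then only evaluated the first time a test body touches them, by which point spark-testing-base has already initialized sc and spark during its setup. The suite name below is illustrative:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.{FlatSpec, Matchers}

class TestProcessLazy extends FlatSpec with DataFrameSuiteBase with Matchers {
  val p: Processor = new Processor

  // lazy: not evaluated at construction time, so sc and spark are ready when first used
  lazy val df1DF =
    spark.createDataFrame(sc.parallelize(Seq(("a", 12, 98999), ("b", 42, 99)))).toDF()
  lazy val df2DF =
    spark.createDataFrame(sc.parallelize(Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm")))).toDF()

  "process()" should "return only one row" in {
    p.process(df1DF, df2DF).count() should equal(1)
  }

  it should "return spark row" in {
    // content assertions would go here, reusing the same lazy fixtures
    p.process(df1DF, df2DF)
  }
}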


Update: code after switching from FlatSpec to WordSpec

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalamock.scalatest.MockFactory
import org.scalatest.{Matchers, WordSpec}

class TestProcess extends WordSpec with DataFrameSuiteBase with Matchers {
  val p: Processor = new Processor

  "process()" should {
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()
    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1DF, df2DF)

    "return only one row" in {
      result.count() should equal(1)
    }

    "return spark row" in {
      // assertions to check whether a row containing "spark" in the last column is in the result
    }
  }
}

I am using WordSpec instead of FlatSpec because it allows the common initialization to be grouped ahead of the test clauses, like this:

"process()" should {
     val df1RDD = sc.parallelize(Seq(("a", 12, 98999), ("b", 42, 99)))
     val df1DF = spark.createDataFrame(df1RDD).toDF()
     val df2RDD = sc.parallelize(Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm")))
     val df2DF = spark.createDataFrame(df2RDD).toDF()

     "return only one row" in {
         ....
     }
     "return spark row" in {
         ....
     }
}
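Another option, regardless of ScalaTest style, might be to build the shared frames in an overridden beforeAll(), calling super.beforeAll() first so that spark-testing-base can start Spark before the frames are created. A sketch, assuming DataFrameSuiteBase wires its setup through ScalaTest's BeforeAndAfterAll:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.{FlatSpec, Matchers}

class TestProcessBeforeAll extends FlatSpec with DataFrameSuiteBase with Matchers {
  val p: Processor = new Processor
  var df1DF: DataFrame = _
  var df2DF: DataFrame = _

  override def beforeAll(): Unit = {
    super.beforeAll() // let spark-testing-base start sc and spark first
    df1DF = spark.createDataFrame(sc.parallelize(Seq(("a", 12, 98999), ("b", 42, 99)))).toDF()
    df2DF = spark.createDataFrame(sc.parallelize(Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm")))).toDF()
  }

  "process()" should "return only one row" in {
    p.process(df1DF, df2DF).count() should equal(1)
  }
}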

EDIT: Also, the following two lines of code pretty much defeat the purpose of using the library (Spark Testing Base) at all:

val spark = SparkSession.builder.master("local[1]").getOrCreate
val sc = spark.sparkContext

Add them to the top of your class and you will have both the SparkContext and the SparkSession all set up, with no NPEs.
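To illustrate that point, here is a sketch of a self-managed WordSpec suite that drops the library entirely; the suite name is illustrative, and in practice you would also want to stop the session after the tests:

import org.apache.spark.sql.SparkSession
import org.scalatest.{Matchers, WordSpec}

class TestProcessStandalone extends WordSpec with Matchers {
  // the two lines in question: enough to get a working spark and sc
  val spark = SparkSession.builder.master("local[1]").getOrCreate
  val sc = spark.sparkContext
  val p = new Processor

  "process()" should {
    val df1DF = spark.createDataFrame(sc.parallelize(Seq(("a", 12, 98999), ("b", 42, 99)))).toDF()
    val df2DF = spark.createDataFrame(sc.parallelize(Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm")))).toDF()
    val result = p.process(df1DF, df2DF)

    "return only one row" in {
      result.count() should equal(1)
    }
  }
}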

EDIT: I have just confirmed through my own testing that spark-testing-base does not play well with WordSpec. If you still want to use it, consider opening a bug report with the library's author, because this is definitely a problem with spark-testing-base.
