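I am writing unit tests for a Spark method that takes multiple DataFrames as input parameters and returns a single DataFrame. The code for the Spark method looks like this: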
class Processor {
  def process(df1: DataFrame, df2: DataFrame): DataFrame = {
    // process and return resulting data frame
    ??? // implementation omitted
  }
}
The existing code for the corresponding unit test is as follows:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalatest.{FlatSpec, Matchers}

class TestProcess extends FlatSpec with DataFrameSuiteBase with Matchers {
  val p: Processor = new Processor

  "process()" should "return only one row" in {
    // Rows are tuples so that Spark can infer a schema
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()
    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1DF, df2DF)
  }

  it should "return spark row" in {
    // Same setup repeated: this is the duplication in question
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()
    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1DF, df2DF)
  }
}
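This code works, but the code that creates the RDDs and DataFrames is repeated in every test method. When I try to create the RDDs outside the test methods, or in a beforeAll() method, I get errors about sc. It seems that the Spark Testing Base library makes the sc and spark variables available only inside test methods.
Is there any way to avoid writing this duplicated code?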
Updated code after switching from FlatSpec to WordSpec:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.scalamock.scalatest.MockFactory
import org.scalatest.{Matchers, WordSpec}

class TestProcess extends WordSpec with DataFrameSuiteBase with Matchers {
  val p: Processor = new Processor

  "process()" should {
    // Shared setup for all clauses in this group
    val df1RDD = sc.parallelize(Seq(
      ("a", 12, 98999),
      ("b", 42, 99)
    ))
    val df1DF = spark.createDataFrame(df1RDD).toDF()
    val df2RDD = sc.parallelize(Seq(
      ("X", 12, "foo", "spark"),
      ("Z", 42, "bar", "storm")
    ))
    val df2DF = spark.createDataFrame(df2RDD).toDF()
    val result = p.process(df1DF, df2DF)

    "return only one row" in {
      result.count should equal(1)
    }
    "return spark row" in {
      // assertions to check if 'row' containing 'spark' in last column is in the result or not
    }
  }
}
Use WordSpec instead of FlatSpec, because it lets you group common initialization ahead of the test clauses, like this:
"process()" should {
  val df1RDD = sc.parallelize(Seq(("a", 12, 98999), ("b", 42, 99)))
  val df1DF = spark.createDataFrame(df1RDD).toDF()
  val df2RDD = sc.parallelize(Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm")))
  val df2DF = spark.createDataFrame(df2RDD).toDF()

  "return only one row" in {
    ....
  }
  "return spark row" in {
    ....
  }
}
Edit: Also, the following two lines of code hardly justify pulling in a library (Spark Testing Base):
val spark = SparkSession.builder.master("local[1]").getOrCreate
val sc = spark.sparkContext
Add them at the top of your class and you are all set: you get a SparkContext and everything else, with no NPE.
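To make that suggestion concrete, here is a minimal sketch of a WordSpec suite that owns its SparkSession. The Processor below is a hypothetical stand-in (a join on an assumed "key" column followed by a filter), and the column names passed to toDF are invented for illustration; neither comes from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.{Matchers, WordSpec}

// Hypothetical stand-in for the real Processor:
// joins on "key" and keeps rows whose "engine" column is "spark".
class Processor {
  def process(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, "key").filter(col("engine") === "spark")
}

class TestProcess extends WordSpec with Matchers {
  // Created when the suite is constructed, so it is already
  // available while the `should { ... }` block is being registered.
  val spark: SparkSession = SparkSession.builder.master("local[1]").getOrCreate()
  val sc = spark.sparkContext

  val p = new Processor

  "process()" should {
    // Shared fixtures, built once for every clause in this group.
    // Column names are assumptions made for this sketch.
    val df1 = spark
      .createDataFrame(sc.parallelize(Seq(("a", 12, 98999), ("b", 42, 99))))
      .toDF("id", "key", "value")
    val df2 = spark
      .createDataFrame(sc.parallelize(Seq(("X", 12, "foo", "spark"), ("Z", 42, "bar", "storm"))))
      .toDF("name", "key", "tag", "engine")
    val result = p.process(df1, df2)

    "return only one row" in {
      result.count() should equal(1L)
    }
    "return spark row" in {
      result.select("engine").collect().map(_.getString(0)) should contain ("spark")
    }
  }
}
```

The sketch never stops the session; in a larger test run you would probably want to decide where spark.stop() belongs.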
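Edit: I have just confirmed with a test of my own that Spark Testing Base does not work well with WordSpec. If you still want to use it, consider opening a bug report with the library's author, because this is definitely a problem in Spark Testing Base.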
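A likely explanation, sketched here without Spark: ScalaTest runs the body of a WordSpec `should { ... }` block while the suite is being constructed, and this sketch assumes that Spark Testing Base only creates its context later, in beforeAll() (an assumption about the library, not something confirmed in this thread). All names below are hypothetical stand-ins, with a String playing the role of the SparkContext.

```scala
import scala.util.Try

// Hypothetical stand-in for a base trait that creates its context late.
trait LateContext {
  private var _sc: String = _                 // null until beforeAll() runs
  def sc: String = _sc
  def beforeAll(): Unit = { _sc = "context" }
}

// Eager: the field is computed in the constructor, before beforeAll(),
// which mirrors fixtures built directly inside `should { ... }`.
class EagerSuite extends LateContext {
  val fixture: Int = sc.length                // sc is still null here
}

// Lazy: the field is computed on first access, after beforeAll().
class LazySuite extends LateContext {
  lazy val fixture: Int = sc.length
}

object LifecycleDemo {
  // Constructing the eager suite fails with a NullPointerException.
  def eagerFails: Boolean = Try(new EagerSuite).isFailure

  // The lazy suite works once the lifecycle hook has run.
  def lazyWorks: Int = {
    val suite = new LazySuite
    suite.beforeAll()
    suite.fixture
  }
}
```

If that assumption holds, declaring the shared DataFrames as lazy val members of the suite might be another way to avoid the duplication while keeping the library, since they would then be evaluated on first use inside a test.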