What does withReplacement do when it is specified for sample on a Spark DataFrame?



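I was reading the Spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample

There is a boolean parameter, withReplacement, with not much explanation.

    sample(withReplacement, fraction, seed=None)

What is it, and how do we use it?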

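The withReplacement parameter controls whether the same element can appear more than once in the result of sample. If we think of the dataset as a bucket of balls, withReplacement=true means that we take a random ball out of the bucket and then put it back in. This means the same ball can be picked again.

Assuming all elements in the dataset are unique:

  • withReplacement=true: the same element can be produced more than once in the result of sample.

  • withReplacement=false: each element of the dataset is sampled at most once.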

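    import spark.implicits._

    val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
    df.show()
    // Arguments: withReplacement, fraction, seed
    df.sample(true, 0.5, 5).show()   // with replacement: a row may repeat
    df.sample(false, 0.5, 5).show()  // without replacement: each row at most once

Result: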

    +---+
    |ids|
    +---+
    |  1|
    |  2|
    |  3|
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    | 10|
    +---+
    +---+
    |ids|
    +---+
    |  6|
    |  7|
    |  7|
    |  9|
    | 10|
    +---+
    +---+
    |ids|
    +---+
    |  1|
    |  3|
    |  7|
    |  8|
    |  9|
    +---+
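
The bucket-of-balls analogy can also be sketched in plain Scala, without Spark. This is only an illustration of the idea, not how Spark implements sample: the object and method names (`UrnSampling`, `drawWithReplacement`, `drawWithoutReplacement`) are made up for this example, and unlike Spark's fraction-based sampling it draws an exact number of elements.

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not part of Spark.
object UrnSampling {
  // Pick k balls, putting each ball back before the next pick,
  // so the same ball can be picked more than once.
  def drawWithReplacement[A](bucket: Vector[A], k: Int, rng: Random): Vector[A] = {
    require(bucket.nonEmpty, "bucket must not be empty")
    Vector.fill(k)(bucket(rng.nextInt(bucket.length)))
  }

  // Pick k balls and keep them out of the bucket,
  // so every ball appears at most once.
  def drawWithoutReplacement[A](bucket: Vector[A], k: Int, rng: Random): Vector[A] =
    rng.shuffle(bucket).take(k)

  def main(args: Array[String]): Unit = {
    val ids = Vector(1, 2, 3, 5, 6, 7, 8, 9, 10)
    val rng = new Random(5L)
    println(drawWithReplacement(ids, 5, rng))    // may contain repeated ids
    println(drawWithoutReplacement(ids, 5, rng)) // never contains repeated ids
  }
}
```

Drawing more balls than the bucket holds is only possible with replacement; `drawWithoutReplacement` simply returns the whole bucket in random order when k exceeds its size.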
    

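This is actually mentioned in the Spark documentation as of version 2.3: https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample

withReplacement – Sample with replacement or not (default False).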

    import spark.implicits._

    case class Member(id: Int, name: String, role: String)
    val member1 = Member(1, "User1", "Data Engineer")
    val member2 = Member(2, "User2", "Software Engineer")
    val member3 = Member(3, "User3", "DevOps Engineer")
    val memberDF = Seq(member1, member2, member3).toDF()

    // No seed is passed, so repeated calls can return different rows.
    memberDF.sample(true, 0.4).show()
    +---+-----+-----------------+
    | id| name|             role|
    +---+-----+-----------------+
    |  1|User1|    Data Engineer|
    |  2|User2|Software Engineer|
    +---+-----+-----------------+

    memberDF.sample(true, 0.4).show()
    +---+-----+---------------+
    | id| name|           role|
    +---+-----+---------------+
    |  3|User3|DevOps Engineer|
    +---+-----+---------------+

    memberDF.sample(true, 0.4).show()
    +---+-----+-----------------+
    | id| name|             role|
    +---+-----+-----------------+
    |  2|User2|Software Engineer|
    |  3|User3|  DevOps Engineer|
    +---+-----+-----------------+
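
The three runs above return two, one, and two rows even though the fraction is fixed at 0.4. A plausible way to see why is to model the sampling as a per-row decision. As far as I know, Spark's samplers work roughly this way (a Bernoulli trial per row without replacement, a Poisson-distributed copy count per row with replacement), but the code below is a simplified sketch under that assumption, not Spark's actual implementation, and `PerRowSampling` and its methods are hypothetical names.

```scala
import scala.util.Random

// Hypothetical, simplified model for illustration only; not Spark code.
object PerRowSampling {
  // Without replacement: keep each row independently with probability
  // `fraction`, so a row appears 0 or 1 times.
  def sampleWithoutReplacement[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    rows.filter(_ => rng.nextDouble() < fraction)

  // Poisson-distributed count with the given mean
  // (Knuth's multiplication method, adequate for small means).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // With replacement: emit each row a Poisson(fraction) number of times,
  // so a row appears 0, 1, 2, ... times.
  def sampleWithReplacement[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    rows.flatMap(row => Seq.fill(poisson(fraction, rng))(row))

  def main(args: Array[String]): Unit = {
    val rows = Seq("User1", "User2", "User3")
    val rng = new Random()
    // The number of rows printed varies from run to run.
    println(sampleWithReplacement(rows, 0.4, rng))
    println(sampleWithoutReplacement(rows, 0.4, rng))
  }
}
```

In this model the fraction sets only the expected size of the result, not the exact size, which matches the varying row counts in the output above. With only three rows and a fraction of 0.4, duplicates are possible with replacement but not very likely in any single run.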
