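Reading the Spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
There is a boolean parameter, withReplacement, that comes with little explanation:
sample(withReplacement, fraction, seed=None)
What is it, and how do we use it?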
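withReplacement controls whether the result of sample can contain the same element more than once. If we think of the dataset as a bucket of balls, withReplacement=true means taking a random ball out of the bucket and then putting it back, so the same ball can be picked again.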
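The bucket analogy can be sketched in plain Scala, without Spark. This is only an illustration of the two drawing modes, not how Spark implements sample; the object name BucketDemo and both helper functions are hypothetical names made up for this sketch.

```scala
import scala.util.Random

object BucketDemo {
  // Draw n balls, putting each ball back before the next draw,
  // so the same ball may be picked more than once.
  // Assumes the bucket is non-empty.
  def drawWithReplacement[A](bucket: Vector[A], n: Int, rng: Random): Vector[A] =
    Vector.fill(n)(bucket(rng.nextInt(bucket.length)))

  // Draw n balls without putting any back: shuffle the bucket and
  // take the first n, so each ball is picked at most once.
  def drawWithoutReplacement[A](bucket: Vector[A], n: Int, rng: Random): Vector[A] =
    rng.shuffle(bucket).take(n)

  def main(args: Array[String]): Unit = {
    val bucket = Vector(1, 2, 3, 5, 6, 7, 8, 9, 10)
    val rng = new Random(5L)
    println(drawWithReplacement(bucket, 5, rng))    // may contain repeats
    println(drawWithoutReplacement(bucket, 5, rng)) // never contains repeats
  }
}
```

Unlike this sketch, Spark's sample takes a fraction rather than an exact count, so the number of rows it returns is not guaranteed to be exactly fraction times the size of the dataset.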
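Assuming all elements in the dataset are unique:
withReplacement=true: the same element can appear more than once in the result of sample.
withReplacement=false: each element of the dataset is sampled at most once.

import spark.implicits._

val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
df.show()
// with replacement: an id may appear more than once
df.sample(true, 0.5, 5).show
// without replacement: each id appears at most once
df.sample(false, 0.5, 5).show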
Result:
+---+
|ids|
+---+
|  1|
|  2|
|  3|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
+---+

+---+
|ids|
+---+
|  6|
|  7|
|  7|
|  9|
| 10|
+---+

+---+
|ids|
+---+
|  1|
|  3|
|  7|
|  8|
|  9|
+---+
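This is actually mentioned in the Spark documentation as of version 2.3: https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
withReplacement - Sample with replacement or not (default False).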
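To check the duplicate behaviour directly rather than by eye, one option is to count how often each id occurs in a sample. The following is a minimal sketch under some assumptions: it runs against a local SparkSession (the master and app name are placeholders), and the object SampleDuplicates and the helper duplicatedIds are hypothetical names introduced here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SampleDuplicates {
  // Return the ids that occur more than once in the given DataFrame.
  // Assumes an integer column named "ids".
  def duplicatedIds(df: DataFrame): Array[Int] =
    df.groupBy("ids").count()
      .filter(col("count") > 1)
      .select("ids")
      .collect()
      .map(_.getInt(0))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")

    val withRepl = df.sample(withReplacement = true, fraction = 0.5, seed = 5L)
    val withoutRepl = df.sample(withReplacement = false, fraction = 0.5, seed = 5L)

    // May list ids that were drawn more than once.
    println(duplicatedIds(withRepl).mkString(", "))
    // Always empty here, because the input ids are unique.
    println(duplicatedIds(withoutRepl).mkString(", "))

    spark.stop()
  }
}
```

Which ids end up in each sample depends on the seed, the Spark version, and how the data is partitioned, so the output may not match the tables shown earlier.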
case class Member(id: Int, name: String, role: String)
val member1 = new Member(1, "User1", "Data Engineer")
val member2 = new Member(2, "User2", "Software Engineer")
val member3 = new Member(3, "User3", "DevOps Engineer")
val memberDF = Seq(member1, member2, member3).toDF
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name|             role|
+---+-----+-----------------+
|  1|User1|    Data Engineer|
|  2|User2|Software Engineer|
+---+-----+-----------------+
memberDF.sample(true, 0.4).show
+---+-----+---------------+
| id| name|           role|
+---+-----+---------------+
|  3|User3|DevOps Engineer|
+---+-----+---------------+
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name|             role|
+---+-----+-----------------+
|  2|User2|Software Engineer|
|  3|User3|  DevOps Engineer|
+---+-----+-----------------+