PySpark: how to group by user and sample at a given positive-to-negative ratio



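I have a DataFrame in which the ratio of positive to negative rows is less than 1:100, and I want to randomly sample it per user so that positives to negatives come out at 1:5.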

For example:

userid     label    date 
0          1        0708
0          0        0703
1          1        0702
0          0        0701
1          1        0700
1          0        0704
1          0        0705
0          0        0706
1          0        0708
0          0        0710
1          0        0711
0          0        0713
0          0        0714
0          0        0715
0          0        0717
0          0        0718
1          0        0711
1          0        0722
1          0        0715
..., ...
..., ...
# after random sampling per user at a 1:5 positive-to-negative ratio
userid     label    date 
0          1        0708
0          0        0703
0          0        0701
0          0        0715
0          0        0717
0          0        0718
1          1        0702
1          0        0704
1          0        0705
1          0        0711
1          0        0722
1          0        0715

Can anyone help me with some hints? Thanks in advance.

Assuming your starting DataFrame is called df:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Split the data into negative (label 0) and positive (label 1) rows
df_0 = df.filter(F.col('label') == 0)
df_1 = df.filter(F.col('label') == 1)
# Window that groups rows of the same userid and orders them randomly
window_random = Window.partitionBy('userid').orderBy(F.rand())
# Negatives: number the rows in random order and keep at most 5 per user
# (row_number never ties, so the cap of 5 is exact)
data_0 = df_0.withColumn('rank', F.row_number().over(window_random)).filter(F.col('rank') <= 5).drop('rank')
# Positives: number the rows in random order and keep at most 1 per user
data_1 = df_1.withColumn('rank', F.row_number().over(window_random)).filter(F.col('rank') <= 1).drop('rank')
# Finally, union both results
final_result = data_1.union(data_0)
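
To check the Spark result on a small collected sample, the same per-user logic can be mirrored in plain Python. This is only a sketch: the function name `sample_per_user` and the `(userid, label, date)` tuple layout are assumptions made for illustration, not part of the answer above.

```python
import random
from collections import defaultdict


def sample_per_user(rows, n_pos=1, n_neg=5, seed=None):
    """Keep at most n_pos positive and n_neg negative rows per user.

    rows: iterable of (userid, label, date) tuples (assumed layout),
    where label 1 is positive and label 0 is negative.
    """
    rng = random.Random(seed)

    # Group rows by (userid, label), like partitioning each label's DataFrame by userid
    groups = defaultdict(list)
    for row in rows:
        userid, label, _date = row
        groups[(userid, label)].append(row)

    sampled = []
    for (userid, label), members in sorted(groups.items()):
        limit = n_pos if label == 1 else n_neg
        # Random order within the group, then keep the first `limit` rows
        rng.shuffle(members)
        sampled.extend(members[:limit])
    return sampled
```

Calling `sample_per_user(rows, seed=3)` on a list of such tuples returns at most one positive and five negatives for each userid; users with fewer rows than the limits simply keep everything they have.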

I find your sample too small to give anything quantitative. I would try sampleBy.

# Build the label-to-fraction map

from pyspark.sql import functions as F

frac = df.select("label").distinct().withColumn("frac", F.when(F.col('label') == 1, F.lit(0.1)).otherwise(F.lit(0.5))).rdd.collectAsMap()
print(frac)

# Sample

sampled_df = df.sampleBy("label", frac, seed=3)
sampled_df.show()
+------+-----+----+
|userid|label|date|
+------+-----+----+
|     0|    0| 701|
|     1|    1| 700|
|     1|    0| 704|
|     0|    0| 706|
|     0|    0| 710|
|     1|    0| 711|
|     0|    0| 713|
|     0|    0| 715|
|     0|    0| 718|
|     1|    0| 711|
+------+-----+----+
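
Note that `sampleBy` keeps each row independently with the fraction given for its label, so the resulting ratio is only approximate, and it stratifies by label over the whole DataFrame rather than per user. If the goal is an overall ratio of about 1:5, the fractions could be derived from the label counts instead of being hard-coded. A minimal sketch, assuming the counts are already known (for instance from `df.groupBy("label").count()`); the helper name `fractions_for_ratio` is hypothetical:

```python
def fractions_for_ratio(n_pos, n_neg, neg_per_pos=5):
    """Return a {label: fraction} map that keeps every positive row and
    roughly neg_per_pos negative rows for each positive one."""
    if n_neg == 0:
        # No negatives to sample from
        return {1: 1.0, 0: 0.0}
    # Expected number of negatives kept is neg_fraction * n_neg;
    # cap at 1.0 because a fraction cannot exceed the whole stratum
    neg_fraction = min(1.0, neg_per_pos * n_pos / n_neg)
    return {1: 1.0, 0: neg_fraction}


# Example: 100 positives and 20,000 negatives
frac = fractions_for_ratio(100, 20000)
print(frac)  # {1: 1.0, 0: 0.025}
```

The resulting dictionary could then be passed as `df.sampleBy("label", fractions=frac, seed=3)`.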
