使用pyspark插补负值的有效代码



我正在处理一个数据集,该数据集包含有关该特定商品的销售数量的项目-日期信息。然而,我打算估算的"售出数量"栏中有一些负值。这里使用的逻辑是用日期级别的每件商品的销售数量模式来代替这些负值。我已经计算了售出数量的每个不同价值,并获得了特定物品在每个给定日期的最大售出数量。然而,我找不到一个函数,可以将负值替换为每个商品的最大销售数量*日期组合。我相对来说比pyspark新。在这种情况下,哪种方法是最好的?

根据您提供的有限信息,您可以尝试以下操作-


from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
input_list = [
(1,10,"2019-11-07")
,(1,5,"2019-11-07")
,(1,5,"2019-11-07")
,(1,5,"2019-11-08")
,(1,6,"2019-11-08")
,(1,7,"2019-11-09")
,(1,7,"2019-11-09")
,(1,8,"2019-11-09")
,(1,8,"2019-11-09")
,(1,8,"2019-11-09")
,(1,-10,"2019-11-09")
,(2,10,"2019-11-07")
,(2,3,"2019-11-07")
,(2,9,"2019-11-07")
,(2,9,"2019-11-08")
,(2,-10,"2019-11-08")
,(2,5,"2019-11-09")
,(2,5,"2019-11-09")
,(2,2,"2019-11-09")
,(2,2,"2019-11-09")
,(2,2,"2019-11-09")
,(2,-10,"2019-11-09")
]

sparkDF = sql.createDataFrame(input_list,['product_id','sold_qty','date'])
sparkDF = sparkDF.withColumn('date',F.to_date(F.col('date'), 'yyyy-MM-dd'))

模式实施

#### Mode Implemention 
modeDF = sparkDF.groupBy('date', 'sold_qty')
.agg(F.count(F.col('sold_qty')).alias('mode_count'))
.select(F.col('date'),F.col('sold_qty').alias('mode_sold_qty'),F.col('mode_count'))
window = Window.partitionBy("date").orderBy(F.desc("mode_count"))
#### Filtering out the most occurred value 
modeDF = modeDF
.withColumn('order', F.row_number().over(window))
.where(F.col('order') == 1)

与基本数据帧合并以估算

sparkDF = sparkDF.join(modeDF
,sparkDF['date'] == modeDF['date']
,'inner'
).select(sparkDF['*'],modeDF['mode_sold_qty'],modeDF['mode_count'])

sparkDF = sparkDF.withColumn('imputed_sold_qty',F.when(F.col('sold_qty') < 0,F.col('mode_sold_qty'))
.otherwise(F.col('sold_qty')))

>>> sparkDF.show(100)
+----------+--------+----------+-------------+----------+----------------+      
|product_id|sold_qty|      date|mode_sold_qty|mode_count|imputed_sold_qty|
+----------+--------+----------+-------------+----------+----------------+
|         1|       7|2019-11-09|            2|         3|               7|
|         1|       7|2019-11-09|            2|         3|               7|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|     -10|2019-11-09|            2|         3|               2|
|         2|       5|2019-11-09|            2|         3|               5|
|         2|       5|2019-11-09|            2|         3|               5|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|     -10|2019-11-09|            2|         3|               2|
|         1|       5|2019-11-08|            9|         1|               5|
|         1|       6|2019-11-08|            9|         1|               6|
|         2|       9|2019-11-08|            9|         1|               9|
|         2|     -10|2019-11-08|            9|         1|               9|
|         1|      10|2019-11-07|            5|         2|              10|
|         1|       5|2019-11-07|            5|         2|               5|
|         1|       5|2019-11-07|            5|         2|               5|
|         2|      10|2019-11-07|            5|         2|              10|
|         2|       3|2019-11-07|            5|         2|               3|
|         2|       9|2019-11-07|            5|         2|               9|
+----------+--------+----------+-------------+----------+----------------+

最新更新