按列分组,并使用value_counts字典PYSPARK创建一列



1

我有一张有客户购买历史的销售表。我想制作一个按客户分组的新数据帧。数据框架还应包括一列,其中包含客户购买的所有产品的value_counts字典以及每个产品的数量

这就是我的数据帧现在的样子

Description                                     Counts
CustomerID
3004000304    MAJOR APPLIANCES,HOME OFFICE, OTHER STUFF          3
3004000304    HOME OFFICE, MAJOR APPLIANCES                      2
3004000304    ACCESSORIES, OTHER STUFF                           2
3004002756    MAJOR APPLIANCES, ACCESSORIES                      2
3004002946    HOME OFFICE, HOME OFFICE                           2
3004002946    ACCESSORIES, MAJOR APPLIANCES                      2
3004002946    MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES         3 

我想要这个

Counts
CustomerID
3004000304    {'MAJOR APPLIANCES': 2, 'HOME OFFICE': 2, 'ACCESSORIES': 1, 'OTHER STUFF':2}
3004002756    {'MAJOR APPLIANCES': 1, 'ACCESSORIES': 1}
3004002946    {'HOME OFFICE': 2, 'ACCESSORIES': 2, 'MAJOR APPLIANCES': 1,'OTHER STUFF':1}

我的方法我将PYSPARK DF转换为Pandas-DF,因为我对PYSPARK没有太多经验,而且是新手。如果能得到PYSPARK的直接帮助,我将不胜感激,否则即使是Pandas的实现也会很棒。

f = lambda x: dict(zip(x['Description'], x['Counts']))
df = categories.groupby(level=0).apply(f).to_frame('Counts')
print (df)

这并没有给我预期的结果

对于Pyspark解决方案,您可以参考这个SO答案,

代码将作为

df.groupBy("CustomerID").agg(
F.map_from_entries(
F.collect_list(
F.struct("Description", "Counts"))).alias("Description_Counts")
).show(truncate=False)

请注意,map_from_entries仅适用于spark版本>=2.4.0

编辑-根据注释,你可以使用下面的代码,我首先分解列中的每个值,然后根据它们的计数对它们进行分组

dft=spark.createDataFrame([(3004000304,"MAJOR APPLIANCES, HOME OFFICE, OTHER STUFF",3),(3004000304,"HOME OFFICE, MAJOR APPLIANCES",2),(3004000304,"ACCESSORIES, OTHER STUFF",2),(3004002756,"MAJOR APPLIANCES, ACCESSORIES",2),(3004002946,"HOME OFFICE, HOME OFFICE",2),(3004002946,"ACCESSORIES, MAJOR APPLIANCES",2),(3004002946,"MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES",3)],"cid long, descr string, count int")
dft.withColumn("descr",F.explode(F.split("descr",", "))).groupBy("cid","descr").agg(F.count("descr").alias("count")).groupBy("cid").agg(
F.map_from_entries(
F.collect_list(
F.struct("descr", "count"))).alias("Description_Counts")
).show(truncate=False)
# output
+----------+-----------------------------------------------------------------------------+
|cid       |Description_Counts                                                           |
+----------+-----------------------------------------------------------------------------+
|3004002946|{HOME OFFICE -> 2, MAJOR APPLIANCES -> 2, ACCESSORIES -> 2, OTHER STUFF -> 1}|
|3004000304|{OTHER STUFF -> 2, MAJOR APPLIANCES -> 2, HOME OFFICE -> 2, ACCESSORIES -> 1}|
|3004002756|{ACCESSORIES -> 1, MAJOR APPLIANCES -> 1}                                    |
+----------+-----------------------------------------------------------------------------+

对于Pandas版本,您可以执行:

from collections import Counter
df['Desc'] = df['Description'].apply(lambda x:x.split(","))
df['Desc'] = df['Desc'].apply(lambda x:[e.strip() for e in x])
df2 = df.groupby('CustomerID')['Desc'].agg(sum)
df2['Counts'] = df2['Desc'].apply(lambda x:Counter(x))

相关内容

  • 没有找到相关文章

最新更新