1
我有一张有客户购买历史的销售表。我想制作一个按客户分组的新数据帧。数据框架还应包括一列,其中包含客户购买的所有产品的value_counts字典以及每个产品的数量
这就是我的数据帧现在的样子
Description Counts
CustomerID
3004000304 MAJOR APPLIANCES,HOME OFFICE, OTHER STUFF 3
3004000304 HOME OFFICE, MAJOR APPLIANCES 2
3004000304 ACCESSORIES, OTHER STUFF 2
3004002756 MAJOR APPLIANCES, ACCESSORIES 2
3004002946 HOME OFFICE, HOME OFFICE 2
3004002946 ACCESSORIES, MAJOR APPLIANCES 2
3004002946 MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES 3
我想要这个
Counts
CustomerID
3004000304 {'MAJOR APPLIANCES': 2, 'HOME OFFICE': 2, 'ACCESSORIES': 1, 'OTHER STUFF':2}
3004002756 {'MAJOR APPLIANCES': 1, 'ACCESSORIES': 1}
3004002946 {'HOME OFFICE': 2, 'ACCESSORIES': 2, 'MAJOR APPLIANCES': 1,'OTHER STUFF':1}
我的方法我将PYSPARK DF转换为Pandas-DF,因为我对PYSPARK没有太多经验,而且是新手。如果能得到PYSPARK的直接帮助,我将不胜感激,否则即使是Pandas的实现也会很棒。
f = lambda x: dict(zip(x['Description'], x['Counts']))
df = categories.groupby(level=0).apply(f).to_frame('Counts')
print (df)
这并没有给我预期的结果
对于Pyspark解决方案,您可以参考这个SO答案,
代码将作为
df.groupBy("CustomerID").agg(
F.map_from_entries(
F.collect_list(
F.struct("Description", "Counts"))).alias("Description_Counts")
).show(truncate=False)
请注意,map_from_entries仅适用于spark版本>=2.4.0
编辑-根据注释,你可以使用下面的代码,我首先分解列中的每个值,然后根据它们的计数对它们进行分组
dft=spark.createDataFrame([(3004000304,"MAJOR APPLIANCES, HOME OFFICE, OTHER STUFF",3),(3004000304,"HOME OFFICE, MAJOR APPLIANCES",2),(3004000304,"ACCESSORIES, OTHER STUFF",2),(3004002756,"MAJOR APPLIANCES, ACCESSORIES",2),(3004002946,"HOME OFFICE, HOME OFFICE",2),(3004002946,"ACCESSORIES, MAJOR APPLIANCES",2),(3004002946,"MAJOR APPLIANCES, OTHER STUFF, ACCESSORIES",3)],"cid long, descr string, count int")
dft.withColumn("descr",F.explode(F.split("descr",", "))).groupBy("cid","descr").agg(F.count("descr").alias("count")).groupBy("cid").agg(
F.map_from_entries(
F.collect_list(
F.struct("descr", "count"))).alias("Description_Counts")
).show(truncate=False)
# output
+----------+-----------------------------------------------------------------------------+
|cid |Description_Counts |
+----------+-----------------------------------------------------------------------------+
|3004002946|{HOME OFFICE -> 2, MAJOR APPLIANCES -> 2, ACCESSORIES -> 2, OTHER STUFF -> 1}|
|3004000304|{OTHER STUFF -> 2, MAJOR APPLIANCES -> 2, HOME OFFICE -> 2, ACCESSORIES -> 1}|
|3004002756|{ACCESSORIES -> 1, MAJOR APPLIANCES -> 1} |
+----------+-----------------------------------------------------------------------------+
对于Pandas版本,您可以执行:
from collections import Counter
df['Desc'] = df['Description'].apply(lambda x:x.split(","))
df['Desc'] = df['Desc'].apply(lambda x:[e.strip() for e in x])
df2 = df.groupby('CustomerID')['Desc'].agg(sum)
df2['Counts'] = df2['Desc'].apply(lambda x:Counter(x))