在pyspark中,将列中的列表拆分为一个热编码功能



我有一个pyspark数据帧,如下所示:

>xxx>zzz
id tmp_list 其他功能
1 ["疼痛","意大利"]
2 [法国、法国、美国、印度] yyy
3 [法国,德国]

步骤1:用常量字符串替换tmp_list中的所有非欧盟国家

from pyspark.sql import functions as F
df = ...
EU_countries = ['Spain', 'Italy', 'France', 'Germany']
def replaceNonEU(c):
cond = c == EU_countries[0]
for country in EU_countries[1:]:
cond |= (c == country)
return F.when(cond, c).otherwise(F.lit("other_countries"))
df = df.withColumn("tmp_list", F.array_distinct(F.transform("tmp_list", replaceNonEU)))
#+---+--------------------------------+--------------+
#|id |tmp_list                        |other features|
#+---+--------------------------------+--------------+
#|1  |[Spain, Italy]                  |xxx           |
#|2  |[Spain, France, other_countries]|yyy           |
#|3  |[Spain, Germany]                |zzz           |
#+---+--------------------------------+--------------+

步骤2:为tmp_list中的每个可能值创建一个新列,指示该值是否为tmp_list:的元素

for c in EU_countries + ['other_countries']:
df = df.withColumn(c, F.array_contains("tmp_list", c).cast("int"))
df = df.drop("tmp_list")
#+---+--------------+-----+-----+------+-------+---------------+
#| id|other features|Spain|Italy|France|Germany|other_countries|
#+---+--------------+-----+-----+------+-------+---------------+
#|  1|           xxx|    1|    1|     0|      0|              0|
#|  2|           yyy|    1|    0|     1|      0|              1|
#|  3|           zzz|    1|    0|     0|      1|              0|
#+---+--------------+-----+-----+------+-------+---------------+

像在熊猫里一样推理和工作。

  1. 分解
  2. 创建将非EU_countries归属为other_countries的类别
  3. get_dummies。在这一点上,我赞扬了这篇文章

下面的代码;

df=df.select('*').withColumn('tmp_list1', F.explode(col('tmp_list')))#Create new column with exploded list
df=df.select('*').withColumn('Cat', when(col('tmp_list1').isin(EU_countries),df.tmp_list1).otherwise('other_countries'))#Create another column Cat
df.groupBy("tmp_list",'other features').pivot("Cat").agg(F.lit(1)).na.fill(0).show()#Get dummies

+---------------------------+--------------+------+-------+-----+-----+---------------+
|tmp_list                   |other features|France|Germany|Italy|Spain|other_countries|
+---------------------------+--------------+------+-------+-----+-----+---------------+
|[Spain, Germany]           |zzz           |0     |1      |0    |1    |0              |
|[Spain, Italy]             |xxx           |0     |0      |1    |1    |0              |
|[Spain, France, USA, India]|yyy           |1     |0      |0    |1    |1              |
+---------------------------+--------------+------+-------+-----+-----+---------------+

最新更新