我有一个pyspark数据帧,如下所示:
id | tmp_list | 其他功能 |
---|---|---|
1 | ["疼痛","意大利"] | >xxx|
2 | [法国、法国、美国、印度] | yyy |
3 | [法国,德国] | >zzz
步骤1:用常量字符串替换tmp_list
中的所有非欧盟国家
from pyspark.sql import functions as F
df = ...
EU_countries = ['Spain', 'Italy', 'France', 'Germany']
def replaceNonEU(c):
cond = c == EU_countries[0]
for country in EU_countries[1:]:
cond |= (c == country)
return F.when(cond, c).otherwise(F.lit("other_countries"))
df = df.withColumn("tmp_list", F.array_distinct(F.transform("tmp_list", replaceNonEU)))
#+---+--------------------------------+--------------+
#|id |tmp_list |other features|
#+---+--------------------------------+--------------+
#|1 |[Spain, Italy] |xxx |
#|2 |[Spain, France, other_countries]|yyy |
#|3 |[Spain, Germany] |zzz |
#+---+--------------------------------+--------------+
步骤2:为tmp_list
中的每个可能值创建一个新列,指示该值是否为tmp_list
:的元素
for c in EU_countries + ['other_countries']:
df = df.withColumn(c, F.array_contains("tmp_list", c).cast("int"))
df = df.drop("tmp_list")
#+---+--------------+-----+-----+------+-------+---------------+
#| id|other features|Spain|Italy|France|Germany|other_countries|
#+---+--------------+-----+-----+------+-------+---------------+
#| 1| xxx| 1| 1| 0| 0| 0|
#| 2| yyy| 1| 0| 1| 0| 1|
#| 3| zzz| 1| 0| 0| 1| 0|
#+---+--------------+-----+-----+------+-------+---------------+
像在熊猫里一样推理和工作。
- 分解
- 创建将非EU_countries归属为other_countries的类别
- get_dummies。在这一点上,我赞扬了这篇文章
下面的代码;
df=df.select('*').withColumn('tmp_list1', F.explode(col('tmp_list')))#Create new column with exploded list
df=df.select('*').withColumn('Cat', when(col('tmp_list1').isin(EU_countries),df.tmp_list1).otherwise('other_countries'))#Create another column Cat
df.groupBy("tmp_list",'other features').pivot("Cat").agg(F.lit(1)).na.fill(0).show()#Get dummies
+---------------------------+--------------+------+-------+-----+-----+---------------+
|tmp_list |other features|France|Germany|Italy|Spain|other_countries|
+---------------------------+--------------+------+-------+-----+-----+---------------+
|[Spain, Germany] |zzz |0 |1 |0 |1 |0 |
|[Spain, Italy] |xxx |0 |0 |1 |1 |0 |
|[Spain, France, USA, India]|yyy |1 |0 |0 |1 |1 |
+---------------------------+--------------+------+-------+-----+-----+---------------+