Suppose I have a DataFrame like this:
| id | col 1 | col 2 |
| 1  | "A,B" | "A,D" |
| 2  | "A,B" | "X,Y" |
| 3  | "B,D" | "B,D" |
I need the output to be:
| id | merged (col1 and col2) |
| 1  | "A,B,D"                |
| 2  | "A,B,X,Y"              |
| 3  | "B,D"                  |
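Split the columns col1 and col2 to get arrays, then use array_union to combine them without duplicates, and finally use array_join to turn the result back into a string: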
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "A,B", "A,D"),
    (2, "A,B", "X,Y"),
    (3, "B,D", "B,D")
], ["id", "col1", "col2"])

df1 = df.select(
    "id",
    F.array_join(
        # split each string on commas, then union the two arrays
        F.array_union(F.split("col1", ","), F.split("col2", ",")),
        ","
    ).alias("merged_col1_col2")
)

df1.show()
#+---+----------------+
#| id|merged_col1_col2|
#+---+----------------+
#|  1|           A,B,D|
#|  2|         A,B,X,Y|
#|  3|             B,D|
#+---+----------------+
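To make the per-row logic easier to follow, here is a plain-Python sketch of what the split / union / join chain computes for a single row. The function name `merge_unique` is made up for illustration, and the sketch assumes the union keeps elements in first-occurrence order, which matches the output shown above; it is a model of the idea, not how Spark implements it.

```python
def merge_unique(left: str, right: str, sep: str = ",") -> str:
    """Split both strings, union them without duplicates, and join again."""
    tokens = left.split(sep) + right.split(sep)
    # dict.fromkeys drops duplicates while keeping first-occurrence order
    return sep.join(dict.fromkeys(tokens))


rows = [(1, "A,B", "A,D"), (2, "A,B", "X,Y"), (3, "B,D", "B,D")]
for row_id, col1, col2 in rows:
    print(row_id, merge_unique(col1, col2))
```

For the three sample rows this gives the same merged strings as the DataFrame version.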
You need to write a function that handles the merge in pandas, then use apply to run that function on each row and combine the columns. Like this:
import pandas as pd

def h(a, b):
    a = a.split(",")
    b = b.split(",")
    u = a + b
    return ','.join(sorted(set(u)))

df = pd.DataFrame({'col 1': ["A,B", "A,B", "B,D"],
                   'col 2': ["A,D", "X,Y", "B,D"]})
df['merged'] = df.apply(lambda row: h(row['col 1'], row['col 2']), axis=1)
df.head()
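One thing to be aware of: `h` sorts the merged values alphabetically, so its result only matches the requested output when the inputs happen to be in alphabetical order already, as they are in the sample data. The sketch below compares `h` with a hypothetical order-preserving variant, `h_keep_order` (a name invented here, not part of the answer), which could be passed to the same `df.apply` call if the original order matters.

```python
def h(a, b):
    # Same logic as the answer above: set union, then alphabetical sort
    return ",".join(sorted(set(a.split(",") + b.split(","))))


def h_keep_order(a, b):
    # Hypothetical variant: drop duplicates but keep first-occurrence order
    return ",".join(dict.fromkeys(a.split(",") + b.split(",")))


print(h("X,Y", "A,B"))             # A,B,X,Y
print(h_keep_order("X,Y", "A,B"))  # X,Y,A,B
```

On the three sample rows both functions return the same strings, so the difference only shows up with unsorted input.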