Merge two columns of comma-separated strings into one column of unique joined values



Suppose I have a DataFrame like this:

| id | col 1 | col 2 |
| 1  | "A,B" | "A,D" |
| 2  | "A,B" | "X,Y" |
| 3  | "B,D" | "B,D" |

I need the output to be:

| id | merged (col 1 and col 2) |
| 1  | "A,B,D"                  |
| 2  | "A,B,X,Y"                |
| 3  | "B,D"                    |

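Split the columns col1 and col2 to get arrays, then use array_union to combine them without duplicates, and finally use array_join to turn the result back into a string: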

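import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if one already exists (e.g. in a notebook or shell).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "A,B", "A,D"),
    (2, "A,B", "X,Y"),
    (3, "B,D", "B,D")
], ["id", "col1", "col2"])

df1 = df.select(
    "id",
    F.array_join(
        # array_union (Spark 2.4+) merges the two arrays and drops duplicates.
        F.array_union(F.split("col1", ","), F.split("col2", ",")),
        ","
    ).alias("merged_col1_col2")
)

df1.show()
# +---+----------------+
# | id|merged_col1_col2|
# +---+----------------+
# |  1|           A,B,D|
# |  2|         A,B,X,Y|
# |  3|             B,D|
# +---+----------------+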
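To make the per-row logic of the split / array_union / array_join chain easier to follow, here is a minimal plain-Python sketch of the same idea. The helper name `merge_unique` is hypothetical and not part of Spark, and the first-seen ordering is an assumption based on the output shown above rather than a documented guarantee.

```python
def merge_unique(left: str, right: str, sep: str = ",") -> str:
    """Join the distinct items of two delimited strings in first-seen order.

    Hypothetical helper that mimics split -> array_union -> array_join
    for a single row; it is not a Spark API.
    """
    seen = set()
    merged = []
    for item in left.split(sep) + right.split(sep):
        if item not in seen:  # skip anything already emitted
            seen.add(item)
            merged.append(item)
    return sep.join(merged)


# Same sample rows as in the question.
rows = [(1, "A,B", "A,D"), (2, "A,B", "X,Y"), (3, "B,D", "B,D")]
result = [(row_id, merge_unique(c1, c2)) for row_id, c1, c2 in rows]
print(result)
```

This is only meant to illustrate the semantics on small data; for a real Spark DataFrame the built-in column functions avoid the overhead of a Python UDF.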

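If you are working in pandas rather than PySpark, you can write a function that performs the merge and then use apply to run it on each row, like this: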

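import pandas as pd

def h(a, b):
    # Split both strings into lists and concatenate them.
    a = a.split(",")
    b = b.split(",")
    u = a + b
    # set() drops duplicates; sorted() returns the items in alphabetical order.
    return ','.join(sorted(set(u)))

df = pd.DataFrame({'col 1': ["A,B", "A,B", "B,D"],
                   'col 2': ["A,D", "X,Y", "B,D"]})
# axis=1 passes each row to the lambda.
df['merged'] = df.apply(lambda row: h(row['col 1'], row['col 2']), axis=1)
df.head()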
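Because the answer above sorts the merged values, the result is always alphabetical, which may differ from the original item order. If first-seen order matters, one possible variant is sketched below. The helper name `merge_keep_order` is hypothetical; it relies on `dict.fromkeys`, which keeps insertion order in Python 3.7+.

```python
import pandas as pd


def merge_keep_order(a: str, b: str) -> str:
    """Join unique items from two comma-separated strings, first occurrence first."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return ",".join(dict.fromkeys(a.split(",") + b.split(",")))


df = pd.DataFrame({"col 1": ["A,B", "A,B", "B,D"],
                   "col 2": ["A,D", "X,Y", "B,D"]})

# Iterating over the two columns with zip avoids a row-wise apply.
df["merged"] = [merge_keep_order(a, b) for a, b in zip(df["col 1"], df["col 2"])]
print(df)
```

On the sample data both versions give the same strings; they only differ when the inputs are not already in alphabetical order, for example `merge_keep_order("B,A", "A,D")` yields `"B,A,D"` where the sorted version yields `"A,B,D"`.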
