PySpark: find an existing set of rows in a DataFrame and replace it with values from another DataFrame



I have an old PySpark DataFrame (dfo) that looks like this:

id    neighbor_sid   neighbor       division
a1    1100           Naalehu        Hawaii
a2    1101           key-west-fl    Miami
a3    1102           lubbock        Texas
a10   1202           bay-terraces   California

and a new DataFrame (dfn) carrying updated values for some of those ids:

id    neighbor_sid   neighbor       division
a1    1100           Naalehu        Hawaii
a2    1111           key-largo-fl   Miami
a3    1103           grapevine      Texas
a4    1115           meriden-ct     Connecticut

How can I replace the matching rows of dfo with the rows of dfn, so that a2 and a3 are updated, a4 is added, and a10 (which has no match in dfn) is kept as-is?

We can do an outer join on the id field, then use coalesce() to give priority to the fields coming from dfn.

from pyspark.sql import functions as func

columns = ['id', 'neighbor_sid', 'neighbor', 'division']

(
    dfo
    .join(dfn, 'id', 'outer')
    .select('id', *[func.coalesce(dfn[k], dfo[k]).alias(k) for k in columns if k != 'id'])
    .orderBy('id')
    .show()
)
# +---+------------+------------+-----------+
# | id|neighbor_sid|    neighbor|   division|
# +---+------------+------------+-----------+
# | a1|        1100|     Naalehu|     Hawaii|
# |a10|        1202|bay-terraces| California|
# | a2|        1111|key-largo-fl|      Miami|
# | a3|        1103|   grapevine|      Texas|
# | a4|        1115|  meriden-ct|Connecticut|
# +---+------------+------------+-----------+
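
As a quick aside (a minimal sketch, not part of the original answer): the same replacement can also be expressed without coalesce(), by keeping only the dfo rows whose id does not appear in dfn and unioning dfn on top, assuming both DataFrames share the same columns:

# dfo rows with no counterpart in dfn (here only a10) are kept as-is
unchanged = dfo.join(dfn, 'id', 'left_anti')

# dfn supplies the new and updated rows; add the untouched ones back
dfn.unionByName(unchanged).orderBy('id').show()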

I'd do a full outer join, then coalesce where necessary to pick the value from the second table:

# imports and example data
from pyspark.sql import functions as F

cols = ["id", "neighbor_sid", "neighbor", "division"]
data1 = [
    ["a1", 1100, "Naalehu", "Hawaii"],
    ["a2", 1101, "key-west-fl", "Miami"],
    ["a3", 1102, "lubbock", "Texas"],
    ["a10", 1202, "bay-terraces", "California"],
]
data2 = [
    ["a1", 1100, "Naalehu", "Hawaii"],
    ["a2", 1111, "key-largo-fl", "Miami"],
    ["a3", 1103, "grapevine", "Texas"],
    ["a4", 1115, "meriden-ct", "Connecticut"],
]
df0 = spark.createDataFrame(data1, cols)
dfN = spark.createDataFrame(data2, cols)

# solution
merge_df = df0.alias("a").join(dfN.alias("b"), on="id", how="outer")
d = (
    merge_df
    .select(
        "id",
        F.coalesce("b.neighbor_sid", "a.neighbor_sid").alias("neighbor_sid"),
        F.coalesce("b.neighbor", "a.neighbor").alias("neighbor"),
        F.coalesce("b.division", "a.division").alias("division"),
    )
    .sort("neighbor_sid")
)
display(d)
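
display() here presumably refers to the Databricks notebook helper; in a plain PySpark session the equivalent check would simply be:

d.show()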
