Azure Databricks: how to inner-join two DataFrames with a one-to-many relationship and select specific columns from both



I read the data from JSON files as follows:

import os, shutil, glob, time
from pyspark.sql.functions import trim

# Get data for DF1
df1 = spark.read.format("json").load("/mnt/coi/df1.json")
# Get data for DF2
df2 = spark.read.format("json").load("/mnt/coi/df2.json")

I join the data and select columns from both DataFrames, but the final result is incorrect and is missing some of the data:

df = df2.join(df1, df2.Number == df1.Number, how="inner").select(df1.abc, df2.xyz)

DF1 JSON, with unique values in the Number column:

{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0"} 
{"Number":80216884,"Type":"8","ID":2,"Code":"1010","abc":"MT"} 
{"Number":80216885,"Type":"8","ID":2,"Code":"1295","abc":"MS"} 

DF2 JSON, with duplicate Numbers:

{"Number":80216883,"DateTime":"2019-11-16","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216883,"DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5,"abc":"M0"}
{"Number":80216884,"DateTime":"2019-11-09","Year":2020,"Quarter":2,"xyz":5,"abc":"MT"}

The result I want is:

{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-16","Year":2020,"Quarter":2,"xyz":5}
{"Number":80216883,"Type":"8","ID":2,"Code":"1290","abc":"M0","DateTime":"2018-11-20","Year":2020,"Quarter":2,"xyz":5}

How do I inner-join two DataFrames with a one-to-many relationship and select specific columns from both?

When I do the join, some Numbers present in both DataFrames are missing from the final output JSON.

Also, when merging the part files into a single file, only the last part file is copied into the final data. The code:

dfAll.write.format("json").save("/mnt/coi/DataModel")

# Read part files
path = glob.glob("/dbfs/mnt/coi/DataModel/part-000*.json")

# Move files to the FinalData folder in blob storage
for file in path:
    shutil.move(file, "/dbfs/mnt/coi/FinalData/FinalData.json")
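The loop above keeps only the last part file because `shutil.move` replaces the destination on every iteration. A sketch that concatenates the parts instead (`merge_part_files` is a hypothetical helper name; the paths are the ones from the question):

```python
import glob
import shutil

def merge_part_files(src_glob, dest_path):
    """Append every part file matching src_glob to one output file.

    Unlike shutil.move in a loop, which overwrites dest_path each
    time, this keeps the contents of all part files.
    """
    with open(dest_path, "wb") as out:
        for part in sorted(glob.glob(src_glob)):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# e.g. merge_part_files("/dbfs/mnt/coi/DataModel/part-000*.json",
#                       "/dbfs/mnt/coi/FinalData/FinalData.json")
```

Alternatively, `dfAll.coalesce(1).write.format("json").save(...)` writes a single part file in the first place, so no merge step is needed (at the cost of funnelling the write through one task).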

To get the expected result, and given that you only want the values that are on the "many" side of the one-to-many relationship, my approach is as follows:

from pyspark.sql.functions import col

# Inner join, keeping only the columns from df2
df = df2.join(df1, df2.Number == df1.Number, how="inner").select(df2.DateTime, df2.Number, df2.Quarter, df2.Year, df2.abc, df2.xyz)
# Keep only Numbers that appear more than once (the "many" side)
df3 = df.groupBy("Number").count().filter(col("count") > 1).select(df.Number)
df4 = df3.join(df, df.Number == df3.Number, how="inner")
display(df4)

Please let me know whether this helps.
