比较两个火花数据帧之间的数据,并在相应列中填充"如果匹配则通过"和"未通过"



我有两个类似的火花数据帧

Source1数据

Source2数据

我使用pyspark-python使用Snapshot_Date as Key列比较两个源之间的数据,并希望在另一个数据帧中显示结果,如下面的

比较彩色编码便于理解,不需要

提前感谢

您可以使用这个答案中提到的spark扩展包比较两个数据帧Pyspark

您可以修改result数据帧以适当地获得最终的数据帧。

from gresearch.spark.diff import *
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
("Source1",20201116, 436039, 123, 222, 333,0, 555),
("Source1",20201117,436034, 234, 34, 7, 5, 678)
]
df1Columns = ["Source","Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
print("Source1 dataframe")
df1.show(truncate=False)
data2 = [
("Source2", 20201116,436039,234,234,333,0,555),
("Source2", 20201117,436034,234,5,7,5,678)
]
df2Columns = ["Source", "Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)
print(" Source2 dataframe")
df2.show(truncate=False)
options = DiffOptions().with_change_column("changes")
result = df1.diff_with_options(df2, options, 'Snapshot_Date')
result.select("Snapshot_Date", "diff","changes").show(truncate=False)

输出如图所示。列changes列出了其中存在更改的列。

Source1 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source1|20201116     |436039   |123 |222 |333 |0   |555 |
|Source1|20201117     |436034   |234 |34  |7   |5   |678 |
+-------+-------------+---------+----+----+----+----+----+
Source2 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source2|20201116     |436039   |234 |234 |333 |0   |555 |
|Source2|20201117     |436034   |234 |5   |7   |5   |678 |
+-------+-------------+---------+----+----+----+----+----+
+-------------+----+--------------------+
|Snapshot_Date|diff|changes             |
+-------------+----+--------------------+
|20201116     |C   |[Source, Col1, Col2]|
|20201117     |C   |[Source, Col2]      |
+-------------+----+--------------------+

最新更新