I want to compare two dataframes and, based on the following three conditions, extract the records:
- If a record matches, "SAME" should come in a new column FLAG.
- If a record does not match and it comes from df1 (say No. 66), "DF1" should come in the FLAG column.
- If a record does not match and it comes from df2 (say No. 77), "DF2" should come in the FLAG column.

Here the entire record needs to be considered and validated, i.e. a record-wise comparison. Also, I need to check millions of records, so this has to be PySpark code.
DF1:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
DF2:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3000,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Xom,5000,mex,IT,2/11/2019
77,XYZ,5000,mex,IT,2/11/2019
Expected output:
No,Name,Sal,Address,Dept,Join_Date,FLAG
11,Sam,1000,ind,IT,2/11/2019,SAME
22,Tom,2000,usa,HR,2/11/2019,SAME
33,Kom,3500,uk,IT,2/11/2019,DF1
33,Kom,3000,uk,IT,2/11/2019,DF2
44,Nom,4000,can,HR,2/11/2019,SAME
55,Vom,5000,mex,IT,2/11/2019,DF1
55,Xom,5000,mex,IT,2/11/2019,DF2
66,XYZ,5000,mex,IT,2/11/2019,DF1
77,XYZ,5000,mex,IT,2/11/2019,DF2
I loaded the input data as shown below, but I don't know how to proceed.
import pandas as pd

# raw strings, so the backslashes in the Windows paths are not treated as escape sequences
df1 = pd.read_csv(r"D:\inputs\file1.csv")
df2 = pd.read_csv(r"D:\inputs\file2.csv")
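Since the actual comparison has to run in PySpark anyway, I assume the files would rather be read with Spark's CSV reader (a minimal sketch, assuming a SparkSession named spark and the same paths):

df1 = spark.read.csv(r"D:\inputs\file1.csv", header=True, inferSchema=True)
df2 = spark.read.csv(r"D:\inputs\file2.csv", header=True, inferSchema=True)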
Any help is appreciated. Thanks.
# Requisite packages to import
import sys
from pyspark.sql.functions import lit, count, col, when
from pyspark.sql.window import Window
# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
Use a window function to check whether the count of identical rows is greater than 1; if it is, mark the FLAG column as SAME, otherwise keep it as it is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+
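Note that show() does not guarantee any row order, which is why the rows above come out shuffled. To reproduce the layout of the expected output, you could sort before displaying, e.g.:

df.orderBy('No', 'FLAG').show()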
I think you can solve your problem by creating temporary columns to indicate the origin, plus a join. Then you only need to check the conditions, i.e. whether both sources are present, or only one, and which one.
Consider the following code:
from pyspark.sql.functions import *
df1= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3500,'uk','IT','2/11/2019'),
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Vom',5000,'mex','IT','2/11/2019'),
(66,'XYZ',5000,'mex','IT','2/11/2019')],
["No","Name","Sal","Address","Dept","Join_Date"])
df2= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3000,'uk','IT','2/11/2019'),
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Xom',5000,'mex','IT','2/11/2019'),
(77,'XYZ',5000,'mex','IT','2/11/2019')],
["No","Name","Sal","Address","Dept","Join_Date"])
#creation of your example dataframes
df1 = df1.withColumn("Source1", lit("DF1"))
df2 = df2.withColumn("Source2", lit("DF2"))
#temporary columns to refer the origin later
# full join on all columns; a Source column is only set if the record appears in that dataframe
result = (df1.join(df2, ["No","Name","Sal","Address","Dept","Join_Date"], "full")
          .withColumn("FLAG",
                      # "SAME" if the record appears in both dataframes,
                      # otherwise flag the single dataframe it came from
                      when(col("Source1").isNotNull() & col("Source2").isNotNull(), "SAME")
                      .otherwise(when(col("Source1").isNotNull(), "DF1").otherwise("DF2")))
          .drop("Source1", "Source2"))  # remove the temporary columns and show the result

result.show()
Output:
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+
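One caveat with this join approach: in Spark, a plain equality join never matches null values, so a record with a null in any of the key columns would show up flagged as both DF1 and DF2 rather than SAME. If the real data may contain nulls, a null-safe join condition is one workaround (a sketch, not part of the original answer; the names cols, cond and result are illustrative, and eqNullSafe requires Spark 2.3+):

from functools import reduce
from pyspark.sql.functions import coalesce, col, when

cols = ["No", "Name", "Sal", "Address", "Dept", "Join_Date"]
# null-safe equality on every key column
cond = reduce(lambda a, b: a & b, [df1[c].eqNullSafe(df2[c]) for c in cols])

result = (df1.join(df2, cond, "full")
          .withColumn("FLAG",
                      when(col("Source1").isNotNull() & col("Source2").isNotNull(), "SAME")
                      .otherwise(when(col("Source1").isNotNull(), "DF1").otherwise("DF2")))
          # with a condition join the key columns appear twice, so keep whichever side is non-null
          .select([coalesce(df1[c], df2[c]).alias(c) for c in cols] + [col("FLAG")]))
result.show()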