Python - 打印 2 个数据帧比较结果的最快方法



我有 2 个数据帧,每个数据帧有 300 万条记录。我需要比较两者并显示每个匹配索引、发生不匹配的列名以及这些不匹配列中的数据。

下面是数据帧和预期输出的示例:

df1=pd.DataFrame({
'Date':['2013-11-23','2013-11-24','2013-11-25','2013-11-26','2013-11-27'],
'Fruit':['Banana','Orange','Apple','Celery','Apple'],
'Num':[10.2,22.1,8.6,7.6,10.2],
'Color':['Green','Yellow','Orange','Green','Green']
})
df2=pd.DataFrame({
'Date':['2013-11-23','2013-11-24','2013-11-25','2013-11-26','2013-11-27'],
'Fruit':['Banana1','Orange','Apple','Celery','Apple'],
'Num':[10.2,22.12,8.60,7.6,10.2],
'Color':['Green','Yellow1','Orange','Green','Green']
})
df1.set_index("Date",inplace = True)
df2.set_index("Date",inplace = True)
Dataframe 1:
Fruit   Num   Color
Date                            
2013-11-23  Banana  10.2   Green
2013-11-24  Orange  22.1  Yellow
2013-11-25   Apple   8.6  Orange
2013-11-26  Celery   7.6   Green
2013-11-27   Apple  10.2   Green
DataFrame 2:
Fruit    Num    Color
Date                               
2013-11-23  Banana1  10.21    Green
2013-11-24   Orange  22.12  Yellow1
2013-11-25    Apple   8.60   Orange
2013-11-26   Celery   7.60    Green
2013-11-27    Apple  10.20    Green
Expected OUTPUT:
Date 2013-11-23 has mismatch in columns ['Fruit' , 'Num'].
DF1: ['Banana',10.2]
DF2: ['Banana1', 10.21]
--------------------------------------------------------
Date 2013-11-24 has mismatch in columns ['Num' , 'Color'].
DF1: [22.1,'Yellow']
DF2: [22.12,'Yellow1']
----------------------------------------------------------
And so on for every Date index

将文件另存为.py并运行它

import multiprocessing as mp
import pandas as pd

def compare(data):
df1_data, df2_data = data
date = df1_data[0]
d1 = df1_data[1]
d2 = df2_data[1]
# print(d1)
# print(d2)
cmp = d1 != d2
if any(cmp):
print('Date ', date, end=' ')
diff_cols = cmp[cmp].index.tolist()
d1_diff = d1[diff_cols]
d2_diff = d2[diff_cols]
print('has mismatch in columns', diff_cols, '.')
print('DF1:', d1_diff.tolist())
print('DF2:', d2_diff.tolist())
print('-'*20)

if __name__ == "__main__":
df1 = pd.DataFrame({
'Date': ['2013-11-23', '2013-11-24', '2013-11-25', '2013-11-26', '2013-11-27'],
'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple'],
'Num': [10.2, 22.1, 8.6, 7.6, 10.2],
'Color': ['Green', 'Yellow', 'Orange', 'Green', 'Green']
})
df2 = pd.DataFrame({
'Date': ['2013-11-23', '2013-11-24', '2013-11-25', '2013-11-26', '2013-11-27'],
'Fruit': ['Banana1', 'Orange', 'Apple', 'Celery', 'Apple'],
'Num': [10.2, 22.12, 8.60, 7.6, 10.2],
'Color': ['Green', 'Yellow1', 'Orange', 'Green', 'Green']
})
df1.set_index("Date", inplace=True)
df2.set_index("Date", inplace=True)
n_cpu = mp.cpu_count()
pool = mp.Pool(n_cpu-2)
results = pool.map(compare, zip(df1.iterrows(), df2.iterrows()))

输出

Date  2013-11-23 has mismatch in columns ['Fruit'] .
DF1: ['Banana']
DF2: ['Banana1']
--------------------
Date  2013-11-24 has mismatch in columns ['Num', 'Color'] .
DF1: [22.1, 'Yellow']
DF2: [22.12, 'Yellow1']
--------------------

最新更新