我有 2 个数据帧,每个数据帧有 300 万条记录。我需要比较两者并显示每个匹配索引、发生不匹配的列名以及这些不匹配列中的数据。
下面是数据帧和预期输出的示例:
df1=pd.DataFrame({
'Date':['2013-11-23','2013-11-24','2013-11-25','2013-11-26','2013-11-27'],
'Fruit':['Banana','Orange','Apple','Celery','Apple'],
'Num':[10.2,22.1,8.6,7.6,10.2],
'Color':['Green','Yellow','Orange','Green','Green']
})
df2=pd.DataFrame({
'Date':['2013-11-23','2013-11-24','2013-11-25','2013-11-26','2013-11-27'],
'Fruit':['Banana1','Orange','Apple','Celery','Apple'],
'Num':[10.2,22.12,8.60,7.6,10.2],
'Color':['Green','Yellow1','Orange','Green','Green']
})
df1.set_index("Date",inplace = True)
df2.set_index("Date",inplace = True)
Dataframe 1:
Fruit Num Color
Date
2013-11-23 Banana 10.2 Green
2013-11-24 Orange 22.1 Yellow
2013-11-25 Apple 8.6 Orange
2013-11-26 Celery 7.6 Green
2013-11-27 Apple 10.2 Green
DataFrame 2:
Fruit Num Color
Date
2013-11-23 Banana1 10.21 Green
2013-11-24 Orange 22.12 Yellow1
2013-11-25 Apple 8.60 Orange
2013-11-26 Celery 7.60 Green
2013-11-27 Apple 10.20 Green
Expected OUTPUT:
Date 2013-11-23 has mismatch in columns ['Fruit' , 'Num'].
DF1: ['Banana',10.2]
DF2: ['Banana1', 10.21]
--------------------------------------------------------
Date 2013-11-24 has mismatch in columns ['Num' , 'Color'].
DF1: [22.1,'Yellow']
DF2: [22.12,'Yellow1']
----------------------------------------------------------
And so on for every Date index
将文件另存为.py并运行它
import multiprocessing as mp
import pandas as pd
def compare(data):
df1_data, df2_data = data
date = df1_data[0]
d1 = df1_data[1]
d2 = df2_data[1]
# print(d1)
# print(d2)
cmp = d1 != d2
if any(cmp):
print('Date ', date, end=' ')
diff_cols = cmp[cmp].index.tolist()
d1_diff = d1[diff_cols]
d2_diff = d2[diff_cols]
print('has mismatch in columns', diff_cols, '.')
print('DF1:', d1_diff.tolist())
print('DF2:', d2_diff.tolist())
print('-'*20)
if __name__ == "__main__":
df1 = pd.DataFrame({
'Date': ['2013-11-23', '2013-11-24', '2013-11-25', '2013-11-26', '2013-11-27'],
'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple'],
'Num': [10.2, 22.1, 8.6, 7.6, 10.2],
'Color': ['Green', 'Yellow', 'Orange', 'Green', 'Green']
})
df2 = pd.DataFrame({
'Date': ['2013-11-23', '2013-11-24', '2013-11-25', '2013-11-26', '2013-11-27'],
'Fruit': ['Banana1', 'Orange', 'Apple', 'Celery', 'Apple'],
'Num': [10.2, 22.12, 8.60, 7.6, 10.2],
'Color': ['Green', 'Yellow1', 'Orange', 'Green', 'Green']
})
df1.set_index("Date", inplace=True)
df2.set_index("Date", inplace=True)
n_cpu = mp.cpu_count()
pool = mp.Pool(n_cpu-2)
results = pool.map(compare, zip(df1.iterrows(), df2.iterrows()))
输出
Date 2013-11-23 has mismatch in columns ['Fruit'] .
DF1: ['Banana']
DF2: ['Banana1']
--------------------
Date 2013-11-24 has mismatch in columns ['Num', 'Color'] .
DF1: [22.1, 'Yellow']
DF2: [22.12, 'Yellow1']
--------------------