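I have two identical data frames, new and old. The new data frame is updated at random times throughout the day. The code below checks whether anything has changed.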
import pandas as pd
import numpy as np

new = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard'],
       'episodes': [42, 24, 31, 29, 37, 40],
       'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
old = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard'],
       'episodes': [12, 32, 31, 32, 37, 40],
       'gender': ['male', 'female', 'female', 'female', 'male', 'male']}

df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
df = pd.DataFrame(old, columns = ['name','episodes', 'gender'])

while True:
    df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
    print(df[~df.episodes.eq(df1.episodes)])
    df1 = df
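I need to write a condition inside the while loop so that df[~df.episodes.eq(df1.episodes)] is printed only when a change is detected. After printing the new data, it should set the two data frames to the same values (the old data is no longer needed) and go back to checking for changes. The code above outputs: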
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
Empty DataFrame
Columns: [name, episodes, gender]
Index: []
So even when a change is actually printed, it is easy to overlook among the empty results. Can you suggest a more efficient way to do this?
== Edit ==
Following @BENY's answer, if I do this:
import pandas as pd
import numpy as np

new = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Sheldon'],
       'episodes': [42, 24, 31, 29, 37, 40],
       'gender': ['male', 'female', 'female', 'female', 'male', 'male']}
old = {'name': ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Sheldon'],
       'episodes': [12, 32, 31, 32, 37, 40],
       'gender': ['male', 'female', 'female', 'female', 'male', 'male']}

df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
df = pd.DataFrame(old, columns = ['name','episodes', 'gender'])

while True:
    df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
    out = df.merge(df1[['name','episodes']],on=['name','episodes'],how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
    print(out)
    df = df1
it prints the following on every pass through the while loop:
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only
Is it possible to print it only once, until there is another change? If I put df = df1, it prints the following and I miss the changes:
Empty DataFrame
Columns: [name, episodes, gender, _merge]
Index: []
I need a clean way to get this data at the moment a change is detected.
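If you want to compare two data frames and check for any changes or differences, why not use the DataFrame.compare() function?
Here is the output for your sample data:
df.compare(df1)
Output: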
  episodes
      self other
0     12.0  42.0
1     32.0  24.0
3     32.0  29.0
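By default, it shows only the differences. Here it shows that only the episodes column differs. self corresponds to df and other corresponds to df1. The index on the left (0, 1, 3) gives the row indices where the differences occur.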
If you want to keep the whole original shape, you can also use the keep_shape= parameter, as follows:
df.compare(df1, keep_shape=True)
Output:
  name       episodes       gender
  self other     self other   self other
0  NaN   NaN     12.0  42.0    NaN   NaN
1  NaN   NaN     32.0  24.0    NaN   NaN
2  NaN   NaN      NaN   NaN    NaN   NaN
3  NaN   NaN     32.0  29.0    NaN   NaN
4  NaN   NaN      NaN   NaN    NaN   NaN
5  NaN   NaN      NaN   NaN    NaN   NaN
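Only the differing values are shown; a NaN means the two values are equal. Of course, if you prefer, you can also show all values, including the equal ones, as follows: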
df.compare(df1, keep_shape=True, keep_equal=True)
         name             episodes        gender
         self       other     self other    self   other
0     Sheldon     Sheldon       12    42    male    male
1       Penny       Penny       32    24  female  female
2         Amy         Amy       31    31  female  female
3  Bernadette  Bernadette       32    29  female  female
4         Raj         Raj       37    37    male    male
5      Howard      Howard       40    40    male    male
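This option lets you compare the two frames side by side. However, the differences are harder to spot this way.
I suggest you start with the default option, which shows only the differences (perhaps noting down the indices of the rows that differ), and use the other 2 options only when you also want to inspect the values that are equal.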
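One way to note down the indices of the differing rows is to read them off the result of compare(). The following is a minimal sketch on the sample data from the question, assuming pandas 1.1 or later (where DataFrame.compare() is available); the names diff, changed_rows and changed_cols are illustrative only.

```python
import pandas as pd

names = ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard']
genders = ['male', 'female', 'female', 'female', 'male', 'male']

# df holds the old data, df1 the new data (same values as in the question).
df = pd.DataFrame({'name': names, 'episodes': [12, 32, 31, 32, 37, 40], 'gender': genders})
df1 = pd.DataFrame({'name': names, 'episodes': [42, 24, 31, 29, 37, 40], 'gender': genders})

diff = df.compare(df1)

# Row labels where at least one column differs.
changed_rows = diff.index.tolist()

# Column names that differ (top level of the two-level column index).
changed_cols = diff.columns.get_level_values(0).unique().tolist()

print(changed_rows)
print(changed_cols)
```

Note that compare() expects both frames to have the same row and column labels; it raises a ValueError otherwise.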
To use it inside the while loop, you can use:
while True:
    df1 = pd.DataFrame(new, columns = ['name','episodes', 'gender'])
    out = df.compare(df1)
    print(out)
    df = df1
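The loop above prints on every pass, even when nothing has changed. To print only when a change is detected, one option is to guard the print with DataFrame.empty. This is a minimal sketch, not a definitive implementation: make_frame and snapshots are hypothetical stand-ins for whatever supplies the updated data, and the loop runs over a fixed list instead of `while True` so that it terminates.

```python
import pandas as pd

cols = ['name', 'episodes', 'gender']
names = ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard']
genders = ['male', 'female', 'female', 'female', 'male', 'male']

def make_frame(episodes):
    # Hypothetical helper: builds one snapshot of the data.
    return pd.DataFrame({'name': names, 'episodes': episodes, 'gender': genders},
                        columns=cols)

# Hypothetical stand-in for successive reads of the changing data source.
snapshots = [
    make_frame([42, 24, 31, 29, 37, 40]),  # differs from the old data in rows 0, 1, 3
    make_frame([42, 24, 31, 29, 37, 40]),  # no change since the previous snapshot
    make_frame([42, 24, 31, 29, 38, 40]),  # row 4 changes
]

df = make_frame([12, 32, 31, 32, 37, 40])  # the old data

changes = []
for df1 in snapshots:
    out = df.compare(df1)
    if not out.empty:          # print only when something actually changed
        print(out)
        changes.append(out)
    df = df1                   # the new data becomes the baseline for the next check
```

In a real polling loop you would probably also pause between checks (for example with time.sleep) so the loop does not spin at full speed.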
Edit
If you also want to see name while still showing only the differences in the other columns, you can add it to the index with append=True, as follows:
df.set_index('name', append=True).compare(df1.set_index('name', append=True))
             episodes
                 self other
  name
0 Sheldon        12.0  42.0
1 Penny          32.0  24.0
3 Bernadette     32.0  29.0
This way, you see the differences together with both the name and the row index.
Let's try merge:
out = df.merge(df1[['name','episodes']],on=['name','episodes'],how='left',indicator=True).loc[lambda x : x['_merge']=='left_only']
         name  episodes  gender     _merge
0     Sheldon        12    male  left_only
1       Penny        32  female  left_only
3  Bernadette        32  female  left_only
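The merge above can be wrapped in a small function and combined with DataFrame.empty so that nothing is printed when the two frames agree. This is a sketch under that assumption; the function name rows_only_in_old is hypothetical. Note that the result holds the rows of the old frame that have no match in the new one, so it shows the old values; swap the arguments to get the new values instead.

```python
import pandas as pd

def rows_only_in_old(old_df, new_df):
    # Left join on (name, episodes); rows marked 'left_only' exist in old_df
    # but have no matching (name, episodes) pair in new_df.
    merged = old_df.merge(new_df[['name', 'episodes']],
                          on=['name', 'episodes'], how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only']

names = ['Sheldon', 'Penny', 'Amy', 'Bernadette', 'Raj', 'Howard']
genders = ['male', 'female', 'female', 'female', 'male', 'male']

df = pd.DataFrame({'name': names, 'episodes': [12, 32, 31, 32, 37, 40], 'gender': genders})
df1 = pd.DataFrame({'name': names, 'episodes': [42, 24, 31, 29, 37, 40], 'gender': genders})

out = rows_only_in_old(df, df1)
if not out.empty:              # print only when a change was found
    print(out)

# Comparing a frame with itself yields no 'left_only' rows.
unchanged = rows_only_in_old(df1, df1)
```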