Compare two dictionaries and filter



I have a dictionary1 that contains
{'A': Timestamp('2022-05-23 00:00:00'), 'L': Timestamp('2017-06-21 00:00:00'), 'S': Timestamp('2021-11-02 00:00:00'), 'D': Timestamp('2021-11-08 00:00:00')}

Then I have another dictionary, dictionary2, that looks like

{'A': [Timestamp('2022-01-16 00:00:00'),
Timestamp('2022-01-13 00:00:00'),
Timestamp('2022-01-12 00:00:00'),
Timestamp('2023-01-10 00:00:00')],
'L': [Timestamp('2023-01-16 00:00:00'),
Timestamp('2023-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')],
'S': [Timestamp('2021-01-16 00:00:00'),
Timestamp('2022-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')],
'D': [Timestamp('2023-01-16 00:00:00'),
Timestamp('2022-10-18 00:00:00')]}

For each of A, L, S, D, I want to keep only the dates in dictionary2 that are GREATER than the corresponding date in dictionary1.

So my desired output is
{'A': [Timestamp('2023-01-10 00:00:00')],
'L': [Timestamp('2023-01-16 00:00:00'),
Timestamp('2023-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')],
'S': [Timestamp('2022-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')],
'D': [Timestamp('2023-01-16 00:00:00'),
Timestamp('2022-10-18 00:00:00')]}

Here is a pandas solution

import pandas as pd
from pandas import Timestamp
d1 = {'A': Timestamp('2022-05-23 00:00:00'),
'L': Timestamp('2017-06-21 00:00:00'),
'S': Timestamp('2021-11-02 00:00:00'),
'D': Timestamp('2021-11-08 00:00:00')}
d2 = {'A': [Timestamp('2022-01-16 00:00:00'),
Timestamp('2022-01-13 00:00:00'),
Timestamp('2022-01-12 00:00:00'),
Timestamp('2023-01-10 00:00:00')],
'L': [Timestamp('2023-01-16 00:00:00'),
Timestamp('2023-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')],
'S': [Timestamp('2021-01-16 00:00:00'),
Timestamp('2022-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')],
'D': [Timestamp('2023-01-16 00:00:00'),
Timestamp('2022-10-18 00:00:00')]}
# convert to a series and explode lists
s1 = pd.Series(d1, name='date')
s2 = pd.Series(d2, name='date').explode()
# merge your pd.Series together on the index
m = pd.merge(s2, s1, right_index=True, left_index=True, how='left')
# boolean indexing to filter your dates where s2 date > s1 date
new_df = m[m['date_x'] > m['date_y']]
print(new_df)

Output:

      date_x     date_y
A 2023-01-10 2022-05-23
D 2023-01-16 2021-11-08
D 2022-10-18 2021-11-08
L 2023-01-16 2017-06-21
L 2023-01-13 2017-06-21
L 2023-01-12 2017-06-21
S 2022-01-13 2021-11-02
S 2023-01-12 2021-11-02
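
The question asks for a dict of lists rather than a DataFrame. As a minimal sketch, assuming the same merge approach as above, the filtered rows can be regrouped by their index label and converted back to a dict. The name result and the pd.to_datetime conversion are additions for illustration and are not part of the original answer.

```python
import pandas as pd
from pandas import Timestamp

d1 = {'A': Timestamp('2022-05-23'), 'L': Timestamp('2017-06-21'),
      'S': Timestamp('2021-11-02'), 'D': Timestamp('2021-11-08')}
d2 = {'A': [Timestamp('2022-01-16'), Timestamp('2022-01-13'),
            Timestamp('2022-01-12'), Timestamp('2023-01-10')],
      'L': [Timestamp('2023-01-16'), Timestamp('2023-01-13'),
            Timestamp('2023-01-12')],
      'S': [Timestamp('2021-01-16'), Timestamp('2022-01-13'),
            Timestamp('2023-01-12')],
      'D': [Timestamp('2023-01-16'), Timestamp('2022-10-18')]}

# One row per date, indexed by key; convert the exploded object column to datetimes
s1 = pd.Series(d1, name='date')
s2 = pd.to_datetime(pd.Series(d2, name='date').explode())

# Same merge and filter as in the answer: date_x comes from d2, date_y from d1
m = pd.merge(s2, s1, left_index=True, right_index=True, how='left')
new_df = m[m['date_x'] > m['date_y']]

# Regroup the surviving dates by key and turn them back into a dict of lists
result = new_df.groupby(level=0, sort=False)['date_x'].apply(list).to_dict()
print(result)
```

Keys whose dates are all filtered out would not appear in result at all, so a caller that needs every key may want to add empty lists afterwards.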

Given the two data sources, you can use a comprehension to build a new list based on a condition:

import datetime
# Stand-in for pandas.Timestamp: parse 'YYYY-MM-DD HH:MM:SS' strings into datetime objects
Timestamp = lambda s: datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
lookup = {
'A': Timestamp('2022-05-23 00:00:00'),
'L': Timestamp('2017-06-21 00:00:00'),
'S': Timestamp('2021-11-02 00:00:00'),
'D': Timestamp('2021-11-08 00:00:00')
}
data_in = {
'A': [
Timestamp('2023-01-10 00:00:00')
],
'L': [
Timestamp('2023-01-16 00:00:00'),
Timestamp('2023-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')
],
'S': [
Timestamp('2022-01-13 00:00:00'),
Timestamp('2023-01-12 00:00:00')
],
'D': [
Timestamp('2023-01-16 00:00:00'),
Timestamp('2022-10-18 00:00:00')
]
}
data_out = [
{key: [v for v in value if v > lookup[key]]}
for key, value
in data_in.items()
]
print(data_out)
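
Note that data_out above is a list of single-key dicts. If a single dict shaped like the desired output is preferred, a dict comprehension is a small variation. The sketch below assumes the unfiltered lists from the question as input and uses a hypothetical helper, parse, as a stand-in for pandas.Timestamp.

```python
from datetime import datetime


def parse(s):
    """Hypothetical helper: parse 'YYYY-MM-DD HH:MM:SS' into a datetime."""
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")


lookup = {
    "A": parse("2022-05-23 00:00:00"),
    "L": parse("2017-06-21 00:00:00"),
    "S": parse("2021-11-02 00:00:00"),
    "D": parse("2021-11-08 00:00:00"),
}

data_in = {
    "A": [parse("2022-01-16 00:00:00"), parse("2022-01-13 00:00:00"),
          parse("2022-01-12 00:00:00"), parse("2023-01-10 00:00:00")],
    "L": [parse("2023-01-16 00:00:00"), parse("2023-01-13 00:00:00"),
          parse("2023-01-12 00:00:00")],
    "S": [parse("2021-01-16 00:00:00"), parse("2022-01-13 00:00:00"),
          parse("2023-01-12 00:00:00")],
    "D": [parse("2023-01-16 00:00:00"), parse("2022-10-18 00:00:00")],
}

# One dict: each key maps to the dates strictly later than its lookup date
filtered = {
    key: [v for v in values if v > lookup[key]]
    for key, values in data_in.items()
}

for key, values in filtered.items():
    print(key, [v.strftime("%Y-%m-%d") for v in values])
```

This raises KeyError if data_in contains a key that lookup lacks, which is usually the desired behaviour when both dicts are expected to share the same keys.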

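I don't know what Timestamp is, but if it has a function that returns the date as a string (or as any other data structure for which > is defined), you can do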

# This is some class that knows its stamp value (the "date")
class Timestamp:

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value

# This is your reference dict.
d1 = {
    'A': Timestamp('2022-05-23 00:00:00'),
    'L': Timestamp('2017-06-21 00:00:00'),
    'S': Timestamp('2021-11-02 00:00:00'),
    'D': Timestamp('2021-11-08 00:00:00')
}
# This is the data you want to clean.
d2 = {
    'A': [
        Timestamp('2022-01-16 00:00:00'),
        Timestamp('2022-01-13 00:00:00'),
        Timestamp('2022-01-12 00:00:00'),
        Timestamp('2023-01-10 00:00:00')
    ],
    'L': [
        Timestamp('2023-01-16 00:00:00'),
        Timestamp('2023-01-13 00:00:00'),
        Timestamp('2023-01-12 00:00:00')
    ],
    'S': [
        Timestamp('2021-01-16 00:00:00'),
        Timestamp('2022-01-13 00:00:00'),
        Timestamp('2023-01-12 00:00:00')
    ],
    'D': [Timestamp('2023-01-16 00:00:00'),
          Timestamp('2022-10-18 00:00:00')]
}

# This is the new dict you want.
d3 = {
    key: [stamp for stamp in stamplist if stamp.value > d1[key].value]
    for (key, stamplist) in d2.items()
}

# Check it:
for key, stamplist in d3.items():
    for stamp in stamplist:
        print(stamp.value)
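
Comparing the string values works here only because zero-padded 'YYYY-MM-DD HH:MM:SS' strings sort lexicographically in chronological order. If the class itself can define ordering, the stamps can be compared directly with >. A minimal sketch, assuming a hypothetical class Stamp (not pandas.Timestamp) and functools.total_ordering from the standard library:

```python
from functools import total_ordering


@total_ordering
class Stamp:
    """Hypothetical timestamp wrapper; ordering is delegated to the stored string."""

    def __init__(self, value):
        # Assumed format: zero-padded 'YYYY-MM-DD HH:MM:SS'
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value

    def __repr__(self):
        return f"Stamp({self.value!r})"


d1 = {
    "A": Stamp("2022-05-23 00:00:00"),
    "S": Stamp("2021-11-02 00:00:00"),
}
d2 = {
    "A": [Stamp("2022-01-16 00:00:00"), Stamp("2023-01-10 00:00:00")],
    "S": [Stamp("2021-01-16 00:00:00"), Stamp("2022-01-13 00:00:00"),
          Stamp("2023-01-12 00:00:00")],
}

# total_ordering derives > from __lt__ and __eq__, so stamps compare directly
d3 = {key: [s for s in stamps if s > d1[key]] for key, stamps in d2.items()}
print(d3)
```

With ordering defined on the class, the filtering comprehension no longer needs to know how the date is stored internally.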

With pandas, one approach is to use the pandas.Series constructor together with a dict/list comprehension:

import pandas as pd
from pandas import Timestamp

s1 = pd.Series(dictionary1)
s2 = pd.Series(dictionary2)

# Compare each date v (not the key k) against the reference date for that key
out = {k: [v for v in s2[k] if v > s1[k]] for k in s2.index}

Output:

{'A': [Timestamp('2023-01-10 00:00:00')],
'L': [Timestamp('2023-01-16 00:00:00'), Timestamp('2023-01-13 00:00:00'), Timestamp('2023-01-12 00:00:00')],
'S': [Timestamp('2022-01-13 00:00:00'), Timestamp('2023-01-12 00:00:00')],
'D': [Timestamp('2023-01-16 00:00:00'), Timestamp('2022-10-18 00:00:00')]}
