Get the missing dates from a pandas DataFrame



I want to get the missing dates for each key in a column. My DataFrame is as follows:

      size  number    key      date
0  153.2 K   12345  Hello  20181002
1  153.2 K   12345     No  20181001
2  153.2 K   12345  Hello  20181003
3  153.2 K   12345     No  20181003
4  153.2 K   12345  Hello  20181004
5  153.2 K   12345     No  20181005
6  153.2 K   12345  Hello  20181006

I want to get the following result:

key        date
No        20181002
No        20181004
Hello     20181005

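For key "No", the dates 20181002 and 20181004 are missing. The point is that each key has its own date range, and the output should list the missing dates for each key within that key's own range.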

I used the following code, but it takes a single date range across both keys, starting at 20181001 and ending at 20181006:

import pandas as pd
df = pd.read_csv('4002.csv')
print(df.head(1))
dates = pd.date_range(*pd.to_datetime(df['date'], format='%Y%m%d',errors='coerce').agg(['min', 'max']), freq='D').strftime('%Y%m%d').astype(int)
df1=pd.DataFrame(index=pd.Index(df['key'].unique(), name='key'),columns=dates.difference(df['date'])).reset_index().melt('key').drop(columns=['value'])
print(df1)
df1.to_csv('4002output.csv', index=False)
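
To see why a single shared range hides the gaps, here is a minimal sketch built on the sample rows from the question (the variable names `global_range` and `global_missing` are mine, not from the original code). Every day between 20181001 and 20181006 appears under at least one key, so the difference against the whole column comes back empty:

import pandas as pd

# Sample rows copied from the question (only the columns that matter here).
df = pd.DataFrame({
    'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
    'date': [20181002, 20181001, 20181003, 20181003, 20181004, 20181005,
             20181006]
})
dates = pd.to_datetime(df['date'], format='%Y%m%d')

# One range for the whole column, ignoring which key each row belongs to.
global_range = pd.date_range(dates.min(), dates.max(), freq='D')

# Days in that range that never appear under any key.
global_missing = global_range.difference(pd.DatetimeIndex(dates))
print(len(global_missing))

The comparison has to be made inside each key's own min/max range for the gaps to show up.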

Let's try:

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
midx = pd.MultiIndex.from_frame(
    df.groupby('key')['date'].agg(['min', 'max'])
      .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1)
      .explode()
      .reset_index(),
    names=['key', 'date']
)
new_df = (df.set_index(['key', 'date'])
            .reindex(midx)
            .loc[lambda df_: df_['size'].isna()]
            .index
            .to_frame(index=False)
            .rename(columns={0: 'date'}))

new_df:

     key       date
0  Hello 2018-10-05
1     No 2018-10-02
2     No 2018-10-04

Optionally, convert the dates back to strings:

new_df['date'] = new_df['date'].dt.strftime('%Y%m%d')

     key      date
0  Hello  20181005
1     No  20181002
2     No  20181004
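
As a cross-check on the result above, here is a shorter sketch that works group by group instead of building a MultiIndex. It is not the method used in this answer; the helper name `missing_dates` and the variable `rows` are hypothetical, and the data is the sample from the question:

import pandas as pd

# Sample rows copied from the question (only the columns that matter here).
df = pd.DataFrame({
    'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
    'date': [20181002, 20181001, 20181003, 20181003, 20181004, 20181005,
             20181006]
})
dates = pd.to_datetime(df['date'], format='%Y%m%d')

def missing_dates(group):
    """Return the days between the group's min and max that are absent."""
    full = pd.date_range(group.min(), group.max(), freq='D')
    return full.difference(pd.DatetimeIndex(group))

# Walk over each key's dates and collect (key, missing day) pairs.
rows = [
    (key, day.strftime('%Y%m%d'))
    for key, group in dates.groupby(df['key'])
    for day in missing_dates(group)
]
check = pd.DataFrame(rows, columns=['key', 'date'])
print(check)

On the sample data this should list 20181005 for Hello and 20181002 and 20181004 for No, matching the frame above.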

DataFrame used:

df = pd.DataFrame({
'size': ['153.2 K', '153.2 K', '153.2 K', '153.2 K', '153.2 K', '153.2 K',
'153.2 K'],
'number': [12345, 12345, 12345, 12345, 12345, 12345, 12345],
'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
'date': [20181002, 20181001, 20181003, 20181003, 20181004, 20181005,
20181006]
})

Explanation:

  1. Use groupby aggregate to get the min and max date for each key group:
df.groupby('key')['date'].agg(['min', 'max'])
             min        max
key                        
Hello 2018-10-02 2018-10-06
No    2018-10-01 2018-10-05
  1. Convert each min/max pair into a date range:
(df.groupby('key')['date'].agg(['min', 'max'])
   .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1))
key
Hello    DatetimeIndex(['2018-10-02', '2018-10-03', '20...
No       DatetimeIndex(['2018-10-01', '2018-10-02', '20...
dtype: object
  1. Use explode to split each range into one row per date:
(df.groupby('key')['date'].agg(['min', 'max'])
   .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1)
   .explode())

key
Hello   2018-10-02
Hello   2018-10-03
Hello   2018-10-04
Hello   2018-10-05
Hello   2018-10-06
No      2018-10-01
No      2018-10-02
No      2018-10-03
No      2018-10-04
No      2018-10-05
dtype: datetime64[ns]
  1. Convert this frame to a MultiIndex with MultiIndex.from_frame:
midx = pd.MultiIndex.from_frame(
    df.groupby('key')['date'].agg(['min', 'max'])
      .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1)
      .explode()
      .reset_index(),
    names=['key', 'date']
)
MultiIndex([('Hello', '2018-10-02'),
            ('Hello', '2018-10-03'),
            ('Hello', '2018-10-04'),
            ('Hello', '2018-10-05'),
            ('Hello', '2018-10-06'),
            (   'No', '2018-10-01'),
            (   'No', '2018-10-02'),
            (   'No', '2018-10-03'),
            (   'No', '2018-10-04'),
            (   'No', '2018-10-05')],
           names=['key', 'date'])

The rest borrows from this excellent answer by @ScottBoston.

  1. set_index + reindex with the MultiIndex:
(df.set_index(['key', 'date'])
   .reindex(midx))
                     size   number
key   date                        
Hello 2018-10-02  153.2 K  12345.0
      2018-10-03  153.2 K  12345.0
      2018-10-04  153.2 K  12345.0
      2018-10-05      NaN      NaN
      2018-10-06  153.2 K  12345.0
No    2018-10-01  153.2 K  12345.0
      2018-10-02      NaN      NaN
      2018-10-03  153.2 K  12345.0
      2018-10-04      NaN      NaN
      2018-10-05  153.2 K  12345.0
  1. Use loc to keep only the rows that are NaN:
(df.set_index(['key', 'date'])
   .reindex(midx)
   .loc[lambda df_: df_['size'].isna()])

                 size  number
key   date                   
Hello 2018-10-05  NaN     NaN
No    2018-10-02  NaN     NaN
      2018-10-04  NaN     NaN
  1. Convert the remaining index to a DataFrame with to_frame:
(df.set_index(['key', 'date'])
   .reindex(midx)
   .loc[lambda df_: df_['size'].isna()]
   .index
   .to_frame(index=False))
     key       date
0  Hello 2018-10-05
1     No 2018-10-02
2     No 2018-10-04
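
One caveat about the steps above: filtering on df_['size'].isna() assumes the size column has no NaN values in the original data, otherwise real rows would be reported as missing dates. A possible variant, sketched here under that concern and not part of the original answer, compares the two indexes directly with MultiIndex.difference and never looks at the data columns. The names `bounds`, `expected`, `present` and `missing` are mine:

import pandas as pd

# Sample rows copied from the question (only the columns that matter here).
df = pd.DataFrame({
    'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
    'date': [20181002, 20181001, 20181003, 20181003, 20181004, 20181005,
             20181006]
})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

# Per-key first and last date, as in step 1 of the explanation.
bounds = df.groupby('key')['date'].agg(['min', 'max'])

# Every (key, day) pair that should exist inside each key's own range.
expected = pd.MultiIndex.from_tuples(
    [(key, day)
     for key, row in bounds.iterrows()
     for day in pd.date_range(row['min'], row['max'], freq='D')],
    names=['key', 'date'],
)

# The (key, day) pairs that actually exist in the data.
present = pd.MultiIndex.from_frame(df[['key', 'date']])

# Pairs that are expected but not present are the missing dates.
missing = expected.difference(present).to_frame(index=False)
print(missing)

Because only the index pairs are compared, this variant gives the same three rows on the sample data regardless of what the other columns contain.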
