I want to get the missing dates for each key in a column. My DataFrame looks like this:
      size  number    key      date
0  153.2 K   12345  Hello  20181002
1  153.2 K   12345     No  20181001
2  153.2 K   12345  Hello  20181003
3  153.2 K   12345     No  20181003
4  153.2 K   12345  Hello  20181004
5  153.2 K   12345     No  20181005
6  153.2 K   12345  Hello  20181006
I want to get the following result:
key date
No 20181002
No 20181004
Hello 20181005
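For key = "No", the dates 20181002 and 20181004 are missing. Each key has its own date range, and the output should list the dates missing for each key within that key's own range.
I used the following code, but it builds one date range across both keys, starting at 20181001 and ending at 20181006: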
import pandas as pd
df = pd.read_csv('4002.csv')
print(df.head(1))
dates = pd.date_range(*pd.to_datetime(df['date'], format='%Y%m%d',errors='coerce').agg(['min', 'max']), freq='D').strftime('%Y%m%d').astype(int)
df1=pd.DataFrame(index=pd.Index(df['key'].unique(), name='key'),columns=dates.difference(df['date'])).reset_index().melt('key').drop(columns=['value'])
print(df1)
df1.to_csv('4002output.csv', index=False)
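To see why a single global range cannot give the per-key result, here is a minimal sketch on the sample data. It is an illustration rather than the exact code above, and the names `parsed`, `global_range`, and `global_missing` are assumed for this example. Every day from 20181001 to 20181006 appears under at least one key, so nothing is reported as missing.

```python
import pandas as pd

# Sample data from the question (only the columns that matter here).
df = pd.DataFrame({
    'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
    'date': [20181002, 20181001, 20181003, 20181003,
             20181004, 20181005, 20181006],
})
parsed = pd.to_datetime(df['date'], format='%Y%m%d')

# One range spanning all keys: 2018-10-01 through 2018-10-06.
global_range = pd.date_range(parsed.min(), parsed.max(), freq='D')

# A date only counts as missing here if no key at all has it.
global_missing = global_range.difference(pd.DatetimeIndex(parsed))
print(len(global_missing))  # 0 on this sample
```

The gaps only show up once the range and the comparison are both done per key.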
Let's try:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
midx = pd.MultiIndex.from_frame(
    df.groupby('key')['date'].agg(['min', 'max'])
        .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1)
        .explode()
        .reset_index(),
    names=['key', 'date']
)
new_df = (df.set_index(['key', 'date'])
          .reindex(midx)
          .loc[lambda df_: df_['size'].isna()]
          .index
          .to_frame(index=False)
          .rename(columns={0: 'date'}))
new_df:
key date
0 Hello 2018-10-05
1 No 2018-10-02
2 No 2018-10-04
Optionally, convert the dates back to strings:
new_df['date'] = new_df['date'].dt.strftime('%Y%m%d')
key date
0 Hello 20181005
1 No 20181002
2 No 20181004
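The `date` column in the question holds integers rather than strings. If the output should match that dtype, one option is to cast after formatting. This is a small sketch; the `new_df` built here is a hypothetical stand-in for the result shown above.

```python
import pandas as pd

# Hypothetical stand-in for the new_df result of the pipeline.
new_df = pd.DataFrame({
    'key': ['Hello', 'No', 'No'],
    'date': pd.to_datetime(['2018-10-05', '2018-10-02', '2018-10-04']),
})

# Format as YYYYMMDD text, then cast the text to integers.
new_df['date'] = new_df['date'].dt.strftime('%Y%m%d').astype(int)
print(new_df)
```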
DataFrame used:
df = pd.DataFrame({
'size': ['153.2 K', '153.2 K', '153.2 K', '153.2 K', '153.2 K', '153.2 K',
'153.2 K'],
'number': [12345, 12345, 12345, 12345, 12345, 12345, 12345],
'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
'date': [20181002, 20181001, 20181003, 20181003, 20181004, 20181005,
20181006]
})
Explanation:
- Groupby aggregate to get the min and max values for each key group:
df.groupby('key')['date'].agg(['min', 'max'])
             min        max
key
Hello 2018-10-02 2018-10-06
No    2018-10-01 2018-10-05
- Convert each min/max pair to a date range with pd.date_range:
(df.groupby('key')['date'].agg(['min', 'max'])
   .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1))
key
Hello DatetimeIndex(['2018-10-02', '2018-10-03', '20...
No DatetimeIndex(['2018-10-01', '2018-10-02', '20...
dtype: object
- explode the ranges into one row per date:
(df.groupby('key')['date'].agg(['min', 'max'])
   .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1)
   .explode())
key
Hello 2018-10-02
Hello 2018-10-03
Hello 2018-10-04
Hello 2018-10-05
Hello 2018-10-06
No 2018-10-01
No 2018-10-02
No 2018-10-03
No 2018-10-04
No 2018-10-05
dtype: datetime64[ns]
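As a standalone illustration of what `explode` does at this step, here is a minimal sketch with made-up ranges of different lengths (the Series `ranges` and its dates are assumptions for the example, not the question's data). Each list element becomes its own row, and the index label is repeated for every element.

```python
import pandas as pd

# One list of dates per key; the lists can have different lengths.
ranges = pd.Series(
    [list(pd.date_range('2018-10-02', '2018-10-04')),
     list(pd.date_range('2018-10-01', '2018-10-02'))],
    index=pd.Index(['Hello', 'No'], name='key'),
)

# explode turns each list element into a separate row.
exploded = ranges.explode()
print(exploded)
```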
- Reset the index and convert the resulting frame to a MultiIndex with MultiIndex.from_frame:
midx = pd.MultiIndex.from_frame(
    df.groupby('key')['date'].agg(['min', 'max'])
        .apply(lambda x: pd.date_range(x['min'], x['max']), axis=1)
        .explode()
        .reset_index(),
    names=['key', 'date']
)
MultiIndex([('Hello', '2018-10-02'),
('Hello', '2018-10-03'),
('Hello', '2018-10-04'),
('Hello', '2018-10-05'),
('Hello', '2018-10-06'),
( 'No', '2018-10-01'),
( 'No', '2018-10-02'),
( 'No', '2018-10-03'),
( 'No', '2018-10-04'),
( 'No', '2018-10-05')],
names=['key', 'date'])
The rest borrows from this excellent answer by @ScottBoston.
- set_index + reindex with the MultiIndex:
(df.set_index(['key', 'date'])
   .reindex(midx))
                     size   number
key   date
Hello 2018-10-02  153.2 K  12345.0
      2018-10-03  153.2 K  12345.0
      2018-10-04  153.2 K  12345.0
      2018-10-05      NaN      NaN
      2018-10-06  153.2 K  12345.0
No    2018-10-01  153.2 K  12345.0
      2018-10-02      NaN      NaN
      2018-10-03  153.2 K  12345.0
      2018-10-04      NaN      NaN
      2018-10-05  153.2 K  12345.0
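The reindex step is what exposes the gaps: any (key, date) pair in the target index that has no matching row gets NaN in every column, which is also why `number` shows up as 12345.0 rather than 12345. A minimal sketch on a cut-down frame (the names `full`, `small`, `target`, and `out` are assumed for this example):

```python
import pandas as pd

full = pd.date_range('2018-10-01', '2018-10-03', freq='D')

# Only the first and last day are present; 2018-10-02 is absent.
small = pd.DataFrame({'key': 'No', 'date': full[[0, 2]], 'number': 12345})

# Target index containing every day in the range for this key.
target = pd.MultiIndex.from_product([['No'], full], names=['key', 'date'])

out = small.set_index(['key', 'date']).reindex(target)
print(out)                  # the middle row is NaN
print(out['number'].dtype)  # float64, since an int column cannot hold NaN
```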
- Use loc to keep only the NaN rows:
(df.set_index(['key', 'date'])
   .reindex(midx)
   .loc[lambda df_: df_['size'].isna()])
                 size  number
key   date
Hello 2018-10-05  NaN     NaN
No    2018-10-02  NaN     NaN
      2018-10-04  NaN     NaN
- Convert the remaining index to a frame with to_frame:
(df.set_index(['key', 'date'])
   .reindex(midx)
   .loc[lambda df_: df_['size'].isna()]
   .index
   .to_frame(index=False))
key date
0 Hello 2018-10-05
1 No 2018-10-02
2 No 2018-10-04
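As a cross-check, the same result can be reached with a plain loop over the groups. This is a different technique from the reindex approach above, sketched under the assumption that an explicit loop is acceptable for the data size; `rows`, `full_range`, `missing`, and `missing_df` are names made up for this example.

```python
import pandas as pd

# Sample data from the question (only the columns that matter here).
df = pd.DataFrame({
    'key': ['Hello', 'No', 'Hello', 'No', 'Hello', 'No', 'Hello'],
    'date': [20181002, 20181001, 20181003, 20181003,
             20181004, 20181005, 20181006],
})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

rows = []
for key, dates in df.groupby('key')['date']:
    # Every calendar day between this key's first and last date.
    full_range = pd.date_range(dates.min(), dates.max(), freq='D')
    # Days in that range that never occur for this key.
    missing = full_range.difference(pd.DatetimeIndex(dates))
    rows.extend((key, day) for day in missing)

missing_df = pd.DataFrame(rows, columns=['key', 'date'])
print(missing_df)
```

On the sample data this yields the same three rows as the reindex version: Hello 2018-10-05, No 2018-10-02, and No 2018-10-04.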