Detect consecutive timestamps where all rows have NaN values in pandas



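I want to detect the start and end (Datetime) of each set of consecutive rows in a DataFrame where all values are NaN.

What is the best way to store the results in an array of tuples, one tuple per set, holding the start and end datetime of the rows with NaN values?

For example, with the DataFrame below, the tuples should look like this: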

missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
('2018-10-11 02:00:00', '2018-10-11 02:00:00'), ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]

Example DataFrame:

+------------+---------------------+------------+------------+
| geo_id     | Datetime            |  Variable1 |  Variable2 |
+------------+---------------------+------------+------------+
| 1          | 2018-10-10 18:00:00 |     20     |     10     |
| 2          | 2018-10-10 18:00:00 |     22     |     10     |
| 1          | 2018-10-10 19:00:00 |     20     |     nan    |
| 2          | 2018-10-10 19:00:00 |     21     |     nan    |
| 1          | 2018-10-10 20:00:00 |     30     |     nan    |
| 2          | 2018-10-10 20:00:00 |     30     |     nan    |
| 1          | 2018-10-10 21:00:00 |     nan    |     5      |
| 2          | 2018-10-10 21:00:00 |     nan    |     5      |
| 1          | 2018-10-10 22:00:00 |     nan    |     nan    |
| 1          | 2018-10-10 23:00:00 |     nan    |     nan    |
| 1          | 2018-10-11 00:00:00 |     nan    |     nan    |
| 1          | 2018-10-11 01:00:00 |     5      |     2      |
| 1          | 2018-10-11 02:00:00 |     nan    |     nan    |
| 1          | 2018-10-11 03:00:00 |     2      |     1      |
| 1          | 2018-10-11 04:00:00 |     nan    |     nan    |
+------------+---------------------+------------+------------+

Update: what if some datetimes are repeated?

You may need to use groupby with a condition:

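# Flag the rows where every value column is NaN (geo_id is never NaN, so it is left out)
s = df[['Variable1', 'Variable2']].isnull().all(axis=1)
# (~s).cumsum() only increases on rows that have data, so consecutive all-NaN rows share one group key
df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()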
Out[89]: 
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
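
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same cumsum-and-groupby idea on the example data from the question. The helper name `nan_runs` and its parameters `value_cols` and `time_col` are made up for illustration, not part of pandas. It assumes that Datetime is stored as strings, that the rows are already in time order, and that "all values are NaN" refers to Variable1 and Variable2 only.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "geo_id": [1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    "Datetime": [
        "2018-10-10 18:00:00", "2018-10-10 18:00:00",
        "2018-10-10 19:00:00", "2018-10-10 19:00:00",
        "2018-10-10 20:00:00", "2018-10-10 20:00:00",
        "2018-10-10 21:00:00", "2018-10-10 21:00:00",
        "2018-10-10 22:00:00", "2018-10-10 23:00:00",
        "2018-10-11 00:00:00", "2018-10-11 01:00:00",
        "2018-10-11 02:00:00", "2018-10-11 03:00:00",
        "2018-10-11 04:00:00",
    ],
    "Variable1": [20, 22, 20, 21, 30, 30, nan, nan, nan, nan, nan, 5, nan, 2, nan],
    "Variable2": [10, 10, nan, nan, nan, nan, 5, 5, nan, nan, nan, 2, nan, 1, nan],
})


def nan_runs(frame, value_cols, time_col="Datetime"):
    """Hypothetical helper: (first, last) time_col values for each run of
    consecutive rows whose value_cols are all NaN, in the frame's row order."""
    all_nan = frame[value_cols].isna().all(axis=1)
    if not all_nan.any():
        return []
    # The counter steps up on every row that has data, so all-NaN rows
    # that sit next to each other end up with the same counter value.
    run_id = (~all_nan).cumsum()
    runs = (
        frame.loc[all_nan, time_col]
        .groupby(run_id[all_nan])
        .agg(["first", "last"])
    )
    return list(zip(runs["first"], runs["last"]))


missing_datetimes = nan_runs(df, ["Variable1", "Variable2"])
print(missing_datetimes)

# One possible way to handle datetimes that repeat across geo_id values:
# look for runs inside each geo_id separately.
per_geo = {
    int(geo): nan_runs(group, ["Variable1", "Variable2"])
    for geo, group in df.groupby("geo_id")
}
print(per_geo)
```

On the example data, the whole-frame call gives the three tuples listed in the question. Whether repeated datetimes should be handled per geo_id like this, or deduplicated first, depends on what a "consecutive" run is supposed to mean for your data.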
