I have a DataFrame (6M rows) with two columns: one contains local times (timezone-naive) and the other contains the time zone. Something like this:
| | SCHEDULED_DEPARTURE | ORIGIN_TZ |
|---:|:----------------------|:--------------------|
| 0 | 2020-11-30 11:40:00 | America/New_York |
| 1 | 2020-11-30 16:51:00 | America/New_York |
| 2 | 2020-11-30 09:05:00 | America/Chicago |
| 3 | 2020-11-30 19:18:00 | America/Chicago |
| 4 | 2020-11-30 10:36:00 | America/New_York |
| 5 | 2020-11-30 12:10:00 | America/Los_Angeles |
| 6 | 2020-11-30 16:05:00 | America/New_York |
| 7 | 2020-11-30 12:14:00 | America/New_York |
| 8 | 2020-11-30 16:05:00 | America/New_York |
| 9 | 2020-11-30 12:40:00 | America/Chicago |
I am trying to localize each row of `SCHEDULED_DEPARTURE` with a `for` loop that subsets `df` by time zone, attaches that time zone, and moves on to the next one:
for tz in df['ORIGIN_TZ'].unique():
    mask_tz = (df['ORIGIN_TZ'] == tz)
    df.loc[mask_tz, 'SCHEDULED_DEPARTURE'] = df.loc[mask_tz, 'SCHEDULED_DEPARTURE'].dt.tz_localize(tz)
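Strangely, sometimes it works and sometimes it raises the following error:
AttributeError: Can only use .dt accessor with datetimelike values
When I pull out the `SCHEDULED_DEPARTURE` column, its dtype is clearly datetime, as in: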
Name: SCHEDULED_DEPARTURE, Length: 5714008, dtype: datetime64[ns]
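Do you know how to solve this? Can a single column hold more than one time zone?
Here is the code to reproduce a sample `df` (it assumes `import pandas as pd`):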
df = pd.DataFrame({'SCHEDULED_DEPARTURE': {0: pd.Timestamp('2020-11-30 10:15:00'), 1: pd.Timestamp('2020-11-30 07:55:00'), 2: pd.Timestamp('2020-11-30 06:00:00'), 3: pd.Timestamp('2020-11-30 16:23:00'), 4: pd.Timestamp('2020-11-30 07:35:00'), 5: pd.Timestamp('2020-11-30 08:00:00'), 6: pd.Timestamp('2020-11-30 08:50:00'), 7: pd.Timestamp('2020-11-30 13:45:00'), 8: pd.Timestamp('2020-11-30 10:15:00'), 9: pd.Timestamp('2020-11-30 20:00:00')}, 'ORIGIN_TZ': {0: 'America/New_York', 1: 'America/New_York', 2: 'America/Denver', 3: 'America/New_York', 4: 'America/Chicago', 5: 'America/Chicago', 6: 'America/Los_Angeles', 7: 'America/Chicago', 8: 'America/New_York', 9: 'America/Los_Angeles'}})
Once you run:
df.loc[mask_tz,'SCHEDULED_DEPARTURE'] = df.loc[mask_tz,'SCHEDULED_DEPARTURE'].dt.tz_localize(tz)
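your column becomes object dtype, because it now mixes timezone-aware and naive timestamps, so the next `.dt` access fails. Try localizing from a copy of the original column instead: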
s = df['SCHEDULED_DEPARTURE'].copy()
for tz in df['ORIGIN_TZ'].unique():
    mask_tz = (df['ORIGIN_TZ'] == tz)
    df.loc[mask_tz, 'SCHEDULED_DEPARTURE'] = s.loc[mask_tz].dt.tz_localize(tz)
Then `df.loc[0, 'SCHEDULED_DEPARTURE']` gives:
Timestamp('2020-11-30 10:15:00-0500', tz='America/New_York')
Note, however, that your `SCHEDULED_DEPARTURE` column is still of object dtype.
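If you need a real datetime dtype rather than object, one option is to convert every value to a single zone such as UTC, since a timezone-aware datetime column carries exactly one time zone. The sketch below is one way to do that, not the only one: it leaves the naive local times untouched and writes the result to a new column. The column name `DEPARTURE_UTC` is made up for illustration, and the frame is a shortened version of the sample data.

```python
import pandas as pd

# Shortened sample frame; column names follow the question.
df = pd.DataFrame({
    "SCHEDULED_DEPARTURE": pd.to_datetime([
        "2020-11-30 10:15:00",
        "2020-11-30 06:00:00",
        "2020-11-30 07:35:00",
        "2020-11-30 08:50:00",
    ]),
    "ORIGIN_TZ": [
        "America/New_York",
        "America/Denver",
        "America/Chicago",
        "America/Los_Angeles",
    ],
})

# Localize each time-zone group separately, then convert it to UTC so that
# all pieces share one timezone-aware dtype.
parts = []
for tz, local in df.groupby("ORIGIN_TZ")["SCHEDULED_DEPARTURE"]:
    parts.append(local.dt.tz_localize(tz).dt.tz_convert("UTC"))

# DEPARTURE_UTC is a hypothetical column name; the assignment aligns on index.
df["DEPARTURE_UTC"] = pd.concat(parts)

print(df["DEPARTURE_UTC"].dtype)
```

Because the source column is never overwritten, `.dt` keeps working on every pass of the loop. On real data, local times that fall inside a daylight-saving transition may make `tz_localize` raise; its `ambiguous` and `nonexistent` parameters control how those cases are handled.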