我有一个带有DataTime列的数据帧(时区采用不同的格式)。时区似乎是UTC,但我想将该列转换为pd.to_datetime
并且失败了。这就是问题#1。由于失败,我无法对时间段执行任何日期时间操作,例如按日期分组列/计算天数/按一天中的小时分组等等。这是我的数据帧df_res
DateTime
2017-11-02 19:49:28-07:00
2017-11-27 07:32:22-08:00
2017-12-27 17:01:15-08:00
命令的输出
df_res["DateTime"] = df_res["DateTime"].dt.tz_convert('America/New_York')
AttributeError: Can only use .dt accessor with datetimelike values
我转换为datetime
df_res['DateTime'] = pd.to_datetime(df_res['DateTime'])
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
我觉得我在兜圈子。我需要将列转换为日期时间才能执行操作,为此我需要将它们都设置为相同的时区,但除非它是日期时间对象,否则我不能具有相同的时区,所以我如何才能最好地解决这个问题。 我确实参考了以前的帖子,但它们似乎尽可能容易地转换为日期时间:
将日期时间列转换为其他时区熊猫 将熊猫时区感知日期时间索引转换为朴素时间戳,但在某些时区
我认为没有必要应用 lambdas:
df_res['DateTime'] = pd.to_datetime(df_res['DateTime'], utc=True)
文档: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
你可以检查这个:
df = pd.DataFrame({
'time': [
'2017-11-02 19:49:28-08:00',
'2017-11-27 07:32:22-07:00',
'2017-12-27 17:01:15-07:00'
]
})
df['time'] = pd.to_datetime(df['time'])
df['time'].apply(lambda x: pd.to_datetime(x).tz_localize('US/Eastern'))
0 2017-11-03 03:49:28-04:00
1 2017-11-27 14:32:22-05:00
2 2017-12-28 00:01:15-05:00
Name: time, dtype: datetime64[ns, US/Eastern]
在这里发帖是因为我花了几个小时来找出这个问题的答案,因为一般情况下,你可能会在 UTC 以外的另一个时区拥有朴素的日期时间。
这是我提出的解决方案。如果有人对熊猫/numpy有更深入的了解,我会很高兴指出是否有任何方法可以提高其性能。不过,对于有类似问题的人的起点,我可能会派上用场。
from datetime import datetime
import pandas as pd
from pandas import Series
from pandas.api.types import is_datetime64_any_dtype as is_datetime
def ensure_datetime(series: Series, timezone: str):
"""
Ensures that the `series` is a datetime series of dtype datetime64[ns, timezone]
- Convert tz aware values to `timezone`.
- Assume naive values are on `timezone` and make them aware.
- Handle None values and convert them to NaT (so we can accomplish the dtype requirement).
"""
if series.dtype == pd.DatetimeTZDtype(tz=timezone):
return series
are_datetime = series.apply(lambda x: isinstance(x, datetime)).astype(bool)
# Convert only values that are not already datetime, otherwise if there are
# tz-aware values pandas will raise: Tz-aware datetime.datetime cannot
# be converted to datetime64 unless utc=True.
# We cannot set utc=True because pandas will assume naive values to be on UTC
# but we need naive values to be considered on `timezone`.
series = series.mask(
~are_datetime, pd.to_datetime(series[~are_datetime], errors="coerce")
)
# Localize naive values to `timezone`
are_unaware = series.apply(lambda x: not pd.isna(x) and x.tzinfo is None).astype(
bool
)
series = series.mask(
are_unaware, pd.to_datetime(series[are_unaware]).dt.tz_localize(timezone)
)
# Now that we don't have any naive value we can normalize all to UTC and
# then convert to `timezone`.
series = pd.to_datetime(series, utc=True).dt.tz_convert(timezone)
return series
def test_ensure_datetime():
series = pd.Series(
["2022-12-31 16:00:00-08:00", "2023-01-01", "2023-01-01 12:30", None]
)
series = ensure_datetime(series, "America/New_York")
assert is_datetime(series)
assert list(series) == [
pd.Timestamp("2022-12-31 19:00", tz="America/New_York"),
pd.Timestamp("2023-01-01 00:00", tz="America/New_York"),
pd.Timestamp("2023-01-01 12:30", tz="America/New_York"),
pd.NaT,
]
series = ensure_datetime(series.dt.date, "America/New_York")
assert is_datetime(series)
assert list(series) == [
pd.Timestamp("2022-12-31 00:00", tz="America/New_York"),
pd.Timestamp("2023-01-01 00:00", tz="America/New_York"),
pd.Timestamp("2023-01-01 00:00", tz="America/New_York"),
pd.NaT,
]
# Mix aware timestamps with naive
series = pd.Series(
[
pd.Timestamp("2022-12-31 12:00", tz="America/New_York"),
pd.Timestamp("2022-12-31 12:00"),
]
)
series = ensure_datetime(series, "America/New_York")
assert list(series) == [
pd.Timestamp("2022-12-31 12:00", tz="America/New_York"),
pd.Timestamp("2022-12-31 12:00", tz="America/New_York"),
]