Python time zone determination and manipulation with tzwhere yields object instead of datetime64 in some cases



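I have millions of rows containing UTC datetime64 values with time zone information, along with latitude/longitude pairs. For each row, I need to determine the local time zone and create a column holding the local time. To do this, I use the tzwhere package.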

A simple dataset that illustrates the problem:

TimeUTC,Latitude,Longitude
2021-10-11 12:16:00+00:00,42.289723,-71.031715
2021-10-11 12:16:00+00:00,0,0

The function I use to get the time zone and then create the local time value:

def tz_from_location(row, tz):
    # Hardcoded in an effort to circumvent the problem. The returned value is still
    # an object, even though row.TimeUTC is a datetime64
    if (row.Latitude == 0) & (row.Longitude == 0):
        print ("0,0")
        ret_val = row.TimeUTC.tz_convert('UTC')
        return (row.TimeUTC)
    try:
        # forceTZ=True tells it to find the nearest timezone for places without one
        tzname = tz.tzNameAt(row.Latitude, row.Longitude, forceTZ=True)
        if (tzname == 'uninhabited'):
            return(row.TimeUTC)

        ret_val = row.TimeUTC.tz_convert(tzname)
#        ret_val = ret_val.to_pydatetime()
    except Exception as e:
        print(f'tz_from_location - Latitude: {row.Latitude} Longitude: {row.Longitude}')
        print(f'Error {e}')
        exit(-1)

    return(ret_val)

The function is called as follows:

import pandas as pd
from tzwhere import tzwhere
from datetime import datetime
bug = pd.read_csv('./foo.csv')
# Initialize tzwhere
tz = tzwhere.tzwhere(forceTZ=True)
# Create the UTC column
bug['TimeUTC'] = bug['TimeUTC'].astype('datetime64[ns]')
# The original data comes in with a timezone that is of the local computer, not
# the location. Turn that into UTC
bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
# Now call the function
bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
# We are putting this into PostgreSQL. If you try to put a TZ aware datetime
# in it will automatically convert it to UTC. So, we need to make this value
# naive and then upload it
bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')

The last line throws an error on the row with 0,0, but not on any other row.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8d/jp8b0rbx5rq0l8p8cnbb5k_r0000gn/T/ipykernel_49416/4161114700.py in <module>
4 bug['TimeUTC'] = bug['TimeUTC'].dt.tz_localize('US/Eastern', ambiguous='NaT', nonexistent='shift_forward')
5 bug['TimeLocal'] = bug.apply(geospatial.tz_from_location, tz=tz, axis=1)
----> 6 bug['TimeLocal'] = bug['TimeLocal'].dt.tz_localize(None, ambiguous='infer')
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/generic.py in __getattr__(self, name)
5459             or name in self._accessors
5460         ):
-> 5461             return object.__getattribute__(self, name)
5462         else:
5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
178             # we're accessing the attribute of the class, i.e., Dataset.geo
179             return self._accessor
--> 180         accessor_obj = self._accessor(obj)
181         # Replace the property with the accessor object. Inspired by:
182         # https://www.pydanny.com/cached-property.html
~/miniforge3/envs/a50-dev/lib/python3.9/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
492             return PeriodProperties(data, orig)
493 
--> 494         raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values

This is because the first row contains a datetime64, but the second row is an object.

Here are the TimeUTC values before the call:

bug.TimeUTC
0   2021-10-11 12:16:00-04:00
1   2021-10-11 12:16:00-04:00
Name: TimeUTC, dtype: datetime64[ns, US/Eastern]

Here is TimeLocal after it has been added to the data frame:

bug.TimeLocal
0    2021-10-11 12:16:00-04:00
1    2021-10-11 12:16:00-04:00
Name: TimeLocal, dtype: object

If you look at the individual rows, the first row is correct, but the second row is an object.

All my efforts to return something that does not show up as an object for the 0,0 row have failed. I'm sure I'm missing something simple.

Here are some suggestions. Take the following DataFrame as an example:

                     TimeUTC   Latitude  Longitude
0  2021-10-11 12:16:00+00:00  42.289723 -71.031715
1  2021-10-11 12:16:00+00:00   0.000000   0.000000

Make sure the datetime column is parsed to a datetime data type:

df['TimeUTC'] = pd.to_datetime(df['TimeUTC'])
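
As a quick check, here is a minimal sketch (the inline CSV text and the variable names are only for illustration) showing that strings carrying a +00:00 offset come out of pd.to_datetime as a time zone aware UTC column. Under that assumption, there should be no need for a separate tz_localize step, which in the question relabelled the 12:16 UTC values as 12:16-04:00:

```python
import io

import pandas as pd

# Same two rows as the sample dataset in the question.
csv_text = """TimeUTC,Latitude,Longitude
2021-10-11 12:16:00+00:00,42.289723,-71.031715
2021-10-11 12:16:00+00:00,0,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# The column is still plain strings here; parse it into tz-aware datetimes.
df["TimeUTC"] = pd.to_datetime(df["TimeUTC"])

print(df["TimeUTC"].dtype)   # a tz-aware datetime64 dtype
print(df["TimeUTC"].dt.tz)   # UTC
```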

Then you can refactor the function that derives the tz from lat/long, e.g.

from timezonefinder import TimezoneFinder

def tz_from_location(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if (row.Latitude == 0) & (row.Longitude == 0):
        return row.TimeUTC.tzname()
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return tzname
    return row.TimeUTC.tzname() # else return existing name

I suggest using timezonefinder, since I have found it to be more efficient and reliable (see its docs and GitHub repository).

Now you can easily apply it and create a column converted to the tz:

df['TimeLocal'] = df.apply(lambda row: row['TimeUTC'].tz_convert(tz_from_location(row)), axis=1)

which gives you

                    TimeUTC   Latitude  Longitude                  TimeLocal
0 2021-10-11 12:16:00+00:00  42.289723 -71.031715  2021-10-11 08:16:00-04:00
1 2021-10-11 12:16:00+00:00   0.000000   0.000000  2021-10-11 12:16:00+00:00
df['TimeLocal'].iloc[0]
Out[2]: Timestamp('2021-10-11 08:16:00-0400', tz='America/New_York')
df['TimeLocal'].iloc[1]
Out[3]: Timestamp('2021-10-11 12:16:00+0000', tz='UTC')

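(!) Note, however, that since your TimeLocal column contains mixed time zones, the dtype of the whole Series will be object. There is no way around this; it is simply how pandas datetime handles mixed time zones within one Series.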
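
To illustrate, here is a small sketch with made-up variable names. It shows two timestamps for the same instant in different time zones ending up in an object Series. It also shows one possible way to still get naive local times (as wanted for the PostgreSQL upload in the question), by dropping the time zone element by element instead of going through the .dt accessor:

```python
import pandas as pd

utc_time = pd.Timestamp("2021-10-11 12:16:00", tz="UTC")

# Same instant, two different time zones in one Series.
mixed = pd.Series([utc_time.tz_convert("America/New_York"), utc_time])
print(mixed.dtype)  # object, so mixed.dt raises AttributeError

# Strip the time zone from each Timestamp individually. The results are
# naive wall-clock times, which fit into a regular datetime64 column again.
naive = pd.to_datetime(mixed.apply(lambda ts: ts.tz_localize(None)))
print(naive.dtype)
print(naive)
```

The naive values keep the local wall-clock time of each row (08:16 for New York, 12:16 for UTC), so the time zone name has to be stored separately if it is still needed later.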


Addendum

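If we also want a column with the time zone name at the same time, we can make the function return a tuple and use result_type='expand' in the call to apply: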

def convert_to_location_tz(row, _tf=TimezoneFinder()):
    # if lat/lon aren't specified, we just want the existing name (e.g. UTC)
    if (row.Latitude == 0) & (row.Longitude == 0):
        return (row.TimeUTC.tzname(), row.TimeUTC)
    # otherwise, try to find tz name
    tzname = _tf.timezone_at(lng=row.Longitude, lat=row.Latitude)
    if tzname: # return the name if it is not None
        return (tzname, row.TimeUTC.tz_convert(tzname))
    return (row.TimeUTC.tzname(), row.TimeUTC) # else return existing name

df[['tzname', 'TimeLocal']] = df.apply(lambda row: convert_to_location_tz(row), axis=1, result_type='expand')

df[['tzname', 'TimeLocal']]
Out[9]: 
             tzname                  TimeLocal
0  America/New_York  2021-10-11 08:16:00-04:00
1               UTC  2021-10-11 12:16:00+00:00
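
With millions of rows, a row-wise apply for the conversion may become slow. Once a tzname column like the one above exists, one option is to convert per time zone group instead. The following is only a sketch under that assumption: the tzname values are hard-coded here rather than looked up, and the TimeLocalNaive column name is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "TimeUTC": pd.to_datetime(["2021-10-11 12:16:00+00:00"] * 3, utc=True),
    # Hard-coded for the sketch; in practice these come from the lat/long lookup.
    "tzname": ["America/New_York", "UTC", "America/New_York"],
})

# Convert each group of rows sharing a time zone in one vectorized call,
# then drop the tz so every piece is a naive datetime64 Series.
pieces = [
    group.dt.tz_convert(tzname).dt.tz_localize(None)
    for tzname, group in df.groupby("tzname")["TimeUTC"]
]

# Each piece keeps its original index, so the assignment aligns the rows.
df["TimeLocalNaive"] = pd.concat(pieces)
print(df)
```

Because every piece is naive, the resulting column has a datetime64 dtype rather than object, and the .dt accessor works on it again. This relies on the DataFrame index being unique.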
