为什么to_datetime可以正常处理正时区 (UTC+),但不能处理负时区 (UTC-)?



我一直在使用 Pandas 来转换 .CSV 文件转换为在另一个系统上可读的格式,我即将完成它,但我只是无法让它处理负时区(UTC-1、-2 等)

这是我使用的代码,它不是最整洁的,但它完成了 UTC+ 时区的工作,你能明白为什么它可能无法正确处理 UTC- 时区吗?

import pandas as pd
from datetime import datetime
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = 'utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding = 'utf-8')
df = pd.read_csv('CONCACAF_First_Round.csv', index_col=False)
df['starttime'] = df['starttime'].str.replace('s+', '')
df.insert(loc=2, column='season', value='2020')
df.insert(loc=6, column='awayscore', value='')
df.insert(loc=8, column='round_a', value='1')
df['venue'] = df['venue'].str.split(',').str[0]
df[['homescore', 'awayscore']] = df['homescore'].str.split('–',expand=True)
df['awayscore'] = df['awayscore'].str.split(' ').str[0]
df['starttime'] = df['starttime'].str.replace('UTC', ' UTC')
df['datepicker'] = df['datepicker'] + (' ') + df['starttime']
del df['starttime']
df[['datepicker', 'time', 'UTC']] = df.datepicker.str.split(" ", expand=True)
df['datepicker'] = df['datepicker'] + ' ' + df['time']
del df['time']
df['datepicker'] = df['datepicker'] + ' ' + df['UTC']
df['datepicker'] = df['datepicker'].str.replace('±', '+')
df['datepicker'] = df['datepicker'].str.replace('UTC', '')
del df['UTC']
df['datepicker'] = pd.to_datetime(df['datepicker'], utc=True)
df.insert(loc=1, column='starttime', value='')
df['starttime'] = df['datepicker'].dt.strftime('%H:%M:%S')
df['datepicker'] = df['datepicker'].dt.strftime('%Y-%m-%d')
print(df.head(10))

这就是它以负时区返回的内容

dateutil.parser._parser.ParserError: Unknown string format: 2015-03-25 19:30 −4

这是它在 UTC+ 时区正常工作时返回的内容

datepicker starttime season  hometeam awayteam homescore awayscore              venue round_a
0  2019-09-04  13:00:00   2020  Ethiopia  Lesotho         0         0  Bahir Dar Stadium       1

数据:CONCACAF_First_Round.csv

# copy the data to the clipboard and read with
df = pd.read_clipboard(sep=',')
datepicker,starttime,hometeam,awayteam,homescore,venue
2015-03-25,19:30 UTC−4,Bahamas,Bermuda,0–5,"Thomas Robinson Stadium, Nassau"
2015-03-29,15:00 UTC−3,Bermuda,Bahamas,3–0,"Bermuda National Stadium, Devonshire"
2015-03-26,19:00 UTC−4,British Virgin Islands,Dominica,2–3,"Windsor Park, Roseau (Dominica)[note 2]"
2015-03-29,17:00 UTC−4,Dominica,British Virgin Islands,0–0,"Windsor Park, Roseau"
2015-03-22,19:00 UTC−4,Barbados,U.S. Virgin Islands,0–1,"Barbados National Stadium, Bridgetown"
2015-03-26,15:30 UTC−4,U.S. Virgin Islands,Barbados,0–4,"Addelita Cancryn Junior High School Ground, Charlotte Amalie"
2015-03-23,20:00 UTC−4,Saint Kitts and Nevis,Turks and Caicos Islands,6–2,"Warner Park, Basseterre"
2015-03-26,19:00 UTC−4,Turks and Caicos Islands,Saint Kitts and Nevis,2–6,"TCIFA National Academy, Providenciales"
2015-03-23,18:00 UTC−6,Nicaragua,Anguilla,5–0,"Nicaragua National Football Stadium, Managua"
2015-03-29,17:00 UTC−4,Anguilla,Nicaragua,0–3,"Ronald Webster Park, The Valley"
2015-03-25,20:00 UTC−6,Belize,Cayman Islands,0–0,"FFB Stadium, Belmopan"
2015-03-29,19:00 UTC−5,Cayman Islands,Belize,1–1,"Truman Bodden Sports Complex, George Town"
2015-03-27,20:00 UTC−4,Curaçao,Montserrat,2–1,"Ergilio Hato Stadium, Willemstad"
2015-03-31,19:00 UTC−4,Montserrat,Curaçao,2–2,"Blakes Estate Stadium, St. John's"

fix_starttime细分

  • 数据的主要问题是符号不正确,-.请注意,它略长。
    • '−'替换为'-'
    • ' UTC'替换为':00 '
    • '19:30 UTC−4'转换为'19:30:00 -4'
  • fix_date函数执行以下操作,'19:30:00 -4'
    • y = v.split(' ')y = ['19:30:00', '-4']
    • z= y[1][0] + y[1][1:].zfill(2) + '00'z = '-' + '0400'z = '-0400'
      • str.zfill用零填充str以适合长度width,在本例中为2。所以'4'变得'04''10'保持'10'
    • t = y[0]t = '19:30:00'
    • return 19:30:00-0400
  • datepicker和格式正确的starttime结合起来,并将其转换为datetime64[ns, UTC]格式的列。
import pandas as pd
# read the file
df = pd.read_csv('CONCACAF_First_Round.csv')
# function
def fix_starttime(v: str) -> str:
y = v.split(' ')
z = y[1][0] + y[1][1:].zfill(2) + '00'
t = y[0]
return t + z

# fix starttime
df['starttime'] = df.starttime.str.replace('−', '-').str.replace(' UTC', ':00 ').apply(fix_starttime)
# create datetime column in datetime64[ns, UTC] format
df['utc_datetime'] = pd.to_datetime(df.datepicker + 'T' + df.starttime, utc=True)

合并

# read the file
df = pd.read_csv('CONCACAF_First_Round.csv')
# fix starttime
df['starttime'] = df.starttime.str.replace('−', '-').str.replace(' UTC', ':00 ').str.split(' ').apply(lambda y: y[0] + y[1][0] + y[1][1:].zfill(2) + '00')
# create datetime column in datetime64[ns, UTC] format
df['utc_datetime'] = pd.to_datetime(df.datepicker + 'T' + df.starttime, utc=True)

额外

  • 一旦有具有日期时间格式的列,请使用 pandas.dt访问器提取日期时间组件。
# extract time from utc_datetime
df.utc_datetime.dt.time
# extract date from utc_datetime
df.utc_datetime.dt.date

结果

datepicker      starttime                  hometeam                  awayteam homescore                                                         venue              utc_datetime
2015-03-25  19:30:00-0400                   Bahamas                   Bermuda       0–5                               Thomas Robinson Stadium, Nassau 2015-03-25 23:30:00+00:00
2015-03-29  15:00:00-0300                   Bermuda                   Bahamas       3–0                          Bermuda National Stadium, Devonshire 2015-03-29 18:00:00+00:00
2015-03-26  19:00:00-0400    British Virgin Islands                  Dominica       2–3                       Windsor Park, Roseau (Dominica)[note 2] 2015-03-26 23:00:00+00:00
2015-03-29  17:00:00-0400                  Dominica    British Virgin Islands       0–0                                          Windsor Park, Roseau 2015-03-29 21:00:00+00:00
2015-03-22  19:00:00-0400                  Barbados       U.S. Virgin Islands       0–1                         Barbados National Stadium, Bridgetown 2015-03-22 23:00:00+00:00
2015-03-26  15:30:00-0400       U.S. Virgin Islands                  Barbados       0–4  Addelita Cancryn Junior High School Ground, Charlotte Amalie 2015-03-26 19:30:00+00:00
2015-03-23  20:00:00-0400     Saint Kitts and Nevis  Turks and Caicos Islands       6–2                                       Warner Park, Basseterre 2015-03-24 00:00:00+00:00
2015-03-26  19:00:00-0400  Turks and Caicos Islands     Saint Kitts and Nevis       2–6                        TCIFA National Academy, Providenciales 2015-03-26 23:00:00+00:00
2015-03-23  18:00:00-0600                 Nicaragua                  Anguilla       5–0                  Nicaragua National Football Stadium, Managua 2015-03-24 00:00:00+00:00
2015-03-29  17:00:00-0400                  Anguilla                 Nicaragua       0–3                               Ronald Webster Park, The Valley 2015-03-29 21:00:00+00:00
2015-03-25  20:00:00-0600                    Belize            Cayman Islands       0–0                                         FFB Stadium, Belmopan 2015-03-26 02:00:00+00:00
2015-03-29  19:00:00-0500            Cayman Islands                    Belize       1–1                     Truman Bodden Sports Complex, George Town 2015-03-30 00:00:00+00:00
2015-03-27  20:00:00-0400                   Curaçao                Montserrat       2–1                              Ergilio Hato Stadium, Willemstad 2015-03-28 00:00:00+00:00
2015-03-31  19:00:00-0400                Montserrat                   Curaçao       2–2                             Blakes Estate Stadium, St. John's 2015-03-31 23:00:00+00:00

你的"减号"并不都是减号。 例如,在您的错误消息中:

Unknown string format: 2015-03-25 19:30 −4

如果将看起来像减号的倒数第二个字符作为字符串复制到 Python 中,您将看到:

>>> '−'
'xe2x88x92'

那就是U+2212,一个特殊的Unicode"减号",与键盘上的"-"不同。

您需要将这些Unicode"减号"替换为普通的ASCII减号,然后它应该可以工作。

在 2013 年有一个关于这个的 Python 思想讨论,"Unicode 减去数字转换中的符号",其中有人说:

至于支持非 ASCII 加号和减号,我原则上很热,但在实践中不冷不热。我认为这将是一个不错的选择,如果有人做了工作来确定哪些角色应该被接受,我会支持将其添加为新功能。但我不认为缺乏对非 ASCII 数字符号的支持是一个错误。

这条线索最终是徒劳的,因为每个人都多次用诸如泰国数字是否应该在float()中使用之类的次要话题来打死他们的爱好马。 如果有人承担这项任务,则更集中地支持 U+2212 和一些 Unicode"加号"可能会更好。

最新更新