我一直在使用 Pandas 来转换 .CSV 文件转换为在另一个系统上可读的格式,我即将完成它,但我只是无法让它处理负时区(UTC-1、-2 等)
这是我使用的代码,它不是最整洁的,但它完成了 UTC+ 时区的工作,你能明白为什么它可能无法正确处理 UTC- 时区吗?
import pandas as pd
from datetime import datetime
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = 'utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding = 'utf-8')
df = pd.read_csv('CONCACAF_First_Round.csv', index_col=False)
df['starttime'] = df['starttime'].str.replace('s+', '')
df.insert(loc=2, column='season', value='2020')
df.insert(loc=6, column='awayscore', value='')
df.insert(loc=8, column='round_a', value='1')
df['venue'] = df['venue'].str.split(',').str[0]
df[['homescore', 'awayscore']] = df['homescore'].str.split('–',expand=True)
df['awayscore'] = df['awayscore'].str.split(' ').str[0]
df['starttime'] = df['starttime'].str.replace('UTC', ' UTC')
df['datepicker'] = df['datepicker'] + (' ') + df['starttime']
del df['starttime']
df[['datepicker', 'time', 'UTC']] = df.datepicker.str.split(" ", expand=True)
df['datepicker'] = df['datepicker'] + ' ' + df['time']
del df['time']
df['datepicker'] = df['datepicker'] + ' ' + df['UTC']
df['datepicker'] = df['datepicker'].str.replace('±', '+')
df['datepicker'] = df['datepicker'].str.replace('UTC', '')
del df['UTC']
df['datepicker'] = pd.to_datetime(df['datepicker'], utc=True)
df.insert(loc=1, column='starttime', value='')
df['starttime'] = df['datepicker'].dt.strftime('%H:%M:%S')
df['datepicker'] = df['datepicker'].dt.strftime('%Y-%m-%d')
print(df.head(10))
这就是它以负时区返回的内容
dateutil.parser._parser.ParserError: Unknown string format: 2015-03-25 19:30 −4
这是它在 UTC+ 时区正常工作时返回的内容
datepicker starttime season hometeam awayteam homescore awayscore venue round_a
0 2019-09-04 13:00:00 2020 Ethiopia Lesotho 0 0 Bahir Dar Stadium 1
数据:CONCACAF_First_Round.csv
# copy the data to the clipboard and read with
df = pd.read_clipboard(sep=',')
datepicker,starttime,hometeam,awayteam,homescore,venue
2015-03-25,19:30 UTC−4,Bahamas,Bermuda,0–5,"Thomas Robinson Stadium, Nassau"
2015-03-29,15:00 UTC−3,Bermuda,Bahamas,3–0,"Bermuda National Stadium, Devonshire"
2015-03-26,19:00 UTC−4,British Virgin Islands,Dominica,2–3,"Windsor Park, Roseau (Dominica)[note 2]"
2015-03-29,17:00 UTC−4,Dominica,British Virgin Islands,0–0,"Windsor Park, Roseau"
2015-03-22,19:00 UTC−4,Barbados,U.S. Virgin Islands,0–1,"Barbados National Stadium, Bridgetown"
2015-03-26,15:30 UTC−4,U.S. Virgin Islands,Barbados,0–4,"Addelita Cancryn Junior High School Ground, Charlotte Amalie"
2015-03-23,20:00 UTC−4,Saint Kitts and Nevis,Turks and Caicos Islands,6–2,"Warner Park, Basseterre"
2015-03-26,19:00 UTC−4,Turks and Caicos Islands,Saint Kitts and Nevis,2–6,"TCIFA National Academy, Providenciales"
2015-03-23,18:00 UTC−6,Nicaragua,Anguilla,5–0,"Nicaragua National Football Stadium, Managua"
2015-03-29,17:00 UTC−4,Anguilla,Nicaragua,0–3,"Ronald Webster Park, The Valley"
2015-03-25,20:00 UTC−6,Belize,Cayman Islands,0–0,"FFB Stadium, Belmopan"
2015-03-29,19:00 UTC−5,Cayman Islands,Belize,1–1,"Truman Bodden Sports Complex, George Town"
2015-03-27,20:00 UTC−4,Curaçao,Montserrat,2–1,"Ergilio Hato Stadium, Willemstad"
2015-03-31,19:00 UTC−4,Montserrat,Curaçao,2–2,"Blakes Estate Stadium, St. John's"
fix_starttime
细分
- 数据的主要问题是
−
符号不正确,-
.请注意,它略长。- 将
'−'
替换为'-'
- 将
' UTC'
替换为':00 '
- 将
'19:30 UTC−4'
转换为'19:30:00 -4'
- 将
fix_date
函数执行以下操作,'19:30:00 -4'
y = v.split(' ')
→y = ['19:30:00', '-4']
z= y[1][0] + y[1][1:].zfill(2) + '00'
→z = '-' + '0400'
→z = '-0400'
str.zfill
用零填充str
以适合长度width
,在本例中为2。所以'4'
变得'04'
,'10'
保持'10'
。
t = y[0]
→t = '19:30:00'
return 19:30:00-0400
- 将
datepicker
和格式正确的starttime
结合起来,并将其转换为datetime64[ns, UTC]
格式的列。
import pandas as pd
# read the file
df = pd.read_csv('CONCACAF_First_Round.csv')
# function
def fix_starttime(v: str) -> str:
y = v.split(' ')
z = y[1][0] + y[1][1:].zfill(2) + '00'
t = y[0]
return t + z
# fix starttime
df['starttime'] = df.starttime.str.replace('−', '-').str.replace(' UTC', ':00 ').apply(fix_starttime)
# create datetime column in datetime64[ns, UTC] format
df['utc_datetime'] = pd.to_datetime(df.datepicker + 'T' + df.starttime, utc=True)
合并
# read the file
df = pd.read_csv('CONCACAF_First_Round.csv')
# fix starttime
df['starttime'] = df.starttime.str.replace('−', '-').str.replace(' UTC', ':00 ').str.split(' ').apply(lambda y: y[0] + y[1][0] + y[1][1:].zfill(2) + '00')
# create datetime column in datetime64[ns, UTC] format
df['utc_datetime'] = pd.to_datetime(df.datepicker + 'T' + df.starttime, utc=True)
额外
- 一旦有具有日期时间格式的列,请使用 pandas
.dt
访问器提取日期时间组件。
# extract time from utc_datetime
df.utc_datetime.dt.time
# extract date from utc_datetime
df.utc_datetime.dt.date
结果
datepicker starttime hometeam awayteam homescore venue utc_datetime
2015-03-25 19:30:00-0400 Bahamas Bermuda 0–5 Thomas Robinson Stadium, Nassau 2015-03-25 23:30:00+00:00
2015-03-29 15:00:00-0300 Bermuda Bahamas 3–0 Bermuda National Stadium, Devonshire 2015-03-29 18:00:00+00:00
2015-03-26 19:00:00-0400 British Virgin Islands Dominica 2–3 Windsor Park, Roseau (Dominica)[note 2] 2015-03-26 23:00:00+00:00
2015-03-29 17:00:00-0400 Dominica British Virgin Islands 0–0 Windsor Park, Roseau 2015-03-29 21:00:00+00:00
2015-03-22 19:00:00-0400 Barbados U.S. Virgin Islands 0–1 Barbados National Stadium, Bridgetown 2015-03-22 23:00:00+00:00
2015-03-26 15:30:00-0400 U.S. Virgin Islands Barbados 0–4 Addelita Cancryn Junior High School Ground, Charlotte Amalie 2015-03-26 19:30:00+00:00
2015-03-23 20:00:00-0400 Saint Kitts and Nevis Turks and Caicos Islands 6–2 Warner Park, Basseterre 2015-03-24 00:00:00+00:00
2015-03-26 19:00:00-0400 Turks and Caicos Islands Saint Kitts and Nevis 2–6 TCIFA National Academy, Providenciales 2015-03-26 23:00:00+00:00
2015-03-23 18:00:00-0600 Nicaragua Anguilla 5–0 Nicaragua National Football Stadium, Managua 2015-03-24 00:00:00+00:00
2015-03-29 17:00:00-0400 Anguilla Nicaragua 0–3 Ronald Webster Park, The Valley 2015-03-29 21:00:00+00:00
2015-03-25 20:00:00-0600 Belize Cayman Islands 0–0 FFB Stadium, Belmopan 2015-03-26 02:00:00+00:00
2015-03-29 19:00:00-0500 Cayman Islands Belize 1–1 Truman Bodden Sports Complex, George Town 2015-03-30 00:00:00+00:00
2015-03-27 20:00:00-0400 Curaçao Montserrat 2–1 Ergilio Hato Stadium, Willemstad 2015-03-28 00:00:00+00:00
2015-03-31 19:00:00-0400 Montserrat Curaçao 2–2 Blakes Estate Stadium, St. John's 2015-03-31 23:00:00+00:00
你的"减号"并不都是减号。 例如,在您的错误消息中:
Unknown string format: 2015-03-25 19:30 −4
如果将看起来像减号的倒数第二个字符作为字符串复制到 Python 中,您将看到:
>>> '−'
'xe2x88x92'
那就是U+2212,一个特殊的Unicode"减号",与键盘上的"-"不同。
您需要将这些Unicode"减号"替换为普通的ASCII减号,然后它应该可以工作。
在 2013 年有一个关于这个的 Python 思想讨论,"Unicode 减去数字转换中的符号",其中有人说:
至于支持非 ASCII 加号和减号,我原则上很热,但在实践中不冷不热。我认为这将是一个不错的选择,如果有人做了工作来确定哪些角色应该被接受,我会支持将其添加为新功能。但我不认为缺乏对非 ASCII 数字符号的支持是一个错误。
这条线索最终是徒劳的,因为每个人都多次用诸如泰国数字是否应该在float()
中使用之类的次要话题来打死他们的爱好马。 如果有人承担这项任务,则更集中地支持 U+2212 和一些 Unicode"加号"可能会更好。