I'm having some trouble formatting dates and times. I have data files that contain dates and times. The sample values below represent part of my data.
import pandas as pd

data = pd.DataFrame()
data['Date'] = ['01 Jul 2014 - Qualification', '30 Sep 2014 - Group Stage', '17 Mar 2015 - Play Offs', ' 19:00:00']
data['ID'] = [1, 2, 3, 4]
I created a new column and tried to convert it with pd.to_datetime:
data['date1'] = pd.to_datetime(data.Date,errors = 'coerce')
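Every value in the resulting datetime column is NaT. I would also like to create two more columns, for example time and stage, holding the time and the stage of the game.
How can I solve this?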
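The text in the Date column is more than just a date/time, so you cannot convert it to datetime objects the way you tried. You need to isolate the date/time part of the text from the rest. To do that, split on '-' with expand=True to get the date and the stage text in separate columns of a temporary DataFrame, df_temp, then use those columns to create each new column in your existing DataFrame: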
In [27]: df_temp = data['Date'].str.split('-', expand=True)
In [28]: data['date1'] = df_temp[0]
In [29]: data['stage'] = df_temp[1]
In [30]: data
Out[30]:
                          Date  ID         date1           stage
0  01 Jul 2014 - Qualification   1  01 Jul 2014     Qualification
1    30 Sep 2014 - Group Stage   2  30 Sep 2014       Group Stage
2      17 Mar 2015 - Play Offs   3  17 Mar 2015         Play Offs
3                     19:00:00   4      19:00:00            None
In [31]: data['date1'] = pd.to_datetime(data.date1,errors = 'coerce')
In [32]: data
Out[32]:
                          Date  ID       date1           stage
0  01 Jul 2014 - Qualification   1  2014-07-01   Qualification
1    30 Sep 2014 - Group Stage   2  2014-09-30     Group Stage
2      17 Mar 2015 - Play Offs   3  2015-03-17       Play Offs
3                     19:00:00   4         NaT            None
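Splitting on a bare '-' leaves the surrounding spaces on both parts, and the question also asked for a time column, which the steps above do not produce. A minimal sketch of one way to handle both, assuming every date looks like the samples (day, abbreviated month name, four-digit year) and every time is HH:MM:SS; the names parts and time, and the format string, are illustrative assumptions and not part of the answer above:

```python
import pandas as pd

data = pd.DataFrame({
    'Date': ['01 Jul 2014 - Qualification', '30 Sep 2014 - Group Stage',
             '17 Mar 2015 - Play Offs', ' 19:00:00'],
    'ID': [1, 2, 3, 4],
})

# Split on the first '-' only, then strip the padding left around each part.
parts = data['Date'].str.split('-', n=1, expand=True)

# Assumed format: '%d %b %Y' (e.g. '01 Jul 2014'); anything else becomes NaT.
data['date1'] = pd.to_datetime(parts[0].str.strip(), format='%d %b %Y',
                               errors='coerce')
data['stage'] = parts[1].str.strip()

# Rows that hold a time of day: pull HH:MM:SS out with a regex (NaN elsewhere).
data['time'] = data['Date'].str.extract(r'(\d{1,2}:\d{2}:\d{2})', expand=False)

print(data)
```

The time column is left as strings here; whether to convert it further depends on how the time-only rows relate to the dated rows, which the sample data does not say.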
You can use Series.str.extract here:
# https://stackoverflow.com/a/47656743
pat = r'(\d+/\d+(?:/\d+)?|(?:\d+ )?(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[.,]?(?:-\d+-\d+| \d+(?:th|rd|st|nd)?,? \d+| \d+)|\d{4})'
# https://stackoverflow.com/a/46069885
pat = r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'
s = data['Date'].str.extract(pat, expand=False)
data['date1'] = pd.to_datetime(s, errors = 'coerce')
print(data)
                          Date  ID       date1
0  01 Jul 2014 - Qualification   1  2014-07-01
1    30 Sep 2014 - Group Stage   2  2014-09-30
2      17 Mar 2015 - Play Offs   3  2015-03-17
3                     19:00:00   4         NaT
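If the rows always follow the "date - stage" shape of the sample data, a much narrower pattern with named groups can produce the date and the stage in a single str.extract call. This is only a sketch under that assumption; the pattern, the group names date1 and stage, and the explicit format string are illustrative and not taken from the linked answers:

```python
import pandas as pd

data = pd.DataFrame({
    'Date': ['01 Jul 2014 - Qualification', '30 Sep 2014 - Group Stage',
             '17 Mar 2015 - Play Offs', ' 19:00:00'],
    'ID': [1, 2, 3, 4],
})

# Named groups become the column names of the extracted DataFrame.
# Rows that do not match the pattern get NaN in every group.
pat = r'(?P<date1>\d{1,2} [A-Za-z]{3} \d{4})\s*-\s*(?P<stage>.+)'
parts = data['Date'].str.extract(pat)

# Assumed format: day, abbreviated month name, four-digit year.
data['date1'] = pd.to_datetime(parts['date1'], format='%d %b %Y',
                               errors='coerce')
data['stage'] = parts['stage'].str.strip()

print(data)
```

The broader patterns from the linked answers tolerate more date spellings; the narrow one trades that flexibility for readability and for capturing the stage at the same time.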