Datetime column with two different formats



I have a datetime column whose values come in two different formats:

  1. yyyy-mm-dd hh:mm:ss.s
  2. UNIX timestamp
0     2018-05-07 04:28:45.970
1     2018-05-13 07:26:04.964
2     2018-05-11 11:43:42.832
3     2018-05-13 15:20:23.436
4     2018-05-08 20:44:25.238
5     2018-05-24 13:10:32.908
6     2018-05-23 14:52:18.564
7     2018-05-13 03:13:21.958
8     2018-05-10 02:04:54.844
9     2018-05-25 13:40:03.094
10    2018-05-27 05:35:31.253
11    2018-05-16 11:56:33.028
12    2018-05-23 18:40:27.934
13        1527051855673000000
14        1526366895249000000
15    2018-05-08 16:53:59.632
16    2018-05-27 07:16:57.625
17        1526982800506000000
18    2018-05-16 16:07:32.948
19    2018-05-08 02:35:40.805
20    2018-05-17 16:12:56.364
21        1527058921790000000
22    2018-05-07 04:39:52.046
23    2018-05-10 06:28:53.391
24    2018-05-08 12:40:02.153
25                        NaN

How can I separate them and convert them into a single datetime column?

Assuming the OP's DataFrame contains only strings in the format %Y-%m-%d %H:%M:%S.%f, Unix timestamps, or NaN:

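from datetime import datetime, timezone
import pandas as pd

def unix_or_dt(a):
    # Return the value reformatted as my_format, whether it is a datetime string or a Unix timestamp
    my_format = '%Y/%m/%d %H/%M/%S'
    try:
        return datetime.strptime(str(a), '%Y-%m-%d %H:%M:%S.%f').strftime(my_format)
    except ValueError:
        # Not a datetime string: treat the first 10 digits as whole seconds since the epoch (UTC)
        return datetime.fromtimestamp(int(str(a)[:10]), tz=timezone.utc).strftime(my_format)

data = pd.read_csv('data.csv')
# Drop rows with missing values so that unix_or_dt never receives NaN
data.dropna(inplace=True)
data['formatted'] = data['datetime'].apply(unix_or_dt)
data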

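This handles both the nulls (they are dropped) and the Unix timestamps, which are too large to be read directly as seconds (only the first 10 digits are used).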
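Taking the first 10 characters works because these timestamps appear to count nanoseconds since the epoch (19 digits), so the leading 10 digits are the whole seconds; the fractional part is discarded. As a minimal sketch, assuming that nanosecond interpretation, the hypothetical helper `ns_to_datetime` below (not part of the answer above) keeps the sub-second part as well:

```python
from datetime import datetime, timedelta, timezone

def ns_to_datetime(ts):
    """Hypothetical helper: convert a nanosecond Unix timestamp (str or int) to a UTC datetime."""
    seconds, nanos = divmod(int(ts), 10**9)
    # datetime stores microseconds at most, so the nanoseconds are truncated to microseconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(microseconds=nanos // 1000)

ts = '1527051855673000000'
print(str(int(ts) // 10**9) == ts[:10])  # True: the first 10 digits are the whole seconds
print(ns_to_datetime(ts))                # 2018-05-23 05:04:15.673000+00:00
```

Integer arithmetic is used here instead of float division so that no precision is lost on 19-digit values.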

You can try this:

import pandas as pd
s = df['dates']
# split the pd.Series into NaN values and non-NaN values
m_na = s.isna()
s_na = s[m_na]
s = s[~m_na]
# separate datetime strings from Unix timestamps (only the strings contain a space)
m = s.str.contains(' ')
# convert each group to datetime according to its current format
s_dt = pd.to_datetime(s[m])
# the Unix values are nanoseconds since the epoch
s_ux = pd.to_datetime(s[~m].astype('int64'), unit='ns')
# stitch the NaN series and both datetime series back together (assignment realigns on the index)
df['dates'] = pd.concat([s_dt, s_ux, s_na])

Output:

0     2018-05-07 04:28:45.970000
1     2018-05-13 07:26:04.964000
2     2018-05-11 11:43:42.832000
3     2018-05-13 15:20:23.436000
4     2018-05-08 20:44:25.238000
5     2018-05-24 13:10:32.908000
6     2018-05-23 14:52:18.564000
7     2018-05-13 03:13:21.958000
8     2018-05-10 02:04:54.844000
9     2018-05-25 13:40:03.094000
10    2018-05-27 05:35:31.253000
11    2018-05-16 11:56:33.028000
12    2018-05-23 18:40:27.934000
13    2018-05-23 05:04:15.673000
14    2018-05-15 06:48:15.249000
15    2018-05-08 16:53:59.632000
16    2018-05-27 07:16:57.625000
17    2018-05-22 09:53:20.506000
18    2018-05-16 16:07:32.948000
19    2018-05-08 02:35:40.805000
20    2018-05-17 16:12:56.364000
21    2018-05-23 07:02:01.790000
22    2018-05-07 04:39:52.046000
23    2018-05-10 06:28:53.391000
24    2018-05-08 12:40:02.153000
25                           NaN
Name: dates, dtype: object
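The column shown above has dtype object. If a true datetime column is wanted, with NaT for the missing value, the same splitting idea can be wrapped in a function. This is only a sketch under stated assumptions: the name `mixed_to_datetime` is hypothetical, the Unix values are assumed to be nanoseconds since the epoch, every datetime string is assumed to contain a space, and the index is assumed to be unique.

```python
import numpy as np
import pandas as pd

def mixed_to_datetime(s):
    """Hypothetical helper: parse a Series mixing datetime strings, ns Unix timestamps, and NaN."""
    is_text = s.str.contains(' ', na=False)  # datetime strings contain a space
    is_unix = s.notna() & ~is_text           # every other non-missing value
    parsed = pd.concat([
        pd.to_datetime(s[is_text], format='%Y-%m-%d %H:%M:%S.%f'),
        # integer conversion keeps full precision on 19-digit values
        pd.to_datetime(s[is_unix].astype('int64'), unit='ns'),
    ])
    # Restore the original order; rows that were missing become NaT
    return parsed.reindex(s.index)

sample = pd.Series(['2018-05-07 04:28:45.970', '1527051855673000000', np.nan])
result = mixed_to_datetime(sample)
print(result)
```

With the question's data this would be called as `df['dates'] = mixed_to_datetime(df['dates'])`.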

Sample input:

import numpy as np
import pandas as pd
df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25}, 'dates': {0: '2018-05-07 04:28:45.970', 1: '2018-05-13 07:26:04.964', 2: '2018-05-11 11:43:42.832', 3: '2018-05-13 15:20:23.436', 4: '2018-05-08 20:44:25.238', 5: '2018-05-24 13:10:32.908', 6: '2018-05-23 14:52:18.564', 7: '2018-05-13 03:13:21.958', 8: '2018-05-10 02:04:54.844', 9: '2018-05-25 13:40:03.094', 10: '2018-05-27 05:35:31.253', 11: '2018-05-16 11:56:33.028', 12: '2018-05-23 18:40:27.934', 13: '1527051855673000000', 14: '1526366895249000000', 15: '2018-05-08 16:53:59.632', 16: '2018-05-27 07:16:57.625', 17: '1526982800506000000', 18: '2018-05-16 16:07:32.948', 19: '2018-05-08 02:35:40.805', 20: '2018-05-17 16:12:56.364', 21: '1527058921790000000', 22: '2018-05-07 04:39:52.046', 23: '2018-05-10 06:28:53.391', 24: '2018-05-08 12:40:02.153', 25: np.nan}})
