I have a datetime column in which the dates and times appear in two different formats:
yyyy-mm-dd hh:mm:ss.s
UNIX timestamp
0 2018-05-07 04:28:45.970
1 2018-05-13 07:26:04.964
2 2018-05-11 11:43:42.832
3 2018-05-13 15:20:23.436
4 2018-05-08 20:44:25.238
5 2018-05-24 13:10:32.908
6 2018-05-23 14:52:18.564
7 2018-05-13 03:13:21.958
8 2018-05-10 02:04:54.844
9 2018-05-25 13:40:03.094
10 2018-05-27 05:35:31.253
11 2018-05-16 11:56:33.028
12 2018-05-23 18:40:27.934
13 1527051855673000000
14 1526366895249000000
15 2018-05-08 16:53:59.632
16 2018-05-27 07:16:57.625
17 1526982800506000000
18 2018-05-16 16:07:32.948
19 2018-05-08 02:35:40.805
20 2018-05-17 16:12:56.364
21 1527058921790000000
22 2018-05-07 04:39:52.046
23 2018-05-10 06:28:53.391
24 2018-05-08 12:40:02.153
25 NaN
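How can I separate them and convert them into a single datetime column?
Assuming the OP's data frame contains only strings in the %Y-%m-%d %H:%M:%S.%f format, Unix timestamps, or NaN: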
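from datetime import datetime, timezone
import pandas as pd

def unix_or_dt(a):
    my_format = '%Y/%m/%d %H/%M/%S'
    try:
        # first try to parse the value as a formatted date string
        return datetime.strptime(str(a), '%Y-%m-%d %H:%M:%S.%f').strftime(my_format)
    except ValueError:
        # otherwise treat it as a Unix timestamp; its first 10 digits are whole seconds (UTC)
        return datetime.fromtimestamp(int(str(a)[:10]), tz=timezone.utc).strftime(my_format)

data = pd.read_csv('data.csv')
# drop rows containing NaN before converting
data.dropna(inplace=True)
data['formatted'] = data['datetime'].apply(unix_or_dt)
data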
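This handles both the null values and the problem of the Unix values being too large to store: rows with NaN are dropped, and only the first 10 digits (whole seconds) of each Unix value are used.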
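To make the truncation concrete, here is a minimal sketch using the value from row 13 of the question (the variable names are illustrative only). It assumes the 19-digit values are nanoseconds since the epoch, which is what their length suggests. It shows that the first 10 digits are whole seconds, and that keeping the milliseconds requires working with the full integer instead:

```python
from datetime import datetime, timezone

raw = '1527051855673000000'  # row 13 of the question, assumed to be nanoseconds

# The first 10 digits are whole seconds; the remaining 9 digits are the fraction.
seconds = int(raw[:10])
truncated = datetime.fromtimestamp(seconds, tz=timezone.utc)

# Integer arithmetic on the full value keeps the sub-second part without float rounding.
whole, nanos = divmod(int(raw), 1_000_000_000)
exact = datetime.fromtimestamp(whole, tz=timezone.utc).replace(microsecond=nanos // 1000)

print(truncated.strftime('%Y/%m/%d %H/%M/%S'))  # 2018/05/23 05/04/15
print(exact.isoformat())                        # 2018-05-23T05:04:15.673000+00:00
```

Whether losing the milliseconds matters depends on how the column is used afterwards.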
You can try this:
import pandas as pd

s = df['dates']
# split the Series into NaN and non-NaN values (take the NaN part before filtering s)
m_na = s.isna()
s_na = s[m_na]
s = s[~m_na]
# separate formatted date strings from Unix timestamps: only the former contain a space
m = s.str.contains(' ')
# convert each group to datetime, based on its current representation
s_dt = pd.to_datetime(s[m])
# the Unix values have 19 digits, i.e. nanoseconds, so parse them with unit='ns' (UTC)
s_ux = pd.to_datetime(s[~m].astype('int64'), unit='ns')
# stitch the NaN rows and both converted groups back together in the original order
df['dates'] = pd.to_datetime(pd.concat([s_dt, s_ux, s_na])).sort_index()
Output:
0 2018-05-07 04:28:45.970
1 2018-05-13 07:26:04.964
2 2018-05-11 11:43:42.832
3 2018-05-13 15:20:23.436
4 2018-05-08 20:44:25.238
5 2018-05-24 13:10:32.908
6 2018-05-23 14:52:18.564
7 2018-05-13 03:13:21.958
8 2018-05-10 02:04:54.844
9 2018-05-25 13:40:03.094
10 2018-05-27 05:35:31.253
11 2018-05-16 11:56:33.028
12 2018-05-23 18:40:27.934
13 2018-05-23 05:04:15.673
14 2018-05-15 06:48:15.249
15 2018-05-08 16:53:59.632
16 2018-05-27 07:16:57.625
17 2018-05-22 09:53:20.506
18 2018-05-16 16:07:32.948
19 2018-05-08 02:35:40.805
20 2018-05-17 16:12:56.364
21 2018-05-23 07:02:01.790
22 2018-05-07 04:39:52.046
23 2018-05-10 06:28:53.391
24 2018-05-08 12:40:02.153
25 NaT
Name: dates, dtype: datetime64[ns]
Sample input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25}, 'dates': {0: '2018-05-07 04:28:45.970', 1: '2018-05-13 07:26:04.964', 2: '2018-05-11 11:43:42.832', 3: '2018-05-13 15:20:23.436', 4: '2018-05-08 20:44:25.238', 5: '2018-05-24 13:10:32.908', 6: '2018-05-23 14:52:18.564', 7: '2018-05-13 03:13:21.958', 8: '2018-05-10 02:04:54.844', 9: '2018-05-25 13:40:03.094', 10: '2018-05-27 05:35:31.253', 11: '2018-05-16 11:56:33.028', 12: '2018-05-23 18:40:27.934', 13: '1527051855673000000', 14: '1526366895249000000', 15: '2018-05-08 16:53:59.632', 16: '2018-05-27 07:16:57.625', 17: '1526982800506000000', 18: '2018-05-16 16:07:32.948', 19: '2018-05-08 02:35:40.805', 20: '2018-05-17 16:12:56.364', 21: '1527058921790000000', 22: '2018-05-07 04:39:52.046', 23: '2018-05-10 06:28:53.391', 24: '2018-05-08 12:40:02.153', 25: np.nan}})
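For reuse, the conversion could also be wrapped in a small function. The following is only a sketch under stated assumptions: the name `to_single_datetime` is hypothetical, every all-digit entry is assumed to be a Unix timestamp in nanoseconds, every other non-null entry is assumed to follow `%Y-%m-%d %H:%M:%S.%f`, and the index is assumed to be unique. It uses a strict all-digits test rather than looking for a space, so unexpected strings raise an error instead of being parsed as timestamps.

```python
import numpy as np
import pandas as pd

def to_single_datetime(s: pd.Series) -> pd.Series:
    """Convert a Series mixing date strings, nanosecond Unix timestamps and NaN."""
    # Assumption: all-digit entries are Unix timestamps in nanoseconds.
    is_unix = s.str.fullmatch(r'\d+', na=False)
    # Parse the integers exactly (no float rounding); values are interpreted as UTC.
    unix_part = pd.to_datetime(s[is_unix].astype('int64'), unit='ns')
    # Everything else is parsed with the explicit string format; NaN becomes NaT.
    text_part = pd.to_datetime(s[~is_unix], format='%Y-%m-%d %H:%M:%S.%f')
    # Restore the original row order (requires a unique index).
    return pd.concat([unix_part, text_part]).reindex(s.index)

# A three-row excerpt in the same shape as the sample input.
sample = pd.Series(['2018-05-07 04:28:45.970', '1527051855673000000', np.nan])
result = to_single_datetime(sample)
print(result)
```

On the full sample input this would be called as `df['dates'] = to_single_datetime(df['dates'])`.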