使用分钟级 OHLCV 数据构建 Pandas 数据帧,其中包含 930、931、932...17:28、17:29、17:30 格式的列'Time'



我有这种格式的分钟分辨率库存数据

19980102 930 3.29473 3.30923 3.29473 3.29473 76119.2 4 0 0 0

列在哪里"日期","时间","开放","高","低","关闭","音量","分裂因子","收入","股息"

(线终止于0x0a(lineFeed)不cr lf。它看起来像pandas.read_csv cops,读取此数据的读数非常好。)

数据来自一家名为www.quantquote.com

的公司

我是Pandas和Python的新手,但我已经阅读了Wes McKinney的" Python进行数据分析"。这是我的第一个Python项目,但我已经写了很多年的C,C ,C#,汇编器等代码。

我的目的是为每个股票创建一个数据框,并将日期和时间合并为时间表" dateTime"索引。然后,我打算将其重新置于每日和每月的数据范围。

我已经尝试了多种方法来获取pandas.read_csv来一步读取此信息,但我找不到一种方法来读取"时间"列。它是一个奇怪的

930,931,932 ... 17:28,17:29,17:30

格式

有更好的方法吗?如果可能

这是我到目前为止所拥有的。

from pandas import Series, DataFrame
import pandas as pd
import os
mypath = "c:PythonStuffQuantQuoteDatatable_aapl.txt"
columnHeadings = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Split Factor', 'Earnings', 'Dividends']
minData = pd.read_csv(mypath, names=columnHeadings, dtype= {"Date":str, "Time":str}, parse_dates = False, sep=' ')
minData["Time"] = minData["Time"].map(lambda x : x[:-2] + ":" + x[-2:] + ":00+00:00")
minData["DateTimeStr"] = minData["Date"] + " " + minData["Time"]
minData["DateTime"] = pd.to_datetime(minData["DateTimeStr"])
minData.index = minData["DateTime"]

这是" table_aapl.txt"文件

的前几行

19980102 930 3.29473 3.30923 3.29473 3.29473 76119.2 4 0 0 0 19980102 931 3.29473 3.29473 3.2778 3.29473 263522 4 0 0 0 19980102 932 3.29473 >

任何帮助都非常感谢


编辑:最后,这是我可以提出的最好的解决方案。回想起来,一切都很明显:)

感谢您的帮助:)

columnHeadings = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Split Factor', 'Earnings', 'Dividends']
minData = pd.read_csv(
    myFile,
    header = None,
    names = columnHeadings, 
    parse_dates = [["Date", "Time"]],
    date_parser = lambda x: datetime.datetime.strptime(x, '%Y%m%d %H%M'), 
    index_col = "Date_Time",
    sep=' ')

我使用您的代码遇到了问题,但是以下情况对我有用,应该对您进行一些次要调整,重要的一步是使用Strptime从字符串中创建DateTime:p>导入数据:

minData = pd.read_csv(r'c:data.txt', parse_dates = [[0,1]], header=None,sep=' ')
minData
Out[17]:
            0_1        2        3        4        5         6  7  8  9  10
0  19980102 930  3.29473  3.30923  3.29473  3.29473   76119.2  4  0  0 NaN
1  19980102 931  3.29473  3.29473  3.27780  3.29473  263522.0  4  0  0 NaN
2  19980102 932  3.29473  3.29473  3.27780  3.29473  120384.0  4  0  0 NaN
3  19980102 933  3.29473  3.29473  3.26330  3.26330   82738.3  4  0  0 NaN
4  19980102 934  3.26330  3.29473  3.26330  3.27780   11169.6  4  0  0 NaN
5  19980102 935  3.29473  3.29473  3.27780  3.27780   11997.0  4  0  0 NaN
6  19980102 936  3.26330  3.29473  3.26330  3.27780  109628.0  4  0  0 NaN
[7 rows x 10 columns]
#now convert the string using datetime.datetime.strptime:
# rename the first column (may not be necessary for you)
In [20]:
minData.rename(columns={'0_1':columnHeadings[0], 1:columnHeadings[1]},inplace=True)
minData
Out[20]:
           Date        2        3        4        5         6  7  8  9  10
0  19980102 930  3.29473  3.30923  3.29473  3.29473   76119.2  4  0  0 NaN
1  19980102 931  3.29473  3.29473  3.27780  3.29473  263522.0  4  0  0 NaN
2  19980102 932  3.29473  3.29473  3.27780  3.29473  120384.0  4  0  0 NaN
3  19980102 933  3.29473  3.29473  3.26330  3.26330   82738.3  4  0  0 NaN
4  19980102 934  3.26330  3.29473  3.26330  3.27780   11169.6  4  0  0 NaN
5  19980102 935  3.29473  3.29473  3.27780  3.27780   11997.0  4  0  0 NaN
6  19980102 936  3.26330  3.29473  3.26330  3.27780  109628.0  4  0  0 NaN
[7 rows x 10 columns]
# now use strptime to format the string into a datetime object
In [21]:
import datetime
minData['Date'] = minData['Date'].apply(lambda x: datetime.datetime.strptime(x, '%Y%m%d %H%M'))
minData
Out[21]:
                 Date        2        3        4        5         6  7  8  9  
0 1998-01-02 09:30:00  3.29473  3.30923  3.29473  3.29473   76119.2  4  0  0   
1 1998-01-02 09:31:00  3.29473  3.29473  3.27780  3.29473  263522.0  4  0  0   
2 1998-01-02 09:32:00  3.29473  3.29473  3.27780  3.29473  120384.0  4  0  0   
3 1998-01-02 09:33:00  3.29473  3.29473  3.26330  3.26330   82738.3  4  0  0   
4 1998-01-02 09:34:00  3.26330  3.29473  3.26330  3.27780   11169.6  4  0  0   
5 1998-01-02 09:35:00  3.29473  3.29473  3.27780  3.27780   11997.0  4  0  0   
6 1998-01-02 09:36:00  3.26330  3.29473  3.26330  3.27780  109628.0  4  0  0   
   10  
0 NaN  
1 NaN  
2 NaN  
3 NaN  
4 NaN  
5 NaN  
6 NaN  
[7 rows x 10 columns]
#confirm that we have converted the dates:
In [22]:
minData.dtypes
Out[22]:
Date    datetime64[ns]
2              float64
3              float64
4              float64
5              float64
6              float64
7                int64
8                int64
9                int64
10             float64
dtype: object

您可以设置索引:

In [24]:
minData.set_index('Date')
Out[24]:
                          2        3        4        5         6   7   8   9   
Date                                                                            
1998-01-02 09:30:00  3.29473  3.30923  3.29473  3.29473   76119.2   4   0   0   
1998-01-02 09:31:00  3.29473  3.29473  3.27780  3.29473  263522.0   4   0   0   
1998-01-02 09:32:00  3.29473  3.29473  3.27780  3.29473  120384.0   4   0   0   
1998-01-02 09:33:00  3.29473  3.29473  3.26330  3.26330   82738.3   4   0   0   
1998-01-02 09:34:00  3.26330  3.29473  3.26330  3.27780   11169.6   4   0   0   
1998-01-02 09:35:00  3.29473  3.29473  3.27780  3.27780   11997.0   4   0   0   
1998-01-02 09:36:00  3.26330  3.29473  3.26330  3.27780  109628.0   4   0   0   
                     10  
Date                     
1998-01-02 09:30:00 NaN  
1998-01-02 09:31:00 NaN  
1998-01-02 09:32:00 NaN  
1998-01-02 09:33:00 NaN  
1998-01-02 09:34:00 NaN  
1998-01-02 09:35:00 NaN  
1998-01-02 09:36:00 NaN  
[7 rows x 9 columns]

尝试以下:

 import pandas as pd
import dateutil.parser as parser
def f(d):
    if len(d)==12:
        d = d[:9]+"0"+d[9:]
    return parser.parse(d)
columnHeadings = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Split Factor', 'Earnings', 'Dividends']
minData = pd.read_csv("table_aapl.txt", names=columnHeadings, sep=' ', parse_dates=[[0,1]], index_col=0, date_parser=f)
print minData

输出:

                        Open     High      Low    Close    Volume  
Date_Time                                                           
1998-01-02 09:30:00  3.29473  3.30923  3.29473  3.29473   76119.2   
1998-01-02 09:31:00  3.29473  3.29473  3.27780  3.29473  263522.0   
1998-01-02 09:32:00  3.29473  3.29473  3.27780  3.29473  120384.0   
1998-01-02 09:33:00  3.29473  3.29473  3.26330  3.26330   82738.3   
1998-01-02 09:34:00  3.26330  3.29473  3.26330  3.27780   11169.6   
1998-01-02 09:35:00  3.29473  3.29473  3.27780  3.27780   11997.0   
1998-01-02 09:36:00  3.26330  3.29473  3.26330  3.27780  109628.0   
                     Split Factor  Earnings  Dividends  
Date_Time                                               
1998-01-02 09:30:00             4         0          0  
1998-01-02 09:31:00             4         0          0  
1998-01-02 09:32:00             4         0          0  
1998-01-02 09:33:00             4         0          0  
1998-01-02 09:34:00             4         0          0  
1998-01-02 09:35:00             4         0          0  
1998-01-02 09:36:00             4         0          0  
[7 rows x 8 columns]

好吧,我刚刚尝试了两个参数函数,它收到了两个值的值,所以我设法做到了:

def g(d,t):
    res = []
    for dd,tt in zip(d,t):                
        date_time = dd
        if len(tt)==3:
            date_time += " 0"
        else:
            date_time += " "
        date_time += tt
        res.append(parser.parse(date_time))
    return pd.DatetimeIndex(res)

现在只称pd.read_csvpd.read_csv("table_aapl.txt", names=columnHeadings, sep=' ', parse_dates=[[0,1]], index_col=0, date_parser=g)

输出:

                        Open     High      Low    Close    Volume  
Date_Time                                                           
1998-01-02 09:30:00  3.29473  3.30923  3.29473  3.29473   76119.2   
1998-01-02 09:31:00  3.29473  3.29473  3.27780  3.29473  263522.0   
1998-01-02 09:32:00  3.29473  3.29473  3.27780  3.29473  120384.0   
1998-01-02 09:33:00  3.29473  3.29473  3.26330  3.26330   82738.3   
1998-01-02 09:34:00  3.26330  3.29473  3.26330  3.27780   11169.6   
1998-01-02 09:35:00  3.29473  3.29473  3.27780  3.27780   11997.0   
1998-01-02 09:36:00  3.26330  3.29473  3.26330  3.27780  109628.0   
                     Split Factor  Earnings  Dividends  
Date_Time                                               
1998-01-02 09:30:00             4         0          0  
1998-01-02 09:31:00             4         0          0  
1998-01-02 09:32:00             4         0          0  
1998-01-02 09:33:00             4         0          0  
1998-01-02 09:34:00             4         0          0  
1998-01-02 09:35:00             4         0          0  
1998-01-02 09:36:00             4         0          0  
[8 rows x 8 columns]

最新更新