Parsing/formatting a 600k-line csv file with Python is extremely slow

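This is my first question since I started using Python a few days ago. I have plenty of experience with VBA and Matlab, but I'm currently trying Python as an exercise because it has a bigger (does it?) quant community behind it.

I searched a lot before asking here, and I even used some code snippets I found from other users while working on this (thank you all).

The problem: I'm reading a tick-data csv that is about 630k lines long (15 MB) so that I can extract its third column (last ticks/deals), then create a structure for them (Matlab lingo), i.e. a column vector of all ticks in DataFrame format, so that I can compute their pct_change (pandas).

I left it running overnight for about 6 hours and it was still at ~150k/630k, so I'm sure I'm doing something extremely inefficient.

I'm currently using Spyder, running on Windows 7 with 4 GB of RAM and an i3 core, and the machine isn't under heavy load.

Here is the code: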

"""""""""""""""""""""""""""""""""""""""
created on Sun Jan 03 12:59:25 2016
@author: eduardo
"""""""""""""""""""""""""""""""""""""""
import pandas as pd
import csv as csv
from datetime import datetime
startTime = datetime.now()
path = "C:Userseduardo.xystartups"
data = "C:Userseduardo.xystartupsINDV14.CSV"
delimeters = [' ', ';'] # matrix [1,2]
unique = '[]'  # empty struct ?
close = [] # empty matrix for later use
with open(data) as data: # data = csv
    for row in data: # counter to loop for inside csv
        for cols in data: # another counter for separating columns now
            for d in delimeters:cols = unique.join(cols.split(d))
            # last for loop does not need ":" ? 
            # from d to d+n, step 1
            # splits columns using "d" separators defined above
            # joins them after splitting, by a "[]" separator "space" ?
            row = cols.split(unique) # row = for each row splitted
            close.append(row[2]) # call third column of each (row) 
            # fill the empty matrix created above (close) row by row
            # with it up with a column vector of my 3rd col of the CSV
            ticks = map(int, close) # coverts strings to integers
            # format the column vector above to pandas DataFrame format
            deals = pd.DataFrame(ticks)
            # call pct_change function of pandas 
            daily_returns = deals.pct_change(periods=1)
            print(daily_returns)
    data.close() # closes csv file 
    # creates a new file ("W"rite), returns.csv
    dataCSV = open('returns.csv', 'w') 
    for line in daily_returns: # de for each line in the daily returns struct
        dataCSV.write(line) # writes them in the new csv file 
    dataCSV.close() # closes new file
    datetime.now() - startTime # time counter

csv format:

20140801 105159;57085;5
20140801 105206;57085;5

I think this will work if you tell pandas to parse your dates and pass the separator:

In [7]:
import pandas as pd
import io
t="""20140801 105159;57085;5
20140801 105206;57085;5"""
df = pd.read_csv(io.StringIO(t), sep=';', header=None, parse_dates=[0])
df
Out[7]:
                    0      1  2
0 2014-08-01 10:51:59  57085  5
1 2014-08-01 10:52:06  57085  5

So in your case this should work:

df = pd.read_csv(data, header=None, parse_dates=[0], sep=';')

Then you can write it back out to csv:

df.to_csv('returns.csv')

You can see that pandas sniffed the dtypes correctly:

In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null datetime64[ns]
1    2 non-null int64
2    2 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 64.0 bytes

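If you just want to extract the third column, you can do it a lot more simply: use the csv lib, split on the semicolon, and take the second element, which corresponds to the third column in your own splitting code (yours also splits on the space inside the timestamp):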

import csv
from operator import itemgetter
with open(data) as data: # data = csv
    for ith in map(itemgetter(1), csv.reader(data, delimiter=";")):
        print(ith)

Which will output:

57085
57085

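In your own code you create the DataFrame inside the loops, so nothing but the data from the last call is ever stored; even if it did finish, you would basically have no data to write. If you actually want a DataFrame, just split on the semicolon and extract the second column with df = pd.read_csv(data, sep=";", usecols=[1], header=None), which would give you: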

       1
0  57085
1  57085
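As a rough illustration of the point about building the DataFrame inside the loop, here is a minimal sketch that makes a single pass over the lines and constructs the DataFrame once, after the loop. The sample rows and the helper names last_prices and tick_returns are made up for the example; they are not from the question.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real tick file.
SAMPLE = (
    "20140801 105159;57085;5\n"
    "20140801 105206;57090;5\n"
    "20140801 105210;57080;2\n"
)


def last_prices(handle):
    """Collect the price (second ';'-separated field) from each line."""
    prices = []
    for line in handle:  # a single loop: one pass over the file
        fields = line.rstrip("\n").split(";")
        prices.append(int(fields[1]))
    return prices


def tick_returns(handle):
    """Build the DataFrame once, after the loop, then take pct_change."""
    deals = pd.DataFrame(last_prices(handle))
    return deals.pct_change(periods=1)


returns = tick_returns(io.StringIO(SAMPLE))
print(returns)
```

Rebuilding the DataFrame and recomputing pct_change for every row means the work grows with the square of the row count, which is consistent with the run never finishing overnight.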

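Also, unique = '[]' # empty struct ? creates a string, not an empty struct. unique.join(cols.split(d)) joins all the pieces back together separated by "[]", and then row = cols.split(unique) splits them apart again, which gives you exactly what you already had when you did cols.split(d).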
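To make the join/split round trip concrete, the sketch below runs the question's delimiter loop on one sample line and compares it with a single split on either delimiter. Using re.split here is my own substitution for illustration, not something the question or the answers use.

```python
import re

line = "20140801 105159;57085;5"

# What the question's loop does: replace each delimiter with the marker "[]",
# then split on that marker.
marker = "[]"
cols = line
for d in [" ", ";"]:
    cols = marker.join(cols.split(d))
via_join = cols.split(marker)

# The same fields in one step, splitting on either a space or a semicolon.
via_regex = re.split(r"[ ;]", line)

print(via_join)
print(via_regex)
```

Both lists hold the same four fields, so the "[]" marker adds work without changing the result, and index 2 is the price in either case.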

Also, if you just want an array of the second column, you can use numpy.genfromtxt:

import numpy as np
arr = np.genfromtxt(data, usecols=[1], delimiter=";")
print(arr)

Which would give you:

[57085.  57085.]

Or, if you want the percent change:

pct_chnge = np.diff(arr) / arr[:-1]
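As a quick sanity check, this sketch compares the NumPy expression with pandas' pct_change on a few made-up prices (the values are hypothetical; the real array would come from np.genfromtxt).

```python
import numpy as np
import pandas as pd

# Hypothetical prices standing in for the array read from the file.
arr = np.array([57085.0, 57090.0, 57080.0])

# NumPy: (p[t] - p[t-1]) / p[t-1], one element shorter than the input.
pct_np = np.diff(arr) / arr[:-1]

# pandas: the same ratio, but the first element is NaN so the length is kept.
pct_pd = pd.Series(arr).pct_change()

print(pct_np)
print(pct_pd.to_numpy())
```

The two agree once the leading NaN from pandas is dropped, so the choice mostly comes down to whether you want an array or a Series downstream.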

If you want to work with the full DataFrame and incorporate the dates, for example to get the percent change per day:

import pandas as pd
df = pd.read_csv(data, sep=";", header=None,parse_dates=[0])
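One possible way to go from ticks to a per-day percent change, sketched under assumptions: the column names timestamp, price and volume are labels I made up, the sample rows are invented, and "last traded price of the day" is only one reasonable definition of a daily close.

```python
import io

import pandas as pd

# Hypothetical ticks spanning two days, in the same layout as the question's file.
SAMPLE = """20140801 105159;57085;5
20140801 105206;57090;5
20140804 090001;57200;3
20140804 090005;57661;1"""

# Column names are assumptions for readability; the file itself has no header.
df = pd.read_csv(io.StringIO(SAMPLE), sep=";", header=None,
                 names=["timestamp", "price", "volume"])
# Parse the timestamp with an explicit format rather than relying on inference.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y%m%d %H%M%S")

# Last traded price of each calendar day, then day-over-day percent change.
daily_close = df.groupby(df["timestamp"].dt.date)["price"].last()
daily_returns = daily_close.pct_change()
print(daily_returns)
```

Passing an explicit format to pd.to_datetime avoids depending on how a given pandas version infers the compact "YYYYMMDD HHMMSS" layout.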

Thanks to everyone who responded.

Here is how it ended up with your help:

# -*- coding: utf-8 -*-
"""
Created on Mon Jan 04 09:21:10 2016
@author: eduardo
"""
import csv
from datetime import datetime
from operator import itemgetter

import pandas as pd

startTime = datetime.now()
data = "C:Userseduardo.xystartupsINDV14.CSV" # address
ticks = [] # empty list to hold the prices
with open(data) as f: # f = original csv; the with block closes it on exit
    # take the field after the first ";" in each row and convert "str" to "int"
    for ith in map(int, map(itemgetter(1), csv.reader(f, delimiter=";"))):
        # fill the empty list above row by row with "int"s
        ticks.append(ith)
# convert the list above ("ticks") to a pandas DataFrame
prcnt_chng_df = pd.DataFrame(ticks)
# pandas percent change (periods=1); the ticks are ints, so there are no gaps to pad
prcnt_chng = prcnt_chng_df.pct_change(periods=1)
# write the pct changes starting on the 2nd row (the 1st is NaN)
# no index since I don't need it and it would make the file bigger
prcnt_chng[1:].to_csv('indv14_rets.csv', index=False)
timer = datetime.now() - startTime
print(timer) # elapsed time

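I managed to get what I wanted, so I can now load this new csv file into other scripts to compute the spread density of the tick-by-tick returns for the futures contract over the whole month.

All of this finished in 0:00:02.996000.

Best regards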
