pandas crashes when repeatedly reading from and writing to a database



I have the following code:

import pandas as pd
import time 
import numpy as np
import os 
dir_path = os.path.dirname(os.path.abspath(__file__))
db_path = 'sqlite:///' + dir_path + '/example.db'

df = pd.DataFrame(np.random.normal(0, 1, (500000, 10)), columns=['col%d' % i for i in range(10)])
writes = []
reads = []
for i in range(1000):
    if os.path.exists(dir_path + '/example.db'):
        print 'db removed from disk'
        os.remove(dir_path + '/example.db')

    t0 = time.time()
    df.to_sql('example', db_path)
    t1 = time.time()
    print 'time to write:', t1 - t0
    df = pd.read_sql('SELECT * FROM example', db_path)
    t2 = time.time()
    print "time to read:", t2 - t1

    writes.append(t1 - t0)
    reads.append(t2 - t1)
    print 'Average writing time', np.mean(writes), ' n times=', len(writes)
    print 'Average reading time', np.mean(reads), ' n times=', len(reads)

It crashes with:

(mlcne_venv) [lpuggini@machinelearn-1 Desktop]$ python read_write_time.py
db removed from disk
time to write: 16.2156860828
time to read: 3.31190609932
Average writing time 16.2156860828  n times= 1
Average reading time 3.31190609932  n times= 1
db removed from disk
time to write: 15.6938099861
time to read: 3.74449300766
Average writing time 15.9547480345  n times= 2
Average reading time 3.52819955349  n times= 2
db removed from disk
Traceback (most recent call last):
  File "read_write_time.py", line 23, in <module>
    df.to_sql('example', db_path)
  File "/home/lpuggini/VirtualEnvs/mlcne_venv/lib/python2.7/site-packages/pandas/core/generic.py", line 1201, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/home/lpuggini/VirtualEnvs/mlcne_venv/lib/python2.7/site-packages/pandas/io/sql.py", line 470, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/home/lpuggini/VirtualEnvs/mlcne_venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1148, in to_sql
    table.insert(chunksize)
  File "/home/lpuggini/VirtualEnvs/mlcne_venv/lib/python2.7/site-packages/pandas/io/sql.py", line 642, in insert
    keys, data_list = self.insert_data()
  File "/home/lpuggini/VirtualEnvs/mlcne_venv/lib/python2.7/site-packages/pandas/io/sql.py", line 609, in insert_data
    "duplicate name in index/columns: {0}".format(err))
ValueError: duplicate name in index/columns: cannot insert level_0, already exists
(mlcne_venv) [lpuggini@machinelearn-1 Desktop]$ 

Could this be a pandas bug?

Every subsequent iteration gives you an additional index column. If you add a print df.head() right after reading from the database, you get the following:

1st iteration:
index      col0      col1      col2      col3      col4      col5  
0      0  1.562089  2.297787 -0.490346  0.935589 -0.151018 -0.920552   
1      1 -0.868018  1.212738  1.386166 -0.422851 -0.243167 -0.362809   
2      2 -0.420472  0.069789  2.706735  0.814905  1.095517  0.570413   
3      3 -0.341983 -0.727765  0.549108  0.785910  0.687795  2.158346   
4      4 -0.476875  0.525177 -0.532743  0.382937 -0.824136  1.276228   
col6      col7      col8      col9  
0 -1.179577  1.746765  0.468008  0.751347  
1  0.810910 -0.386281 -0.512447 -1.124378  
2  2.764994  2.735384 -0.449755 -0.393979  
3 -0.895257 -0.396221 -1.149718 -0.382153  
4  0.177278 -0.212614  0.185388  0.688391  
2nd iteration:
level_0  index      col0      col1      col2      col3      col4      col5  
0        0      0  1.562089  2.297787 -0.490346  0.935589 -0.151018 -0.920552   
1        1      1 -0.868018  1.212738  1.386166 -0.422851 -0.243167 -0.362809   
2        2      2 -0.420472  0.069789  2.706735  0.814905  1.095517  0.570413   
3        3      3 -0.341983 -0.727765  0.549108  0.785910  0.687795  2.158346   
4        4      4 -0.476875  0.525177 -0.532743  0.382937 -0.824136  1.276228   
col6      col7      col8      col9  
0 -1.179577  1.746765  0.468008  0.751347  
1  0.810910 -0.386281 -0.512447 -1.124378  
2  2.764994  2.735384 -0.449755 -0.393979  
3 -0.895257 -0.396221 -1.149718 -0.382153  
4  0.177278 -0.212614  0.185388  0.688391  

Then it crashes: on the third write, to_sql tries to insert the unnamed index as level_0 again, but a level_0 column already exists, which is exactly the duplicate name in index/columns error in the traceback.

Just change the write command to

df.to_sql('example', db_path, index=False)

so that it does not create a new index column on every round trip.
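
For reference, here is the loop from the question with that single change applied (just a sketch, not re-timed on the original machine):

import pandas as pd
import time
import numpy as np
import os

dir_path = os.path.dirname(os.path.abspath(__file__))
db_path = 'sqlite:///' + dir_path + '/example.db'

df = pd.DataFrame(np.random.normal(0, 1, (500000, 10)), columns=['col%d' % i for i in range(10)])

writes = []
reads = []
for i in range(1000):
    # start from a fresh database file on every iteration
    if os.path.exists(dir_path + '/example.db'):
        os.remove(dir_path + '/example.db')

    t0 = time.time()
    df.to_sql('example', db_path, index=False)          # the only change: don't store the index
    t1 = time.time()
    df = pd.read_sql('SELECT * FROM example', db_path)  # comes back with col0..col9 only
    t2 = time.time()

    writes.append(t1 - t0)
    reads.append(t2 - t1)
    print 'Average writing time', np.mean(writes), ' n times=', len(writes)
    print 'Average reading time', np.mean(reads), ' n times=', len(reads)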

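If you do want to keep storing the index in the table, an alternative (my own sketch, not part of the fix above) is to make the round trip symmetric by reading the stored column back in as the index; pandas stores an unnamed index under the column name 'index' by default, as the output above shows:

df.to_sql('example', db_path)                      # index stored in a column named 'index'
df = pd.read_sql('SELECT * FROM example', db_path,
                 index_col='index')                # restore it as the index, not as a data column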