pandas to_csv overwrite: preventing data loss



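I have a script that continually updates a DataFrame and saves it to disk, overwriting the old CSV file. I found that if the program is interrupted during the save call, df.to_csv("df.csv"), all the data is lost and df.csv is left empty except for the column index.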

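I could probably work around this by first saving the data to a temporary file such as df.temp.csv and then replacing df.csv with it. But is there a short, Pythonic way to make the save "atomic" and prevent data loss? This is the stack trace I get when I interrupt during the save call.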

Traceback (most recent call last):
  File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1531, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 938, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Users/user/test.py", line 49, in <module>
    d.to_csv("out.csv", index=False)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1344, in to_csv
    formatter.save()
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1551, in save
    self._save()
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1666, in _save_chunk
    quoting=self.quoting)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 1443, in to_native_types
    return formatter.get_result_as_array()
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2171, in get_result_as_array
    formatted_values = format_values_with(float_format)
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2157, in format_values_with
    for val in values.ravel()[imask]])
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2108, in base_formatter
    return str(v) if notnull(v) else self.na_rep
  File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 250, in notnull
    res = isnull(obj)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 73, in isnull
    def isnull(obj):
  File "_pydevd_bundle/pydevd_cython.pyx", line 937, in _pydevd_bundle.pydevd_cython.ThreadTracer.__call__ (_pydevd_bundle/pydevd_cython.c:15522)
  File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/_pydev_bundle/pydev_is_thread_alive.py", line 14, in is_thread_alive
    def is_thread_alive(t):
KeyboardInterrupt

You can create a context manager to handle the atomic overwrite:

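import contextlib
import os


@contextlib.contextmanager
def atomic_overwrite(filename):
    # Write to a temporary file next to the target so the original stays intact.
    temp = filename + '~'
    with open(temp, "w") as f:
        yield f
    # Reached only if no exception was raised inside the with block.
    # os.rename replaces an existing file atomically on POSIX; on Windows it fails
    # if the target exists, so use os.replace (Python 3.3+) there.
    os.rename(temp, filename)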

The to_csv method on a pandas DataFrame accepts a file object instead of a path, so you can use:

with atomic_overwrite("df.csv") as f:
    df.to_csv(f)

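The temporary file name I chose is the requested file name with a tilde appended. You can of course change the code to use something else if you prefer. I'm also not sure exactly which mode the file should be opened in; you may need "wb" instead of "w".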
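For what it's worth, here is a slightly more defensive sketch of the same temp-file-then-rename idea. It assumes Python 3.3+ (for os.replace); the name atomic_write and its parameters are hypothetical and not part of pandas. The temporary file is created in the target's directory with tempfile.mkstemp, flushed and fsynced before the rename, and removed if anything (including Ctrl-C) interrupts the write:

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_write(path, mode="w", **open_kwargs):
    """Yield a temporary file; move it over `path` only if the block succeeds.

    Hypothetical helper for illustration, not a pandas API.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, mode, **open_kwargs) as f:
            yield f
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic replacement on POSIX
    except BaseException:
        # Also covers KeyboardInterrupt: drop the partial file, keep the original.
        with contextlib.suppress(FileNotFoundError):
            os.remove(tmp_path)
        raise
```

With a DataFrame this would be used as `with atomic_write("df.csv", newline="") as f: df.to_csv(f)`. Note that mkstemp creates the file with owner-only permissions, so the replaced file may end up with different permissions than the original.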

The best you can do is implement a signal handler (signal module) that defers terminating the program until the last write operation has completed.

Something along these lines (a rough sketch):

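import signal
import sys
import time

import pandas as pd

stop_requested = False  # set by the handler, checked between writes


def handler(signum, frame):
    # Only record the request, so a write in progress is never cut short.
    global stop_requested
    stop_requested = True


signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

df = pd.DataFrame({"a": [1, 2, 3]})  # placeholder data for illustration
while not stop_requested:
    df.to_csv("df.csv")  # runs to completion even if a signal arrives
    time.sleep(1)
sys.exit(1)  # the latest data has been written at this point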
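The idea of postponing termination until the write has finished can also be sketched as a small context manager that holds back Ctrl-C only around the critical section. The name defer_sigint is hypothetical, and this is a minimal sketch that handles SIGINT only:

```python
import contextlib
import signal


@contextlib.contextmanager
def defer_sigint():
    """Hold back Ctrl-C until the enclosed block has finished.

    Hypothetical helper for illustration; must be used in the main thread.
    """
    received = []
    # Temporarily replace the SIGINT handler with one that only records the signal.
    previous = signal.signal(
        signal.SIGINT, lambda signum, frame: received.append(signum)
    )
    try:
        yield
    finally:
        # Restore the earlier handler, then deliver the postponed interrupt.
        signal.signal(signal.SIGINT, previous)
        if received:
            raise KeyboardInterrupt
```

Usage would look like `with defer_sigint(): df.to_csv("df.csv")`. This only works in the main thread, since Python only allows signal handlers to be installed there, and it does not help against SIGKILL or a power loss.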
