How to update a CSV file with pandas without adding duplicates

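I'm fetching some data from the web, and it takes a while. In case anything goes wrong, I've been saving the data to a CSV file at regular intervals.

However, each save just appends a fresh copy of the whole DataFrame to the CSV file, so the file ends up full of duplicates.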

df.to_csv('data.csv', mode='a', header=False)

is the command I use to save my progress.

Thanks for reading.

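IIUC, you have a DataFrame that you append to over time, and you want to back it up periodically.

There are a couple of approaches you can try:

If writing the file is fast, don't append; just write the complete DataFrame every time (in which case writing the header can be useful):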

df.to_csv('data.csv', header=False)  # or header=True
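
As a rough, self-contained sketch of this first approach: the column names, the sample rows, and the temporary file path below are made up for illustration, and `index=False` is an optional choice to keep the row index out of the file. Because `to_csv` uses `mode='w'` by default, each checkpoint replaces the previous file instead of adding to it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the data collected so far.
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Temporary location used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "data.csv")

# First checkpoint: the default mode='w' creates or replaces the file.
df.to_csv(path, index=False)

# More rows arrive (hypothetical data), then a second checkpoint
# overwrites the file with the complete DataFrame.
df = pd.concat([df, pd.DataFrame({"id": [3], "value": ["c"]})], ignore_index=True)
df.to_csv(path, index=False)

# Read the file back to check that each row appears only once.
saved = pd.read_csv(path)
```

The trade-off is that every checkpoint rewrites all rows, so this is only attractive while the DataFrame stays small enough to write quickly.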

Keep track of which rows you have already written, and append only the new ones:

# (i) First time write the complete dataframe
df.to_csv('data.csv', header=False)  # or header=True
# (ii) store the length of the dataframe at that point
lines_written = len(df.index)
# More data is being added to the dataframe from the web
# (iii) append new lines to CSV file
df.iloc[lines_written:].to_csv('data.csv', mode='a', header=False)
# (iv) update the line counter
lines_written = len(df.index)
# repeat steps (iii) and (iv)
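
The steps above could be wrapped in a small helper, sketched here under a few assumptions: rows are only ever added to the end of the DataFrame, rows already written are never modified, and the name `append_new_rows`, the sample data, and the temporary path are all hypothetical. Unlike the snippet above, this version writes a header on the first write and drops the index.

```python
import os
import tempfile

import pandas as pd


def append_new_rows(df, path, rows_written):
    """Write df.iloc[rows_written:] to path and return the updated row count."""
    new_rows = df.iloc[rows_written:]
    if not new_rows.empty:
        # The first write creates the file with a header; later writes append.
        first_write = rows_written == 0
        new_rows.to_csv(
            path,
            mode="w" if first_write else "a",
            header=first_write,
            index=False,
        )
    return len(df.index)


# Temporary location and sample rows used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

rows_written = append_new_rows(df, path, 0)

# More data arrives, then only the new row is appended.
df = pd.concat([df, pd.DataFrame({"id": [3], "value": ["c"]})], ignore_index=True)
rows_written = append_new_rows(df, path, rows_written)

# Calling again with no new rows writes nothing.
rows_written = append_new_rows(df, path, rows_written)

saved = pd.read_csv(path)
```

If the script can crash and restart, the counter is lost with it, so a restart would need to recover the count some other way, for example from the number of rows already in the file.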
