Using Python to import a .csv into PostgreSQL: eliminating duplicates

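A project I'm working on has a .csv file that is updated every 10 minutes. I want to load that data into SQL each time the file updates. I already have a PowerShell script watching the FTP folder that the .csv is delivered to. The watchdog PowerShell script triggers a batch file that renames the .csv to a fixed name, imports it into SQL, and then deletes it. The code below successfully imports the values from the .csv into the SQL table. All that's left is to handle duplicates when the batch file runs, so that they aren't added to the table.

Python code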

import csv
import pyodbc

# connect to the database through the ODBC DSN
print("Establishing Database connection...")
con = pyodbc.connect('DSN=testdatabase')
cursor = con.cursor()
print("...Connected to database.")

# read file and copy data into analysis server table
print("Reading file contents and copying into database...")
# raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\Users\CurrentUser\Desktop\test1.csv', newline='') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skips the header row
    for row in readCSV:
        cursor.execute("INSERT INTO testtable (id, year, month, day) VALUES (?, ?, ?, ?)",
                       row[0], row[1], row[2], row[3])
con.commit()  # commit once after all rows have been inserted
print("...Completed reading file contents and copying into database.")

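The SQL table will keep receiving data without ever being truncated, so doing something with MERGE might work well at first but would bog down quickly after a few days, because the code would have to compare the .csv against more and more data. My idea is to save the last row of the initial .csv to a separate file so it can be recalled later. On the next 10-minute import, I would recall that information and compare it against the new .csv, starting from the bottom. The first cell is a timestamp, so for the comparison I was thinking of borrowing this from another Stack Overflow question, "How to compare two timestamps in Python?":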

from datetime import datetime
timestamp1 = "Feb 12 08:02:32 2015"
timestamp2 = "Jan 27 11:52:02 2014"
t1 = datetime.strptime(timestamp1, "%b %d %H:%M:%S %Y")
t2 = datetime.strptime(timestamp2, "%b %d %H:%M:%S %Y")
difference = t1 - t2
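
The "save the last timestamp to a separate file" idea described above could look roughly like this. It is only a sketch under assumptions: the helper names `save_last_timestamp` and `load_last_timestamp` and the sidecar file are hypothetical, and the format string is the one the question gives for its own timestamps.

```python
from datetime import datetime
from pathlib import Path

# Timestamp format as stated in the question
TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def save_last_timestamp(path, ts):
    """Write the newest imported timestamp to a small text file (hypothetical helper)."""
    Path(path).write_text(ts.strftime(TS_FORMAT), encoding="utf-8")


def load_last_timestamp(path):
    """Return the stored timestamp, or None if no import has run yet."""
    p = Path(path)
    if not p.exists():
        return None
    return datetime.strptime(p.read_text(encoding="utf-8").strip(), TS_FORMAT)
```

Returning `None` on the first run lets the caller treat every row as new without special-casing a missing file.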

My timestamp format looks like this:

%Y/%m/%d %H:%M:%S.%f
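
With that format, the comparison from the borrowed snippet only needs a different format string. A small sketch follows; the two sample values are made up for illustration and the helper name `parse_ts` is hypothetical.

```python
from datetime import datetime

# Format string taken from the question
TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def parse_ts(text):
    """Parse one timestamp cell; %f accepts one to six fractional digits."""
    return datetime.strptime(text, TS_FORMAT)


# Made-up sample values, not taken from the real .csv
last_imported = parse_ts("2015/02/12 08:02:32.000")
candidate = parse_ts("2015/02/12 08:12:32.500")

# datetime objects compare chronologically, so ">" means "newer than"
is_new = candidate > last_imported
```

Comparing parsed `datetime` objects rather than raw strings avoids depending on every cell having the same number of fractional digits.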

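I should mention that the PowerShell script doesn't cope well with multiple files arriving in the FTP folder at the same time, so I have a lot of data going into a single .csv: roughly 160+ columns. That is a lot for the INSERT INTO format, although I'm perfectly willing to write out all the column headers and values if there is no better way.

In short, is there a better way to do what I'm trying to do? Has anyone else done something similar, so that I'm not reinventing the wheel? If there is no better way, does my approach sound reasonable? Thanks very much.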

import csv
import pyodbc
import time
from datetime import datetime

# connect to the database through the ODBC DSN
print("Establishing Database connection...")
con = pyodbc.connect('DSN=SQLdatabase')
cursor = con.cursor()
print("...Connected to database.")

# recall last timestamp entry in db table
t1 = datetime.strptime(cursor.execute("SELECT MAX(id) FROM test;").fetchval(), "%Y/%m/%d %H:%M:%S.%f")

# read file and copy data into table
print("Reading file contents and copying into table...")
# raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\Users\user\Desktop\test2.csv', newline='') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    columns = next(readCSV)  # header row, reused as the column list
    t2 = datetime.strptime(next(readCSV)[0], "%Y/%m/%d %H:%M:%S.%f")
    # advance past rows already in the table; the row that ends this loop
    # is treated as the last imported row and is not inserted
    while t2 < t1:
        t2 = datetime.strptime(next(readCSV)[0], "%Y/%m/%d %H:%M:%S.%f")
    # build the INSERT from the header so 160+ columns need not be typed out
    query = 'insert into test({0}) values ({1})'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))
    for data in readCSV:
        cursor.execute(query, data)
con.commit()
print("Data posted to table")

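This is what I ended up with. It works well and removes the need to put the headers in the "insert" statement. It skips a staging table and just keeps the .csv contents in an array until the rest of the code determines what needs to be added.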
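
One caveat about the skip loop: it appears to rely on the new file containing a row whose timestamp equals the stored maximum. If there is no such row, the first newer row is consumed by the `while` loop and never inserted, and if the file holds no newer rows at all, `next()` raises `StopIteration`. A possible variation, sketched here with a hypothetical helper `new_rows` and made-up sample data, filters on "strictly newer" instead:

```python
import csv
import io
from datetime import datetime

TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def new_rows(rows, last_ts):
    """Yield only rows whose first cell is strictly newer than last_ts.

    last_ts may be None (nothing imported yet), in which case every row is new.
    """
    for row in rows:
        if not row:  # skip blank lines
            continue
        if last_ts is None or datetime.strptime(row[0], TS_FORMAT) > last_ts:
            yield row


# Made-up sample standing in for the real .csv file
sample = io.StringIO(
    "id,value\n"
    "2015/02/12 08:00:00.000,1\n"
    "2015/02/12 08:10:00.000,2\n"
    "2015/02/12 08:20:00.000,3\n"
)
reader = csv.reader(sample)
columns = next(reader)                  # header row
last_ts = datetime(2015, 2, 12, 8, 10)  # stands in for SELECT MAX(id)
to_insert = list(new_rows(reader, last_ts))
```

The resulting list can then be passed to pyodbc in one call with `cursor.executemany(query, to_insert)`, using the same `query` built from the header row.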
