如何按第一列比较两个CSV，看看是否有任何重复使用Python而不是熊猫

我正在尝试读取两个CSV文件，并比较第一列(ID(，看看是否存在重复。如果它们重复，我必须告诉用户哪些ID是重复的，并且只将唯一ID的值复制到第一个csv中。

first csv = transactions_ledgers.csv(包含列标题(，其中存储了一些值。列[0]是ID。

second csv = transactions_q1.csv(无列标题(列[0]上的值是ID的

所以我需要阅读第二个CSV，看看第一个CSV上是否有重复。如果有，我需要打印重复的ID，并且只将唯一的ID从csv2附加到csv1。

不幸的是，我不能使用熊猫；它必须使用Python。

#Using set
with open('transactions_q1.csv') as check_file:
#header = next(check_file)
check_set = set([row.split(',')[0].strip().upper() for row in check_file])
with open('transactions_q2.csv', 'r') as in_file, open('unique_transaction_ledgers.csv', 'w') as out_file:
for line in in_file:
if line.split(',')[0].strip().upper() in check_set:
out_file.write(line)```
# Second Option
with open('transaction_ledger.csv','r') as file:
read = csv.reader(file)
header=next(file)
for line in read:
ids=line[0]
with open('transactions_q2.csv','r') as file:
read = csv.reader(file)
header=next(file)
for line in read:
ids2=line[0]
if ids == ids2:
print("oh no duplicates")

以下是我试图结合@craigb回答我最初问题的函数。

我正在共享的代码是一个最终完成以下任务的函数：

询问用户想要执行什么事务：统计或导入(统计正在工作(。当他们选择导入时，需要做两件事：
检查用户试图导入的文件是否存在，如果不存在，则告诉用户该文件不存在，用户可以重试
如果文件确实存在，则读取其内容，以便能够将ID列(唯一标识符(与transaction_bledgers.csv(存储所有以前事务的主文件(中的ID列进行比较。如果导入的CSV包含重复的事务，则告诉用户哪些事务ID是重复的，并仅将唯一事务添加到transaction_legers.CSV中，并提示用户生成或导出到transactions_ledgers.CSV中的唯一事务总数
有两个选项transaction_bledgers.csv已经存在(它是在某人运行导入函数时创建的，或者可能不存在，并且需要使用导入文件的所有内容创建
让用户执行任意数量的事务

我当前的代码正在运行，但我面临一些不便。

文件transaction_bledgers.csv没有保留过去的数据并重置。这可能是因为我在写而不是在附录中。但我不确定这一点
在计算唯一事务时，如果事务中存在重复值，我将得到0，而不是已进行的事务总数

提前感谢您提供的任何指导！

#Create the ImportFunction
#Import dependencies
def ImportFunction():
#This function is created for the user to be able to import new files into the main file and check for duplicate transactions
#Set counters, lists and sets that will be used in the code
ids= set()
total_rows = 0
duplicate_lines = []
duplicate=[]

#First step is to ask the user what file they would like to import
which_file = input("Which file would you like to import?")

#Using os.path.exists check if the 'main' file transaction_ledgeres.csv exists. if it does set varible to 'yes' if not set variable to 'no'
if os.path.exists('transaction_ledger.csv'):
main_file ='yes'
else:
main_file = 'no'
#Created a conditional statement to check if the "user choice" file existis or not
if os.path.exists(which_file):
is_file='yes'
else:
#Prompt the user the file doesnt exist in this file path
print(f"This file '{which_file}' does not exist.")
is_file='no'

#Created a conditional statement if the file does or doesn't exist to perform actions
if main_file == 'yes':    
#If true then open the file and perform actions
if is_file == 'yes':
#Second step is to load the file into python. 
with open(which_file,'r') as file:
user_file = file.read()
#print(user_file)             
print("File Imported succesfully, checking for possible duplicate transactions...")    
#Reading in the transaction_ledger.csv Since its true
with open('transaction_ledger.csv','r') as file:
trans_file = csv.reader(file)
header = next(trans_file) #This will skip the headers on CSV to be able to compare ID in column one and see if there is duplicate values
#Created loop with the help @craigb anwser on my quesiton on stackover overflow
for i in trans_file:
ids.add(i[0].strip().upper())
#NEW FILE WITH UNIQUE TRANSACTION CONTENT ONLY NO DUPLICATES with the help @craigb anwser on my quesiton on stackover overflow
with open(which_file,'r') as file, open('unique_transaction_ledgers.csv', 'w') as out_file,open('duplicated_transaction_ledgers.csv', 'w') as duplicated_file:
reader = csv.reader(file)
writer = csv.writer(out_file)
writer2 = csv.writer(duplicated_file)
for line in reader:
if line[0].strip().upper() in ids:
total_rows += 1
writer.writerow(line)
duplicate.append(line[0])
print(f"The following transactions have already been recorded: {duplicate} ")
print(f"Number of current transactions:{total_rows-len(duplicate)}")
for line in reader:
if line[0].strip().upper() not in ids:
total_rows += 1
duplicate_lines.append(line[0])
writer2.writerow(line)

print(f"The following transactions have already been recorded: {line} ")
print(f"Number of current transactions:{total_rows-len(duplicate_lines)}")# this is always zero in the terminal and it should not be
#If file doesn't exist set flag to true
else:
flag = True
#If the main file does not exist then create it and perform the actions needed        
if main_file == 'no':

#Open the user file and read it to be able to append to the new empty file transaction_ledger.csv
with open(which_file,'r') as old_file:
reader_obj = csv.reader(old_file) #read the current csv file
with open('transaction_ledger.csv', 'w') as new_file:
writer_obj = csv.writer(new_file, delimiter=",",lineterminator = "n")
header = ['ID', 'COMPANY NAME', 'DATE', 'AMOUNT', 'STATUS']
writer_obj.writerow(header)
for data in reader_obj:
total_rows += 1
#loop through the read data and write each row in transaction_Ledger.csv
writer_obj.writerow(data)
print("New file created and filled in with imported file data")
print(f"Number of current transactions:{total_rows}")


#Create boolean value for while loop name it flag
flag = True
#Created while loop
while flag:
#Ask the user what action they want to perform. 
user_selection = input("What would you like to perform (import/statistics) or end the simulation?")
# Conditional - if they choose stats then call the function statistics
if user_selection.strip().lower() == 'statistics':
statistics()
#Ask user if they want to run another analysis 
repeat1 = input('Would you like to run any further analyses (yes/no)?' )
if 'y' in repeat1.strip().lower():
#ask user what they want to do
next_step = input("What would you like to perform (import/statistics)?")
if next_step.lower().strip() == 'statistics':
#if statistics than run the statistics function
statistics()
#keep flag true to loop again if the user wants to 
flag = True

#If import run the ImportFunction and keep flag true
elif next_step.lower().strip() == 'import':
ImportFunction()
flag = True

#If they don't want to continue then end the simulation and turn flag False        
else:
print("Thanks for using this simulation")
flag = False

#Conditional -  if they choose the import function then call the ImportFunction
elif user_selection.lower().strip() == 'import':
ImportFunction()
#Ask user if they want to run another analysis 
repeat = input('Would you like to run any further analyses (yes/no)?' )
if 'y' in repeat.strip().lower():
flag = True
else:
print("Thanks for using this simulation")
flag = False
elif 'e' in user_selection[0].lower().strip():
print("Thanks for using this simulation")
flag = False
#If the user inputs an invalid operation then prompt them that the program doesn't perform the action and ask them to try again
else:
print("Sorry, this program can only run the following two functions: 1. Import 2. Statistics or you can end the simulation")
flag = True
'''

感谢您编写一些代码。第一个选项按原样工作，尽管它写入两个文件共用的ID。这是逻辑颠倒的代码——它只写ID在transactions_q2.csv中但不在transactions_q1.csv中的行(只需用not in替换in(：

#Using set
with open('transactions_q1.csv') as check_file:
check_set = set([row.split(',')[0].strip().upper() for row in check_file])
with open('transactions_q2.csv', 'r') as in_file, open('unique_transaction_ledgers.csv', 'w') as out_file:
for line in in_file:
if line.split(',')[0].strip().upper() not in check_set:
out_file.write(line)

使用csv模块的第二个选项不会从第一个文件中收集ID。以下是一些构建集合并像第一个选项一样检查它的更改(我删除了header()代码-如果有头，就把它放回原处；我希望它能像选项#1一样工作(：

# Second Option
ids = set()
with open('transactions_q1.csv','r') as file:
read = csv.reader(file)
for line in read:
ids.add(line[0].strip().upper())
with open('transactions_q2.csv','r') as file, open('unique_transaction_ledgers.csv', 'w') as out_file:
reader = csv.reader(file)
writer = csv.writer(out_file)
for line in reader:
if line[0].strip().upper() not in ids:
writer.writerow(line)

你的第三个选择还不完整，很难理解你想做什么。

相关内容

最新更新

热门标签：