Merge CSV files into one master file



I am facing the following challenge.

I have about 400 files that I have to merge into one master file, but there is a problem: the files have different headers, and when I try to merge them, the data ends up in different rows depending on the columns.

Example: suppose I have two files, C1 and C2. File C1.csv:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2

and file C2.csv:

name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3

So I have these two files, and when I merge them I want the output to look like this:

name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3

For the merge I use the following code:

import glob
import pandas as pd

directory = 'C:/Test'  # specify the directory containing the files
filelist = sorted(glob.glob(directory + '/*.csv'))  # read all CSV files in the directory and store them as a list
consolidated = pd.DataFrame()  # create a new empty dataframe for consolidation

for file in filelist:               # iterate through each of the files
    df1 = pd.read_csv(file)         # create df using the file
    df1col = list(df1.columns)      # save columns to a list
    df2 = consolidated              # set the consolidated result as your df2
    df2col = list(df2.columns)      # save columns from the consolidated result as a list
    commoncol = [i for i in df1col for j in df2col if i == j]  # check both lists for common column names
    # print(commoncol)
    if commoncol == []:             # in the first iteration, consolidated is empty, so there are no common columns
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate with no common columns, replacing null values with 0
    else:
        consolidated = df1.merge(df2, how='outer', on=commoncol).fillna(value=0)  # merge both dfs on the common columns and replace null values with 0
    # print(consolidated)          # optionally, check the consolidated df at each iteration

# write the consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)

But it does not merge columns that hold the same data, as in the output shown above.
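One way to sketch this in pandas is to normalize each file to the final header before concatenating, instead of merging incrementally. This is a minimal sketch assuming only the two header layouts from the example exist (the `normalize` helper and its rules are illustrative, not part of the original code):

```python
import pandas as pd

# Sample files from the question, written here so the sketch is runnable.
with open("C1.csv", "w") as f:
    f.write("name,phone-no,address\n"
            "zach,6564654654,line1\n"
            "daniel,456464564,line2\n")
with open("C2.csv", "w") as f:
    f.write("name,last-name,phone-no,add-line1,add-line2,add-line3\n"
            "jorge,aggarwal,65465464654,line1,line2,line3\n")

FINAL_COLS = ["name", "phone-no", "address"]

def normalize(df):
    # Assumption: only the two layouts shown in the question exist.
    if "last-name" in df.columns:
        # Combine first and last name into a single name field.
        df["name"] = df["name"] + " " + df["last-name"]
    if "add-line1" in df.columns:
        # Join the three address lines into a single address field.
        df["address"] = df["add-line1"] + "-" + df["add-line2"] + "-" + df["add-line3"]
    return df[FINAL_COLS]

frames = [normalize(pd.read_csv(f)) for f in ["C1.csv", "C2.csv"]]
consolidated = pd.concat(frames, ignore_index=True)
consolidated.to_csv("consolidated.csv", index=False)
```

Because every frame is reduced to the same three columns first, `pd.concat` simply stacks the rows and no outer-join artifacts appear.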

From the two example files, you know the final (lowest-common-denominator) header for the output, and you know what one of the larger headers looks like.

My take on this is that the "other" header types need to be mapped onto the final header, e.g. joining add-line1 through add-line3 into a single address field. We can use the csv module to read and write row by row, sending each row to the appropriate consolidator (mapping) based on the input file's header.

The csv module provides DictReader and DictWriter, which make it very convenient to handle fields you know by name; in particular, the DictWriter() constructor has an extrasaction="ignore" option, which means that if you tell the writer your fields are:

Col1, Col2, Col3

and then you pass it a dict:

{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"} 

it will ignore Col4 and write only Col1-3:

import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
import csv

def consolidate_add_lines_1_to_3(row):
    # Join the three address lines into the single address field.
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row

# Add other consolidators here...
# ...

Final_header = ["name", "phone-no", "address"]

f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()

for fname in ["file1.csv", "file2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)
    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)
        # Add conditions for other consolidators here...
        # ...
        writer.writerow(row)
    f_in.close()

f_out.close()

If there is more than one kind of "other" header, you will need to work out what those headers are, write the extra consolidators for them, and add the conditions that trigger them inside the for row in reader loop.
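As the number of header variants grows, the chain of if-conditions can be replaced by a lookup table. This is a sketch of that idea under the same assumptions as above; the `dispatch` helper and the field sets used as keys are illustrative:

```python
def consolidate_add_lines_1_to_3(row):
    # Join the three address lines into the single address field.
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row

# Map the set of variant-specific fields to the consolidator that handles
# rows containing them. Extend this dict as new header types turn up.
CONSOLIDATORS = {
    frozenset(["add-line1", "add-line2", "add-line3"]): consolidate_add_lines_1_to_3,
    # frozenset([...]): another_consolidator,  # hypothetical further variants
}

def dispatch(row):
    for fields, consolidator in CONSOLIDATORS.items():
        if fields.issubset(row):  # all variant fields are present in this row
            return consolidator(row)
    return row  # row is already in the final layout
```

Inside the for row in reader loop you would then call `writer.writerow(dispatch(row))`, and adding a new header type becomes a one-line change to the dict.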
