How to merge data from multiple CSV files into a single CSV file in Python

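I have 12 CSV files containing salary data for each month. The naming convention is YYYYMMDD; for example, the salary data for January is stored in a CSV file named 20200131. I want to read all the CSV files and combine the salary data for all employees into a single CSV file. The column headers should be Sal_Jan, Sal_Feb, and so on, and each employee name that appears across the CSV files should be listed only once.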

The January data stored in 20200131.csv is:

Name	Salary
A	20000
B	25000

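Try this; it worked for me. I created CSV files with the file names you specified, and the output printed as shown below. This should be a complete solution with comments. If you have any other questions, feel free to ask.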

import pandas as pd
import calendar as cl
import glob

path = r'C:/Users/Akshay/Documents/Question 2'
all_files = glob.glob(path + "/*.csv")
# Sort the files so that the columns run in order from month 1 to month 12.
all_files.sort()


# Convert the month number in the file name to "Sal_Jan", "Sal_Feb", etc.
# The slice [-8:-6] is the position of the two-digit month in a file
# name such as "20200131.csv".
def f(fn):
    return "Sal_" + cl.month_abbr[int(fn[-8:-6])]


li = []
# The first column in the data frame will be the "Name" column.
# The usecols parameter selects the 0th column.
li.append(pd.read_csv(all_files[0], index_col=None, header=0, usecols=[0]))
for filename in all_files:
    # For each file, read only the salary column and rename it
    # to the month-specific column name.
    df = pd.read_csv(filename, usecols=[1], header=0, names=[f(filename)])
    li.append(df)
# concat with axis=1 lines rows up by position, so this assumes every file
# lists the same employees in the same order.
# Use the Name column as the index so that the default 0, 1, 2 index is removed.
frame = pd.concat(li, axis=1).set_index('Name')
print(frame)

After formatting the CSV files and filling in some data, this is the output:

      Sal_Jan  Sal_Feb  Sal_Mar  Sal_Apr  Sal_May  Sal_Jun
Name                                                      
A       20000    30000    30255    30510    30765    31020
B       21000    31000    31255    31510    31765    32020
C       22000    32000    32255    32510    32765    33020
D       24000    34000    34255    34510    34765    35020
E       28000    38000    38255    38510    38765    39020
F       10000    20000    20255    20510    20765    21020
G       11000    21000    21255    21510    21765    22020
H       14000    24000    24255    24510    24765    25020
I       13000    23000    23255    23510    23765    24020
J       22500    32500    32755    33010    33265    33520
K       23500    33500    33755    34010    34265    34520

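Note that the extra line between the column headers and the data is not an extra row (that is, it is not a NULL row). It is printed that way in the console because the "Name" column is the index.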

Edit: I just noticed that you provided sample files, so I reran my code with them, and this is the output:

      Sal_Jan  Sal_Feb  Sal_Mar  Sal_Apr  Sal_May  Sal_Jun  Sal_Jul  Sal_Aug  Sal_Sep  Sal_Oct
Name                                                                                          
A       10000    15000    20000    25000    30000    35000    40000    45000    50000    55000
B       10000    15000    20000    25000    30000    35000    40000    45000    50000    55000
C       10000    15000    20000    25000    30000    35000    40000    45000    50000    55000
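
The slice `fn[-8:-6]` in the code above depends on the path ending in exactly `MMDD.csv`. As a slightly more defensive sketch, the month can be parsed from the file stem instead. The helper name `month_column` is hypothetical, and it assumes every file is named `YYYYMMDD.csv` as described in the question.

```python
import calendar
from datetime import datetime
from pathlib import Path


def month_column(path):
    """Map a path like '.../20200131.csv' to a header like 'Sal_Jan'."""
    # Path.stem drops the directory and the '.csv' suffix, leaving '20200131'.
    stem = Path(path).stem
    # strptime raises ValueError if the stem is not a valid YYYYMMDD date.
    date = datetime.strptime(stem, "%Y%m%d")
    return "Sal_" + calendar.month_abbr[date.month]


print(month_column("C:/Users/Akshay/Documents/Question 2/20200131.csv"))
print(month_column("20200229.csv"))
```

In an English locale this prints Sal_Jan and Sal_Feb. A file whose name is not a valid date fails loudly with a ValueError rather than silently producing a wrong column. To write the combined frame to a file instead of printing it, `frame.to_csv(...)` with an output path of your choosing should do.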

Try merging the data frames using reduce from functools.

# Import packages
import pandas as pd
from functools import reduce

# Reproduce the data frames from your images
df1 = pd.DataFrame({"Name": ["A", "B"],
                    "Salary": [20000, 25000]})
df2 = pd.DataFrame({"Name": ["A", "B"],
                    "Salary": [21000, 26000]})
# Create a list of all data frames
dfs = [df1, df2]
# Merge on Name
df3 = reduce(lambda left, right: pd.merge(left, right, on='Name'), dfs)
# Rename the columns
df3.columns = ['Name', 'Sal_Jan', 'Sal_Feb']

An alternative that renames only specific columns:

df3 = df3.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
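
One caveat with the `reduce` approach: every frame has a column called `Salary`, so `pd.merge` has to disambiguate them with `_x`/`_y` suffixes, and once more than a few frames are merged those suffixed names can collide (as far as I know, recent pandas versions raise an error in that case). A minimal sketch that sidesteps this by renaming each salary column before merging; the frames and the `Sal_*` labels below are made-up sample data, not the asker's files.

```python
from functools import reduce

import pandas as pd

# Made-up sample frames standing in for the monthly files.
monthly = {
    "Sal_Jan": pd.DataFrame({"Name": ["A", "B"], "Salary": [20000, 25000]}),
    "Sal_Feb": pd.DataFrame({"Name": ["A", "B"], "Salary": [21000, 26000]}),
    "Sal_Mar": pd.DataFrame({"Name": ["A", "B"], "Salary": [22000, 27000]}),
    "Sal_Apr": pd.DataFrame({"Name": ["A", "B"], "Salary": [23000, 28000]}),
}

# Give each frame a unique salary column name up front.
renamed = [df.rename(columns={"Salary": label}) for label, df in monthly.items()]

# Merge pairwise on Name; no suffixes are needed because no other names overlap.
merged = reduce(lambda left, right: pd.merge(left, right, on="Name"), renamed)
print(merged)
```

Renaming first also removes the need to assign `df3.columns` by position afterwards, which is easy to get wrong when the number of files changes.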

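I don't think you need concat; you need to merge the data frames instead. If the set of employees may differ from month to month, your best option is a full outer join of the data frames.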

The command is as follows:

pd.merge(first_pd,second_pd,on='Name',how='outer')

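This works as long as your data frames always have a column named "Name". how='outer' means that if an employee name has no match in one of the data frames, the missing value is filled with NaN.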
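
To make the effect of `how='outer'` concrete, here is a small sketch in which employee C appears only in the second month. The names `first_pd` and `second_pd` follow the command above; the data is invented for illustration.

```python
import pandas as pd

# Made-up data: B is missing in February, C is missing in January.
first_pd = pd.DataFrame({"Name": ["A", "B"], "Sal_Jan": [20000, 25000]})
second_pd = pd.DataFrame({"Name": ["A", "C"], "Sal_Feb": [21000, 30000]})

# A full outer join keeps every name found in either frame.
outer = pd.merge(first_pd, second_pd, on="Name", how="outer")
print(outer)
```

B has no February entry and C has no January entry, so those cells come out as NaN. Note that the salary columns become floats as a side effect, because an integer column cannot hold NaN.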

You can walk through the whole solution in this gist: https://gist.github.com/irongraft/c12895419fa241adc03ec0635e45aebe

ciao

Try this simple solution:

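import os
import calendar
import pandas as pd

directory = r'SalaryFilesDirectoryPath'
dfresult = pd.DataFrame({})
# os.listdir returns names in arbitrary order, so sort them to keep the months in order.
for filename in sorted(os.listdir(directory)):
    # Join the directory and the file name so the file is found from any working directory.
    df = pd.read_csv(os.path.join(directory, filename))
    dfresult['Name'] = df['Name']
    # For a name like 20200131.csv, characters 4-5 hold the month number.
    dfresult['Sal_' + calendar.month_abbr[int(filename[4:6])]] = df['Salary']

dfresult.to_csv('outputname.csv')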

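P.S. This only works if your CSV files are named so that they sort alphabetically in month order; otherwise the months will be out of order. The folder must also contain only the salary CSV files, and if the CSVs contain different names, you will need to extend it.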
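
The caveats in the P.S. can be handled with a few small changes. The following is only a sketch under the question's assumptions (files named `YYYYMMDD.csv`, each with `Name` and `Salary` columns, and names unique within each file); the function name `combine_salary_files` is hypothetical and the demo data is made up. It skips non-CSV files, sorts the file names, and aligns rows by employee name so that months with different employees still line up.

```python
import calendar
import os
import tempfile

import pandas as pd


def combine_salary_files(directory):
    """Combine monthly YYYYMMDD.csv salary files into one frame indexed by Name."""
    columns = []
    # sorted() puts YYYYMMDD names in chronological order.
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".csv"):
            continue  # ignore anything that is not a CSV file
        df = pd.read_csv(os.path.join(directory, filename), index_col="Name")
        # Characters 4-5 of YYYYMMDD are the month number.
        label = "Sal_" + calendar.month_abbr[int(filename[4:6])]
        columns.append(df["Salary"].rename(label))
    # Concatenating along axis=1 aligns on the Name index (outer join by
    # default), so an employee missing from a month gets NaN there.
    return pd.concat(columns, axis=1)


# Small demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Name": ["A", "B"], "Salary": [20000, 25000]}).to_csv(
        os.path.join(tmp, "20200131.csv"), index=False)
    pd.DataFrame({"Name": ["A", "C"], "Salary": [21000, 30000]}).to_csv(
        os.path.join(tmp, "20200229.csv"), index=False)
    # A stray non-CSV file that should be ignored.
    with open(os.path.join(tmp, "readme.txt"), "w") as fh:
        fh.write("not salary data")
    result = combine_salary_files(tmp)
    print(result)
```

Calling `result.to_csv('outputname.csv')` afterwards writes the combined table, with Name as the first column. It is safer to write the output somewhere other than the input folder, so that a later run does not try to read it as a salary file.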
