How to run the same pandas operation on a batch of CSV files



I run the following code on individual CSV files:

import pandas as pd
header_names = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6']
df = pd.read_csv('some.csv', delimiter= "|", skiprows=1, names=header_names)
df['column2'].replace(['bad', 'worse'],['good', 'better'],inplace =True)
df.to_csv('new.csv', index=False)

How can I run this code on a batch of files, instead of editing the file name in the code for each one?

Edit: Would this be easier if all the CSV files I want to modify were in a single folder?

To run this logic on multiple files:

import os
import pandas as pd

def my_function(source_file, target_suffix):
    header_names = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6']
    df = pd.read_csv(source_file, delimiter="|", skiprows=1, names=header_names)
    # Assign the result back instead of calling replace(inplace=True) on a column selection
    df['column2'] = df['column2'].replace(['bad', 'worse'], ['good', 'better'])
    # Generate the output file name from the input file name and the provided suffix
    target_file = source_file.replace('.csv', '') + '_output_' + str(target_suffix) + '.csv'
    df.to_csv(target_file, index=False)

# List of input files
file_names = []
# Set the directory of your files
directory = os.path.join("c:\\", "path")
# Locate the input CSV files
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            # Keep the full path so files in subdirectories can be opened
            file_names.append(os.path.join(root, file))
# Loop over the input file names and their indexes using enumerate
for i, source_file in enumerate(file_names):
    # Call the function on each file, passing the file index as the suffix
    my_function(source_file, i)
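
If all the files sit in one folder, as the edit to the question suggests, a pathlib-based variant can be a little shorter. The sketch below is one possible arrangement, not the answerer's code: the function names clean_csv and clean_folder and the separate output folder are assumptions made for illustration. Writing the results to a different folder means that a second run will not pick up its own output files.

```python
from pathlib import Path

import pandas as pd

HEADER_NAMES = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6']


def clean_csv(source_file, output_dir):
    """Apply the replacement to one file and write the result into output_dir."""
    df = pd.read_csv(source_file, delimiter="|", skiprows=1, names=HEADER_NAMES)
    df['column2'] = df['column2'].replace(['bad', 'worse'], ['good', 'better'])
    # Keep the original file name; only the folder changes (an assumption of this sketch)
    target_file = Path(output_dir) / Path(source_file).name
    df.to_csv(target_file, index=False)
    return target_file


def clean_folder(input_dir, output_dir):
    """Process every .csv file directly inside input_dir (subfolders are not searched)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # sorted() collects the matches up front and makes the order deterministic
    sources = sorted(Path(input_dir).glob("*.csv"))
    return [clean_csv(path, output_dir) for path in sources]
```

A call such as clean_folder("input_folder", "output_folder") would then write one cleaned file per input file; both folder names are placeholders.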

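Following up on the answers by gtomer and Aviv Yaniv, I'd like to extend them by showing how to get the names of the files in the target directory. You can use the following code to find the CSV files in a directory.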

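import os
import pandas as pd

csv_files = []
directory = os.path.join("Path to the directory")
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            print(file)
            # Join with root so the file is found regardless of the working directory.
            # on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0.
            csv_files.append(pd.read_csv(os.path.join(root, file), on_bad_lines='skip'))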

You can then run your logic in a for loop, iterating over each file.
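
If only the file names are needed, and not the loaded data, the standard-library glob module can collect them in one call. This is a minimal sketch; the helper name find_csv_files and its recursive flag are made up for this example.

```python
import glob
import os


def find_csv_files(directory, recursive=True):
    """Return the sorted paths of the .csv files in directory.

    With recursive=True, subfolders are searched as well.
    """
    if recursive:
        # "**" matches zero or more folder levels when recursive=True
        pattern = os.path.join(directory, "**", "*.csv")
    else:
        pattern = os.path.join(directory, "*.csv")
    return sorted(glob.glob(pattern, recursive=recursive))
```

Each returned path can then be handed to whatever per-file function you use, one file per loop iteration.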

You can loop over the files:

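import pandas as pd

header_names = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6']
files = ['file1.csv', 'file2.csv', 'file3.csv']
frames = []
for file in files:
    df = pd.read_csv(file, delimiter="|", skiprows=1, names=header_names)
    df['column2'] = df['column2'].replace(['bad', 'worse'], ['good', 'better'])
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; concatenate the collected frames instead
agg_df = pd.concat(frames, ignore_index=True)
agg_df.to_csv('new.csv', index=False)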
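
Note that the loop above merges every input into a single new.csv. If that is what you want, it can help to record which file each row came from. A rough sketch, assuming the same six-column layout as in the question; the function name combine_csv_files and the extra source_file column are hypothetical additions:

```python
import pandas as pd

HEADER_NAMES = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6']


def combine_csv_files(files):
    """Clean each file and return one DataFrame with a source_file column."""
    frames = []
    for file in files:
        df = pd.read_csv(file, delimiter="|", skiprows=1, names=HEADER_NAMES)
        df['column2'] = df['column2'].replace(['bad', 'worse'], ['good', 'better'])
        # Tag every row with the file it was read from
        frames.append(df.assign(source_file=str(file)))
    # pd.concat raises ValueError if files is empty
    return pd.concat(frames, ignore_index=True)
```

The combined frame can be written out with to_csv(..., index=False) exactly as before.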

Alternatively, with Dask:

import pandas as pd
import dask.dataframe as dd
header_names = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6']
df = dd.read_csv("./directory/*.csv", delimiter= "|", skiprows=1, names=header_names)
df['column2'] = df['column2'].replace(['bad', 'worse'],['good', 'better'])
df.to_csv("./directory/export-*.csv")

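There are many ways to change the export names. The * in read_csv reads every file in the directory whose name ends in .csv. On export, the * in to_csv is replaced by a number, so each output file (one per partition) gets its own name. No for loop is needed with this approach.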
