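In my script I loop over multiple subdirectories and build data frames from three files in each one. I want to write the merged output into each subdirectory, but my code fails with the error "df1 is not defined" on the line
dfmerge1 = pd.merge(df1, df2, on=['genome', 'contig'], how='outer')
This is probably because not all of the files exist in every subdirectory, so the script stops. I would like the script to move on to the next subdirectory when a subdirectory does not contain all three files. How can I do that?
My code is: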
import os
import pandas as pd

print('Start merging contig files')
for root, dirs, files in os.walk(os.getcwd()):
    filepath = os.path.join(root, 'genes.faa.genespercontig.csv')
    if os.path.isfile(filepath):
        with open(filepath, 'r') as f1:
            df1 = pd.read_csv(f1, header=None, delim_whitespace=True, names=["contig", "genes"])
        df1['genome'] = os.path.basename(os.path.dirname(filepath))
    filepath = os.path.join(root, 'hmmer.analyze.txt.results.txt')
    if os.path.isfile(filepath):
        with open(filepath, 'r') as f2:
            df2 = pd.read_csv(f2, header=None, delim_whitespace=True, names=["contig", "SCM"])
        df2['genome'] = os.path.basename(os.path.dirname(filepath))
    filepath = os.path.join(root, 'genes.fna.output_blastplasmiddb.out.count_plasmiddbhit.out')
    if os.path.isfile(filepath):
        with open(filepath, 'r') as f3:
            df3 = pd.read_csv(f3, header=None, delim_whitespace=True, names=["contig", "plasmid_genes"])
        df3['genome'] = os.path.basename(os.path.dirname(filepath))
    # merge dataframes
    dfmerge1 = pd.merge(df1, df2, on=['genome', 'contig'], how='outer')
    df_end = pd.merge(dfmerge1, df3, on=['genome', 'contig'], how='outer')
    # set NaN in columns to 0
    nan_cols = df_end.columns[df_end.isnull().any(axis=0)]
    for col in nan_cols:
        df_end[col] = df_end[col].fillna(0).astype(int)
    df_end.to_csv(os.path.join(root, 'outputgenesdf.csv'))
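You correctly check that filepath exists, but you do not handle the case where the file does not exist. So if the file is missing, df1 will either hold the value left over from the previous loop iteration or, if this is the first pass through the loop, be undefined. Add an else branch that skips to the next subdirectory: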
if os.path.isfile(filepath):
    with open(filepath, 'r') as f1:
        df1 = pd.read_csv(f1, header=None, delim_whitespace=True, names=["contig", "genes"])
    df1['genome'] = os.path.basename(os.path.dirname(filepath))
else:
    # file is missing: skip to the next subdirectory
    continue
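
The same else branch is needed for df2 and df3, otherwise a missing second or third file leaves a stale or undefined frame in exactly the same way. As a rough sketch of one way to apply the check to all three files at once, the version below tests for the files up front and skips the directory if any is absent. The function name merge_contig_files and the SOURCES mapping are hypothetical names introduced for this example; the file names and column names come from the question. It also assumes the value columns hold integer counts, and it uses sep=r"\s+" because delim_whitespace is deprecated in recent pandas releases.

```python
import os
import pandas as pd

# Hypothetical mapping: input file name -> value column it contributes.
# File and column names are the ones used in the question.
SOURCES = {
    "genes.faa.genespercontig.csv": "genes",
    "hmmer.analyze.txt.results.txt": "SCM",
    "genes.fna.output_blastplasmiddb.out.count_plasmiddbhit.out": "plasmid_genes",
}


def merge_contig_files(base_dir):
    """Write outputgenesdf.csv into every directory that holds all three inputs.

    Returns the list of output paths that were written.
    """
    written = []
    for root, dirs, files in os.walk(base_dir):
        # Skip this directory unless every expected file is present.
        if not all(name in files for name in SOURCES):
            continue

        genome = os.path.basename(root)
        df_end = None
        for name, column in SOURCES.items():
            df = pd.read_csv(os.path.join(root, name), header=None,
                             sep=r"\s+", names=["contig", column])
            df["genome"] = genome
            # Outer merge keeps contigs that appear in only some of the files.
            df_end = df if df_end is None else pd.merge(
                df_end, df, on=["genome", "contig"], how="outer")

        # Contigs missing from a file are NaN after the outer merge.
        # Assumption: the value columns are integer counts.
        value_cols = list(SOURCES.values())
        df_end[value_cols] = df_end[value_cols].fillna(0).astype(int)

        out_path = os.path.join(root, "outputgenesdf.csv")
        df_end.to_csv(out_path)
        written.append(out_path)
    return written
```

Calling merge_contig_files(os.getcwd()) walks the same tree as the original script. Because the frames are rebuilt from scratch inside each iteration, nothing from a previous subdirectory can leak into the next one.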