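How do I concatenate a DataFrame inside a function that is called once for each file?

I have a folder 'Data' that contains 5 files. Each file is passed, one after another, through the 'filter_seq' function. This function contains several filters that reduce the data in the file.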

def filter_seq(df2,count):
    print('Filter 1.' + str(count))
    T1_df = df2[((df2['Angle'] > (df2['Sim_Angle'] - 1)) & (df2['Angle'] < (df2['Sim_Angle'] + 1)))]
    T1_df = T1_df[((T1_df['Velocity'] > (T1_df['Sim_Velocity'] - 2)) & (T1_df['Velocity'] < (T1_df['Sim_Velocity'] + 2)))]

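After this filtering, I want one more dataframe that contains the filtered dataframes of all the files.

Suppose T1_df has shape 100 × 15 after filtering file 1 and shape 89 × 15 after filtering file 2. I want a final dataframe of shape 189 × 15.

How do I get the final dataframe? And how can I improve the filtering function?

The simplest solution is probably to append every filtered dataframe to a list and then use the pd.concat function. For example: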

import numpy as np
import pandas as pd

def filter_and_append(df, l):
    """
    df is the dataframe to be filtered and appended to the list.
    l is the list the filtered dataframe will be appended to.
    """
    df_filtered = df  # put your filter logic here
    l.append(df_filtered)
    return l

l = []
for file in range(3):
    # here you could load the data, but just create a toy df for illustration
    df_tmp = pd.DataFrame(np.random.randn(3, 3))
    l = filter_and_append(df_tmp, l)

# stack the filtered dataframes row-wise
full_df = pd.concat(l, axis=0)
full_df

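If you need to keep track of which file the data came from (for example, to make sure the index is unique, which it is not in my example), you can handle that inside the filter-and-append function, for example: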

def filter_and_append(df, file, l):
    """
    df is the dataframe to be filtered and appended to the list.
    file is the file the data was loaded from.
    l is the list the filtered dataframe will be appended to.
    """
    df_filtered = df.copy()  # put your filter logic here; copy so the input is not modified
    df_filtered['file_name'] = file
    # add file_name as an extra index level (set_index, not reset_index, takes append=True)
    df_filtered.set_index('file_name', append=True, inplace=True)
    l.append(df_filtered)
    return l
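As an alternative to adding the file name inside the function, pd.concat can label each dataframe itself when it is given a dict. The sketch below assumes the filtered dataframes are collected in a dict keyed by file name; the names "file_1" and "file_2" and the level names "file_name" and "row" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical file names; in practice these would come from the 'Data' folder.
frames = {}
for file_name in ["file_1", "file_2"]:
    # Toy data standing in for a filtered dataframe.
    frames[file_name] = pd.DataFrame(np.random.randn(3, 3))

# Passing a dict makes its keys the outer level of a MultiIndex,
# so every row is tagged with the file it came from.
full_df = pd.concat(frames, names=["file_name", "row"])

# Rows of a single file can then be selected by name.
first_file_rows = full_df.loc["file_1"]
```

With this layout the combined index is unique even though every toy dataframe reuses the row labels 0, 1, 2.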

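As for how to improve your filtering function, that is hard to say without knowing what you mean by improve. For example, does it fail to produce the output you want? Is it too slow? Does it throw an error?

In terms of general readability, it may be worth splitting some of your logic across several lines, but if you are the only person reading the code, that is really just a matter of taste.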
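One way to split the logic across lines is to give each condition its own named mask. This is only a sketch: the function name filter_seq_readable and the parameters angle_tol and velocity_tol are introduced here for illustration, and the absolute-difference form is a rewrite of the question's pair of strict comparisons (|a - s| < 1 is the same as s - 1 < a < s + 1).

```python
import pandas as pd

def filter_seq_readable(df, angle_tol=1, velocity_tol=2):
    """Keep rows whose measured values lie strictly within the tolerances."""
    # Each condition gets a name, so the final selection reads like a sentence.
    angle_ok = (df["Angle"] - df["Sim_Angle"]).abs() < angle_tol
    velocity_ok = (df["Velocity"] - df["Sim_Velocity"]).abs() < velocity_tol
    return df[angle_ok & velocity_ok]

# Toy data: only the first row satisfies both conditions.
toy = pd.DataFrame({
    "Angle": [10.0, 10.5, 12.0],
    "Sim_Angle": [10.0, 10.0, 10.0],
    "Velocity": [5.0, 8.0, 5.0],
    "Sim_Velocity": [5.0, 5.0, 5.0],
})
kept = filter_seq_readable(toy)
```

Making the tolerances parameters also means the same function can be reused if the thresholds change.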

Take a simplified set of example dataframes:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Angle": np.random.randint(0, 20, 100),
                   "Sim_Angle": np.random.randint(0, 20, 100),
                   "Velocity": np.random.randint(0, 20, 100),
                   "Sim_Velocity": np.random.randint(0, 20, 100)})
df_2 = pd.DataFrame({"Angle": np.random.randint(0, 20, 100),
                     "Sim_Angle": np.random.randint(0, 20, 100),
                     "Velocity": np.random.randint(0, 20, 100),
                     "Sim_Velocity": np.random.randint(0, 20, 100)})
df_3 = pd.DataFrame({"Angle": np.random.randint(0, 20, 100),
                     "Sim_Angle": np.random.randint(0, 20, 100),
                     "Velocity": np.random.randint(0, 20, 100),
                     "Sim_Velocity": np.random.randint(0, 20, 100)})

Then we can create a list of the dataframes:

files = [df, df_2, df_3]

And a simplified version of the function:

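def filter_seq(df2, count):
    print('Filter 1.' + str(count))
    # inclusive="neither" (pandas >= 1.3) keeps the strict > and < of the original filter;
    # the default for .between() includes both endpoints
    T1_df = df2[(df2["Angle"].between(df2["Sim_Angle"]-1, df2["Sim_Angle"]+1, inclusive="neither")) &
                (df2["Velocity"].between(df2["Sim_Velocity"]-2, df2["Sim_Velocity"]+2, inclusive="neither"))]

    return T1_df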

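Here I used .between() so that df2["Angle"] does not have to be repeated, and I used & as you did, but merged the two filtering steps into one. Note that .between() includes both endpoints by default, whereas the comparisons in your original are strict; passing inclusive="neither" (pandas 1.3 or later) keeps the strict behaviour.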
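The endpoint behaviour of .between() is easy to check on a tiny Series. The values below are made up for illustration, and the string form of the inclusive argument assumes pandas 1.3 or later.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
lower = pd.Series([1, 1, 1])
upper = pd.Series([3, 3, 3])

# Default: both endpoints count as "between".
inclusive_mask = s.between(lower, upper)

# Strict comparison on both sides, matching (s > lower) & (s < upper).
strict_mask = s.between(lower, upper, inclusive="neither")
manual_mask = (s > lower) & (s < upper)

print(inclusive_mask.tolist())  # [True, True, True]
print(strict_mask.tolist())     # [False, True, False]
```

With integer data such as the np.random.randint examples here, the two variants can keep noticeably different numbers of rows, so it is worth choosing deliberately.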

Then you can use pd.concat() with a list comprehension that passes each of the files through your function:

df_all = pd.concat([filter_seq(f, i) for i, f in enumerate(files)], ignore_index=True)
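To tie this back to a folder of files, the same pattern can be applied to paths instead of in-memory dataframes. This is a rough sketch under stated assumptions: the question does not say what format the files are in, so CSV and pd.read_csv are assumed here, and the helper name filter_folder and the temporary demo folder are invented for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

COLUMNS = ["Angle", "Sim_Angle", "Velocity", "Sim_Velocity"]

def filter_seq(df):
    """Keep rows whose measured values lie strictly within the tolerances."""
    angle_ok = (df["Angle"] - df["Sim_Angle"]).abs() < 1
    velocity_ok = (df["Velocity"] - df["Sim_Velocity"]).abs() < 2
    return df[angle_ok & velocity_ok]

def filter_folder(folder, pattern="*.csv"):
    """Filter every matching file in `folder` and stack the results row-wise."""
    paths = sorted(Path(folder).glob(pattern))
    filtered = [filter_seq(pd.read_csv(path)) for path in paths]
    # pd.concat raises ValueError on an empty list, i.e. if no file matched.
    return pd.concat(filtered, ignore_index=True)

# Demo with a throwaway folder standing in for 'Data'.
with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    for i in range(2):
        sim_angle = rng.uniform(0, 20, 100)
        sim_velocity = rng.uniform(0, 20, 100)
        toy = pd.DataFrame({
            "Angle": sim_angle + rng.normal(0, 1, 100),
            "Sim_Angle": sim_angle,
            "Velocity": sim_velocity + rng.normal(0, 1, 100),
            "Sim_Velocity": sim_velocity,
        })
        toy.to_csv(Path(tmp) / f"file_{i}.csv", index=False)

    df_all = filter_folder(tmp)
    # Row counts of the individual filtered files add up to the combined total.
    rows_per_file = [len(filter_seq(pd.read_csv(p))) for p in sorted(Path(tmp).glob("*.csv"))]
```

The number of rows in df_all equals the sum of the per-file filtered row counts, which is the 100 + 89 = 189 behaviour described in the question.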
