Fastest way to repeatedly filter a pandas DataFrame in a loop



I have a DataFrame with 3 million rows (df1) and another DataFrame with 10k rows (df2). What is the fastest way to filter df1 for each row of df2?

This is exactly what I need to do in the loop:

for i in range(len(df2)):  # for each row of df2
    x = df1[(df1['column1'].isin([df2['info1'][i]]))
            & (df1['column2'].isin([df2['info2'][i]]))
            & (df1['column3'].isin([df2['info3'][i]]))]
    # ..... more code using x every iteration ......

This code is not fast enough to be usable.

Note that I use the .isin function even though there is always exactly one item inside it. I found that df1['column1'].isin([df2['info1'][i]]) is faster than df1['column1'] == df2['info1'][i].
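For reference, the two filter forms compared above select exactly the same rows; only their speed may differ. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Toy stand-in for df1; the values are hypothetical.
df1 = pd.DataFrame({"column1": [1, 2, 2, 3]})
target = 2  # plays the role of df2['info1'][i]

mask_isin = df1["column1"].isin([target])  # single-item .isin
mask_eq = df1["column1"] == target         # direct comparison

# Both boolean masks are identical, so either can be used in the loop.
assert mask_isin.equals(mask_eq)
```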

import pandas as pd
import numpy as np

def make_filter(x, y, match_dict, unique=False):
    # Build one boolean mask over x by AND-ing a per-column .isin filter
    # for every column of x that has a mapped column in y.
    mask = None
    for x_key in x.columns:
        if x_key in match_dict:
            y_key = match_dict[x_key]
            y_col = y[y_key]
            if unique:
                y_col = y_col.unique()  # deduplicate so .isin checks fewer values
            col_filter = x[x_key].isin(y_col)
            if mask is None:
                mask = col_filter
            else:
                mask = mask & col_filter
    return mask

def main():
    n_rows = 100
    x = np.random.randint(4, size=(n_rows, 2))
    x = pd.DataFrame(x, columns=["col1", "col2"])
    y = np.random.randint(2, 4, size=(n_rows, 2))
    y = pd.DataFrame(y, columns=["info1", "info2"])
    match_dict = {"col1": "info1", "col2": "info2"}
    z = make_filter(x, y, match_dict, unique=True)
    print(x[z])

main()
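An alternative worth considering (not part of the answer above, and the column names below are the hypothetical ones from the question): a single DataFrame.merge on the three key columns pairs every df1 row with each matching df2 row in one vectorized operation, and a groupby then recovers the per-row subsets x without a 10k-iteration filter loop. A sketch:

```python
import pandas as pd

# Small hypothetical stand-ins for df1 (3M rows) and df2 (10k rows).
df1 = pd.DataFrame({
    "column1": [1, 1, 2, 2, 3],
    "column2": ["a", "b", "a", "b", "a"],
    "column3": [10, 10, 20, 20, 30],
})
df2 = pd.DataFrame({"info1": [1, 2], "info2": ["a", "b"], "info3": [10, 20]})

# One merge matches each df1 row against every df2 row whose three
# key values agree, replacing the per-row boolean filtering.
matched = df1.merge(
    df2,
    left_on=["column1", "column2", "column3"],
    right_on=["info1", "info2", "info3"],
)

# Each group is the subset x that the original loop computed for one df2 row.
for _, x in matched.groupby(["info1", "info2", "info3"]):
    pass  # ..... more code using x every iteration ......
```

Whether this beats the .isin approach depends on how selective the keys are, but it moves all the matching work into one call instead of 10k passes over 3 million rows.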
