How to optimize a Pandas group-by with a +- margin



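I want to optimize my code that does a group-by with a +- margin in Python. I want to group my DataFrame, which consists of two columns ['1', '2'], based on a margin of +-1 for column '1' and +-10 for column '2'.
For example, a really simplified overview: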

[[273, 10],[274, 14],[275, 15]]

Expected output:

[[273, 10],[274, 14]],[[274, 14],[275, 15]]

My real data is far more complex: nearly 1 million data points, with values that look like 652.125454455.

For example, this code takes a very long time and never produces a result:

import time

import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
print("Random numbers were created")
df = pd.DataFrame({'1': a, '2': b})
df['id'] = df.index
# Margins for column '1' and column '2'.
MARGIN_1 = 1
MARGIN_2 = 10
tic = time.time()
group = []
for index, row in df.iterrows():
    # Keep every row that lies inside the +- window around the current row.
    filtered_df = df[(row['1'] - MARGIN_1 < df['1']) & (df['1'] < row['1'] + MARGIN_1) &
                     (row['2'] - MARGIN_2 < df['2']) & (df['2'] < row['2'] + MARGIN_2)]
    group.append(filtered_df[['id', '1']].values.tolist())
toc = time.time()
print(f"for loop: {1000 * (toc - tic)} ms")

I also tried:

data = df.groupby('1')['2'].apply(list).reset_index(name='irt')

but there is no margin here.

I did my best to understand what you want and came up with a very slow solution, but at least it is a solution.

import time

import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
df = pd.DataFrame({'1': a, '2': b})
# Sorted unique integer parts of each column, used only to find the bin range.
dfbl1 = np.sort(df['1'].apply(int).unique())
dfbl2 = np.sort(df['2'].apply(int).unique())
MARGIN1 = 1
MARGIN2 = 10
# Bin edges spaced one margin apart.
marg1array = np.array(range(dfbl1[0], dfbl1[-1], MARGIN1))
marg2array = np.array(range(dfbl2[0], dfbl2[-1], MARGIN2))
start = time.perf_counter()
groupmarg1 = []
groupmarg2 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        # Rows that fall strictly inside the current (column '1', column '2') bin.
        groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper) & (df['2'] > low2) & (df['2'] < upper2)].values.tolist())
print(time.perf_counter() - start)
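
As a possible shortcut for the nested loops above, here is a minimal sketch that assigns every row to a grid cell in one pass and lets groupby collect the cells. It is an illustration, not the timed code: the helper name group_by_grid and its parameters are made up, the cells are anchored at 0 instead of at the smallest observed value, and a value sitting exactly on a cell edge is kept in the cell to its right instead of being dropped.

```python
import numpy as np
import pandas as pd


def group_by_grid(df, width_1, width_2):
    """Group rows of df by the grid cell they fall into.

    Hypothetical helper: a cell spans width_1 along column '1' and
    width_2 along column '2'. Returns {(cell_1, cell_2): rows as lists}.
    Only non-empty cells appear in the result.
    """
    # Integer cell index along each axis (assumption: cells start at 0).
    cell_1 = np.floor(df['1'] / width_1).astype(int).rename('cell_1')
    cell_2 = np.floor(df['2'] / width_2).astype(int).rename('cell_2')
    # One groupby call replaces the two nested Python loops.
    return {key: part.values.tolist()
            for key, part in df.groupby([cell_1, cell_2])}
```

It could be called as group_by_grid(df, MARGIN1, MARGIN2). Note that this is fixed, non-overlapping binning like the loops above, not a separate +- window around every row.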

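I also tried running each loop separately and intersecting the results, which should be faster, but because we store .values.tolist(), I could not find anything faster than the following.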

start = time.perf_counter()
groupmarg1 = []
groupmarg2 = []
# First pass: split the frame into column-'1' bins.
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper)])
newgroup = []
# Second pass: split each column-'1' bin into column-'2' bins.
for subgroup in groupmarg1:
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        newgroup.append(subgroup.loc[(subgroup['2'] > low2) & (subgroup['2'] < upper2)].values.tolist())
print(time.perf_counter() - start)

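This takes about 9 minutes on my machine. Oh, and you need to filter out the empty groups. The loop above already stores each group as .values.tolist(), so the empty ones are plain empty lists and can be dropped like this:

gr2 = [grp for grp in newgroup if grp]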
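
For the per-row window from the question (every row gets its own +- window, so groups overlap), one option that might help is to sort by column '1' once and use np.searchsorted to locate each row's window, so that the column-'2' check only touches the rows inside that window instead of the whole frame. A minimal sketch under those assumptions follows. neighbors_within_margins is a hypothetical helper name, it returns row positions rather than [id, '1'] pairs, and it has not been timed on the 300000-row data.

```python
import numpy as np


def neighbors_within_margins(x, y, margin_x, margin_y):
    """For each point i, return the sorted positions j such that
    x[i] - margin_x < x[j] < x[i] + margin_x and
    y[i] - margin_y < y[j] < y[i] + margin_y (point i itself included).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")  # positions sorted by x
    xs, ys = x[order], y[order]

    # Window bounds in the sorted array, strict on both sides:
    # lo = first index with xs > x_i - margin_x,
    # hi = first index with xs >= x_i + margin_x.
    lo = np.searchsorted(xs, xs - margin_x, side="right")
    hi = np.searchsorted(xs, xs + margin_x, side="left")

    groups = [None] * len(xs)
    for k in range(len(xs)):
        window = slice(lo[k], hi[k])
        # Only the rows already inside the x window are checked against y.
        mask = (ys[window] > ys[k] - margin_y) & (ys[window] < ys[k] + margin_y)
        groups[order[k]] = np.sort(order[window][mask])
    return groups
```

With the question's frame it could be called as neighbors_within_margins(df['1'].to_numpy(), df['2'].to_numpy(), 1, 10). It still loops in Python once per row, so it reduces the work per iteration rather than being a fully vectorized solution.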
