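I want to optimize my Python code that groups rows by a ± margin. My DataFrame has 2 columns, ['1', '2'], and I want to group it using a margin of ±1 on column '1' and ±10 on column '2'.
For example, a heavily simplified overview:
[[273, 10],[274, 14],[275, 15]]
Expected output:
[[273, 10],[274, 14]],[[274, 14],[275, 15]]
My real data is much more complex: almost 1 million data points, with values that look like 652.125454455.
For example, this code runs for a very long time without producing a result: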
import time
import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
print("Random numbers were created")
df = pd.DataFrame({'1': a, '2': b})
df['id'] = df.index
# Python identifiers cannot start with a digit, so 1_MARGIN / 2_MARGIN are renamed.
MARGIN_1 = 1
MARGIN_2 = 10
tic = time.time()
group = []
# One full scan of the DataFrame per row: O(n^2) comparisons overall.
for index, row in df.iterrows():
    filtered_df = df[(row['1'] - MARGIN_1 < df['1']) & (df['1'] < row['1'] + MARGIN_1) &
                     (row['2'] - MARGIN_2 < df['2']) & (df['2'] < row['2'] + MARGIN_2)]
    group.append(filtered_df[['id', '1']].values.tolist())
toc = time.time()
print(f"for loop: {1000 * (toc - tic)} ms")
I also tried:
data = df.groupby('1')['2'].apply(list).reset_index(name='irt')
but this has no margin.
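To pin down what "within the margin" means, here is a brute-force sketch on the toy data only. It assumes the margins are inclusive (`<=`), because that is what the expected output implies: 273 and 274 differ by exactly 1, so the strict `<` used in the loop above would not pair them. The names `points`, `close` and `pairs` are made up for this illustration, and the approach builds an n × n matrix, so it is meant for small inputs and not for a million rows.

```python
import numpy as np

# Toy data from the question: first entry is column '1', second is column '2'.
points = np.array([[273, 10], [274, 14], [275, 15]])
MARGIN_1 = 1   # allowed difference in column '1'
MARGIN_2 = 10  # allowed difference in column '2'

# Pairwise absolute differences, shape (n, n), via broadcasting.
d1 = np.abs(points[:, None, 0] - points[None, :, 0])
d2 = np.abs(points[:, None, 1] - points[None, :, 1])

# Assumption: inclusive margins (<=), as implied by the expected output.
close = (d1 <= MARGIN_1) & (d2 <= MARGIN_2)

# Keep each unordered pair once (i < j); this also drops self-matches.
i_idx, j_idx = np.nonzero(np.triu(close, k=1))
pairs = [[points[i].tolist(), points[j].tolist()] for i, j in zip(i_idx, j_idx)]
print(pairs)  # [[[273, 10], [274, 14]], [[274, 14], [275, 15]]]
```

Read this way, the expected output is the list of neighboring pairs: (273, 275) is left out because the values in column '1' differ by 2.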
I tried my best to understand what you want, and I came up with a very slow solution, but at least it is a solution.
import time
import pandas as pd
import numpy as np

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
df = pd.DataFrame({'1': a, '2': b})
# Sorted integer parts of each column, used only to find the range of the bins.
dfbl1 = np.sort(df['1'].apply(int).unique())
dfbl2 = np.sort(df['2'].apply(int).unique())
MARGIN1 = 1
MARGIN2 = 10
# Bin edges spaced MARGIN1 / MARGIN2 apart.
marg1array = np.array(range(dfbl1[0], dfbl1[-1], MARGIN1))
marg2array = np.array(range(dfbl2[0], dfbl2[-1], MARGIN2))
start = time.perf_counter()
groupmarg1 = []
# One boolean mask over the whole DataFrame for every (bin in '1', bin in '2') cell.
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper) &
                                 (df['2'] > low2) & (df['2'] < upper2)].values.tolist())
print(time.perf_counter() - start)
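The fixed-bin grouping above can also be computed without masking the whole DataFrame once per cell. The following is a minimal NumPy sketch, not a drop-in replacement: it assumes bins aligned to multiples of the margin (the loop above starts its bins at the smallest integer value instead), it assumes non-negative values in the second column, and the names `x`, `y`, `key` and `groups` are made up for the illustration. It also uses a smaller sample so that it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, only for reproducibility
x = rng.uniform(300, 1800, size=10_000)    # stands in for column '1'
y = rng.uniform(0, 7200, size=10_000)      # stands in for column '2'
MARGIN1, MARGIN2 = 1, 10

# Bin index of every row along each column (bin widths MARGIN1 and MARGIN2).
bin1 = np.floor(x / MARGIN1).astype(np.int64)
bin2 = np.floor(y / MARGIN2).astype(np.int64)

# Combine both bin indices into a single integer key per row.
# This is unambiguous because 0 <= bin2 <= bin2.max().
key = bin1 * (bin2.max() + 1) + bin2

# Sort the rows by key, then cut the sorted order wherever the key changes.
order = np.argsort(key, kind="stable")
cuts = np.flatnonzero(np.diff(key[order])) + 1
groups = np.split(order, cuts)  # one array of row indices per non-empty cell

print(len(groups), "non-empty cells")
```

With pandas, the same idea should be expressible as `df.groupby([df['1'] // MARGIN1, df['2'] // MARGIN2])`. Either way, only non-empty cells are produced, so no filtering is needed afterwards. Note that fixed bins are not the same thing as a ± margin: two rows that sit just on either side of a bin edge land in different groups even though they are closer than the margin.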
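I also tried running each loop separately and intersecting the results, which should be faster, but since we store .values.tolist(), I could not find anything faster than the following.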
start = time.perf_counter()
groupmarg1 = []
# First pass: slice the DataFrame into strips along column '1'.
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper)])
newgroup = []
# Second pass: split every strip along column '2'.
for subgroup in groupmarg1:
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        newgroup.append(subgroup.loc[(subgroup['2'] > low2) & (subgroup['2'] < upper2)].values.tolist())
print(time.perf_counter() - start)
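This takes about 9 minutes on my machine. Oh, and you will need to filter out the empty groups. Since newgroup already holds plain lists produced by .values.tolist(), you can filter them like this:
gr2 = [grp for grp in newgroup if grp]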
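If what is wanted is the true ± margin neighborhood of every row, as in the loop from the question, and not fixed bins, one possible approach is to sort by the first column and use `np.searchsorted` to limit each row's candidates to a narrow window. The sketch below is an illustration under assumptions and was not benchmarked on the real data: the function name `neighbors_within_margins` is made up, the margins are taken as inclusive, and every row counts as its own neighbor, as it does in the question's loop.

```python
import numpy as np

def neighbors_within_margins(x, y, margin_x, margin_y):
    """For each row i, return the sorted indices j with
    |x[j] - x[i]| <= margin_x and |y[j] - y[i]| <= margin_y (i itself included)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")   # row indices sorted by x
    xs, ys = x[order], y[order]
    # For the k-th sorted row, the slice [lo[k], hi[k]) covers all rows within margin_x.
    lo = np.searchsorted(xs, xs - margin_x, side="left")
    hi = np.searchsorted(xs, xs + margin_x, side="right")
    result = [None] * len(xs)
    for k in range(len(xs)):
        window = np.arange(lo[k], hi[k])                  # candidates by x only
        keep = np.abs(ys[window] - ys[k]) <= margin_y     # then filter by y
        result[order[k]] = np.sort(order[window[keep]])   # back to original row indices
    return result

# Toy data from the question.
groups = neighbors_within_margins([273, 274, 275], [10, 14, 15], 1, 10)
print([g.tolist() for g in groups])  # [[0, 1], [0, 1, 2], [1, 2]]
```

There is still a Python loop over the rows, but each iteration only touches the rows in its window instead of the whole DataFrame. For 300,000 values spread uniformly over 300 to 1800, a ±1 window holds roughly 400 rows, so the total work is on the order of 10^8 comparisons instead of 9 × 10^10.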