Applying a function to *staggered* groups of a Pandas DataFrame



I have a DataFrame of 3-axis data, with a membership label used for grouping:

import pandas as pd

df = pd.DataFrame([[0, 1, 2, 0],
                   [-1, 0, 1, 0],
                   [-2, 0, 3, 1],
                   [1, 1, 3, 1],
                   [1, 0, 2, 2],
                   [1, 0, 3, 2],
                   [6, 2, 1, 5],
                   [-4, 3, 0, 5],
                   [1, 0, -1, 6],
                   [0, 0, 3, 6]], columns=['x', 'y', 'z', 'member'])

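My goal is a bit contrived: I want to find the pairwise distances between each group of points and the next n_skip groups, going from the smallest to the largest. This n_skip is what I mean by staggering: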

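For example, with n_skip=2, I want to find the distances for

  • points with member == 0 -> against member == 1, 2
  • points with member == 1 -> against member == 2, 5
  • points with member == 2 -> against member == 5, 6
  • points with member == 5 -> against member == 6
  • nothing is computed for member == 6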
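The pairing rule in the list above can be sketched as follows. This is only an illustration of which groups face which; the names `labels` and `targets` are hypothetical and not part of the original post. It assumes "next" means next in sorted order of the labels that actually occur, so the gap between 2 and 5 is skipped.

```python
import pandas as pd

# Member labels from the example above; note the gap between 2 and 5.
members = pd.Series([0, 0, 1, 1, 2, 2, 5, 5, 6, 6], name="member")
n_skip = 2

# Sorted unique labels, so "next" means next in rank, not label + 1.
labels = sorted(members.unique())

# Map each label to the (up to) n_skip labels that follow it.
targets = {
    int(label): [int(t) for t in labels[pos + 1 : pos + 1 + n_skip]]
    for pos, label in enumerate(labels)
}
print(targets)
# {0: [1, 2], 1: [2, 5], 2: [5, 6], 5: [6], 6: []}
```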

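Is there a high-performance way to do this without nested for loops? This was hinted at in this Q&A. Intuitively, I cannot use a conventional apply to parallelize a function over Pandas DataFrames. What is a fast way to apply a function to staggered groups?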


EDIT 1: My solution (works for one axis only):

import numpy as np
from scipy.spatial.distance import cdist

# Organize by group membership (rows are assumed to be sorted by member)
groups = df.groupby('member')
# Define constants
max_member = 6
n_skip = 2
start_row = 0
matrix = np.zeros((df.shape[0], df.shape[0]))
# Iterate over each candidate source label
for i in range(max_member):
    try:
        pts_curr = groups.get_group(i)
    except KeyError:
        continue
    # Save end row index
    end_row = start_row + pts_curr.shape[0]
    # Save start col index
    start_col = end_row

    # Grab the destination group nodes: labels i+1 .. i+n_skip, capped at
    # max_member (inclusive). This steps by label value, so the missing
    # labels 3 and 4 still count toward n_skip.
    for j in range(i + 1, min(i + n_skip, max_member) + 1):
        try:
            pts_next = groups.get_group(j)
        except KeyError:
            continue
        # Save end col index
        end_col = start_col + pts_next.shape[0]
        # Calculate cdist along the z axis only
        z_dist = cdist(pts_curr[['z']], pts_next[['z']])
        # Save results in matrix at right positions
        matrix[start_row:end_row, start_col:end_col] = z_dist

        # update col index
        start_col = end_col
    # update row index
    start_row = end_row

A cross merge on 4K rows is not too bad (it produces about 16M rows). Let's try a cross merge followed by a query:

n = 2
# dummy key
df['dummy'] = 1
# this is the member group number
df['rank'] = df['member'].rank(method='dense')
# cross merge and filter
new_df = (df.merge(df, on='dummy')
            .query('rank_x<rank_y<=rank_x+@n')
         )
# euclidean distance
dist = (new_df[['x_x','y_x','z_x']].sub(new_df[['x_y','y_y','z_y']].values)**2).sum(1)**.5
# output dataframe with member label
pd.DataFrame({'member1': new_df['member_x'], 'member2': new_df['member_y'],
              'dist': dist})

Output:

    member1  member2      dist
2         0        1  2.449490
3         0        1  1.414214
4         0        2  1.414214
5         0        2  1.732051
12        0        1  2.236068
13        0        1  3.000000
14        0        2  2.236068
15        0        2  2.828427
24        1        2  3.162278
25        1        2  3.000000
26        1        5  8.485281
27        1        5  4.690416
34        1        2  1.414214
35        1        2  1.000000
36        1        5  5.477226
37        1        5  6.164414
46        2        5  5.477226
47        2        5  6.164414
48        2        6  3.000000
49        2        6  1.414214
56        2        5  5.744563
57        2        5  6.557439
58        2        6  4.000000
59        2        6  1.000000
68        5        6  5.744563
69        5        6  6.633250
78        5        6  5.916080
79        5        6  5.830952
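As a possible variation on the merge above, the dummy key can be dropped in favour of `how="cross"`. This is a minimal sketch, assuming pandas 1.2 or newer (where `how="cross"` is available); the names `pairs`, `left`, `right` and `result` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2, 0], [-1, 0, 1, 0], [-2, 0, 3, 1], [1, 1, 3, 1], [1, 0, 2, 2],
     [1, 0, 3, 2], [6, 2, 1, 5], [-4, 3, 0, 5], [1, 0, -1, 6], [0, 0, 3, 6]],
    columns=["x", "y", "z", "member"],
)
n = 2

# Dense rank turns the labels 0, 1, 2, 5, 6 into 1, 2, 3, 4, 5.
df["rank"] = df["member"].rank(method="dense")

# how="cross" (assumes pandas >= 1.2) replaces the dummy-key merge.
pairs = df.merge(df, how="cross").query("rank_x < rank_y <= rank_x + @n")

# Euclidean distance between the two points of each pair.
left = pairs[["x_x", "y_x", "z_x"]].to_numpy()
right = pairs[["x_y", "y_y", "z_y"]].to_numpy()
result = pd.DataFrame(
    {
        "member1": pairs["member_x"],
        "member2": pairs["member_y"],
        "dist": np.linalg.norm(left - right, axis=1),
    }
)
print(result.shape)
```

The cross product still has one row per pair of points before filtering, so memory use grows quadratically with the number of rows, just as with the dummy key.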

Option 2: If you have a large DataFrame, a loop might not be too bad:

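import numpy as np
from scipy.spatial.distance import cdist

ret = []
# iterate over the ranks in ascending order
for i in sorted(df['rank'].unique()):
    this_group = df['rank'] == i
    # ranks in (i, i+n], matching the query above;
    # inclusive='right' needs pandas >= 1.3
    other_groups = df['rank'].between(i, i + n, inclusive='right')
    t = df.loc[this_group, ['x', 'y', 'z']].values
    o = df.loc[other_groups, ['x', 'y', 'z']].values
    # all pairwise distances between this group and the following groups
    ret.append(cdist(t, o).ravel())
dist = np.concatenate(ret)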
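The loop above returns only the distances. One possible way to keep the member labels next to each distance is sketched below; it is an assumption about what a labelled variant could look like, not part of the original answer, and the names `groups`, `rows` and `result` are illustrative. It loops over pairs of groups rather than over points, so the per-point work stays inside `cdist`.

```python
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame(
    [[0, 1, 2, 0], [-1, 0, 1, 0], [-2, 0, 3, 1], [1, 1, 3, 1], [1, 0, 2, 2],
     [1, 0, 3, 2], [6, 2, 1, 5], [-4, 3, 0, 5], [1, 0, -1, 6], [0, 0, 3, 6]],
    columns=["x", "y", "z", "member"],
)
n = 2
coords = ["x", "y", "z"]

# Labels that actually occur, in ascending order.
labels = sorted(df["member"].unique())
# Coordinates of each group as a NumPy array, keyed by label.
groups = {label: g[coords].to_numpy() for label, g in df.groupby("member")}

rows = []
for pos, src in enumerate(labels):
    # The next n labels in sorted order (fewer near the end).
    for dst in labels[pos + 1 : pos + 1 + n]:
        d = cdist(groups[src], groups[dst]).ravel()
        rows.append(pd.DataFrame({"member1": src, "member2": dst, "dist": d}))

result = pd.concat(rows, ignore_index=True)
print(len(result))
```

On the example data this yields one row per point pair, grouped by source label and then by destination label, so the row order differs from the merge-based output.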
