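Given a dataframe of edges, I want to create an aggregated dataframe with a column "frequency" that contains the total number of edges between two nodes. I also want the edge list to be undirected, so if A=>B = 1 exists, I also want a row for the reverse direction, e.g. B=>A = 1.
Original data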
import pandas as pd
data = pd.DataFrame({'x': ['jane','jane','jack','bill','jack','terra'],
'y': ['jack','jack','jane','terra','terra', 'jack']})
x y
0 jane jack
1 jane jack
2 jack jane
3 bill terra
4 jack terra
5 terra jack
Expected output
x y frequency
0 jane jack 3
1 jack jane 3
2 bill terra 1
3 jack terra 2
4 terra jack 2
What I tried
## Get size of one direction for edge list
data=data.groupby(['x','y']).size().reset_index()
## rename column to 'frequency'
data.rename(columns = {0:'frequency'}, inplace = True)
## copy dataframe to calculate other direction of edgelist
data2 = data.copy()
## reverse the names of columns
data2.rename(columns = {'x':'y', 'y':'x'}, inplace = True)
## merge
data2 = data.merge(data2, left_on=['x','y'],right_on=['x','y'], suffixes = ['1','2'])
## add the frequency to get total edge strength
data2['frequency'] = data2['frequency1']+data2['frequency2']
data3 = data2[['x','y','frequency']]
x y frequency
0 jack jane 3
1 jack terra 2
2 jane jack 3
3 terra jack 2
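This final result works well, and I don't care about the row order. The problem is that I'm missing the edge between bill and terra. It is lost because of the way I merge: the original data only has bill=>terra with no terra=>bill, so the inner merge finds no match and drops that row.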
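To make the dropped row visible, here is a minimal sketch that swaps the inner merge for a left merge with `indicator=True`. The names `counts`, `reverse`, `merged`, and `dropped` are introduced only for this illustration and are not part of the attempt above.

```python
import pandas as pd

data = pd.DataFrame({'x': ['jane', 'jane', 'jack', 'bill', 'jack', 'terra'],
                     'y': ['jack', 'jack', 'jane', 'terra', 'terra', 'jack']})

# Count edges per direction, naming the size column directly.
counts = data.groupby(['x', 'y']).size().reset_index(name='frequency')

# Same counts with the direction reversed (hypothetical helper frame).
reverse = counts.rename(columns={'x': 'y', 'y': 'x'})

# A left merge keeps every row of `counts`; the `_merge` column records
# whether a reverse edge was found ('both') or not ('left_only').
merged = counts.merge(reverse, on=['x', 'y'], how='left',
                      suffixes=('1', '2'), indicator=True)

# Rows that an inner merge would have discarded.
dropped = merged.loc[merged['_merge'] == 'left_only', ['x', 'y', 'frequency1']]
print(dropped)
```

With the sample data, only the bill/terra edge should be flagged as `left_only`.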
How can I identify the rows that will be dropped and concatenate them back, or is there a better approach?
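One possibly simpler route, sketched here under the assumption that every edge should appear in both directions (so terra=>bill is emitted too), is to stack the data on top of a column-swapped copy and count once. The names `swapped`, `both`, and `edges` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'x': ['jane', 'jane', 'jack', 'bill', 'jack', 'terra'],
                     'y': ['jack', 'jack', 'jane', 'terra', 'terra', 'jack']})

# Swap the column labels so each edge also appears in the reverse direction.
swapped = data.rename(columns={'x': 'y', 'y': 'x'})

# pd.concat aligns on column names, so the swapped rows land in the right columns.
both = pd.concat([data, swapped], ignore_index=True)

# Each (x, y) pair now counts the edges seen in either direction.
edges = both.groupby(['x', 'y']).size().reset_index(name='frequency')
print(edges)
```

This avoids the merge entirely, so no row can be dropped for lack of a reverse match. One caveat: a self-loop such as A=>A would be counted twice, which does not occur in the sample data.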
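Nonetheless, I did find a way to get the result I want. The method uses a nested apply(). First, I create the dataframe with the frequency column for the cases that do work (as outlined above); it serves as the base frame onto which the cases that do not work are concatenated.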
## Same steps as before to calculate frequency column
data=data.groupby(['x','y']).size().reset_index()
data.rename(columns = {0:'frequency'}, inplace = True)
## Identify which of those cases will not work using a nested apply function.
## Inner loop returns the opposite direction of the edge and outer loop checks
## the sum of all cases where the original edge has another in the other
## direction.
## Code these as 0's and 1's, and filter 0 to identify which edges
## need to be manually created and appended to the final result.
remaining_rows = data.loc[data.apply(lambda x: 1 if sum((x['x'], x['y']) ==
data.apply(lambda x: (x['y'], x['x']), axis = 1))>=1 else 0, axis = 1) ==0]
remaining_rows
x y frequency
0 bill terra 1
##Create the edges for the other direction and concat
remaining_rows2 = remaining_rows.copy()
remaining_rows2.rename(columns = {'x':'y','y':'x'}, inplace = True)
remaining_rows = pd.concat([remaining_rows, remaining_rows2])
remaining_rows
x y frequency
0 bill terra 1
0 terra bill 1
## Yes! This is the piece that I can concat onto the other data frame so
## that I have a complete edge list with a frequency column for each edge
## A=>B and B=>A. After concatenating, remember to pass ignore_index=True
## to pd.concat (or call reset_index(drop=True)) so the index is renumbered.
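The final concatenation step could look roughly like the sketch below. To keep it self-contained, `data3` and `remaining_rows` are rebuilt by hand from the outputs printed above rather than recomputed, and `final` is a name chosen just for this example.

```python
import pandas as pd

# Edges that survived the inner merge (values copied from the output above).
data3 = pd.DataFrame({'x': ['jack', 'jack', 'jane', 'terra'],
                      'y': ['jane', 'terra', 'jack', 'jack'],
                      'frequency': [3, 2, 3, 2]})

# Edges recovered by the nested apply, in both directions; note the
# duplicated index label 0, as in the printed output.
remaining_rows = pd.DataFrame({'x': ['bill', 'terra'],
                               'y': ['terra', 'bill'],
                               'frequency': [1, 1]}, index=[0, 0])

# ignore_index=True discards the old labels and renumbers rows from 0.
final = pd.concat([data3, remaining_rows], ignore_index=True)
print(final)
```

Without `ignore_index=True` the result would carry repeated index labels, which can make later label-based lookups ambiguous.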