Converting a Pandas DataFrame to an undirected edge list



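Given a DataFrame of edges, I want to create an aggregated DataFrame with a column "frequency" that holds the total number of edges between two nodes. I also want the edge list to be undirected, so if A=>B=1 exists, I also want a row for the reverse direction, e.g. B=>A=1.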

Original data

import pandas as pd
data = pd.DataFrame({'x': ['jane','jane','jack','bill','jack','terra'],
'y': ['jack','jack','jane','terra','terra', 'jack']})
       x      y
0   jane   jack
1   jane   jack
2   jack   jane
3   bill  terra
4   jack  terra
5  terra   jack

Expected output

       x      y  frequency
0   jane   jack          3
1   jack   jane          3
2   bill  terra          1
3   jack  terra          2
4  terra   jack          2

What I tried

## Get size of one direction for the edge list
data = data.groupby(['x','y']).size().reset_index()
## rename column to 'frequency'
data.rename(columns = {0:'frequency'}, inplace = True)
## copy dataframe to calculate other direction of the edge list
data2 = data.copy()
## reverse the names of columns
data2.rename(columns = {'x':'y', 'y':'x'}, inplace = True)
## merge (the default is an inner join)
data2 = data.merge(data2, left_on=['x','y'], right_on=['x','y'], suffixes = ['1','2'])
## add the frequencies to get total edge strength
data2['frequency'] = data2['frequency1'] + data2['frequency2']
data3 = data2[['x','y','frequency']]

       x      y  frequency
0   jack   jane          3
1   jack  terra          2
2   jane   jack          3
3  terra   jack          2

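This final result works well for most edges, and I don't care about the order of the rows. The problem is that the edge between bill and terra is missing. It gets lost because of the way I merge: I originally have only bill=>terra and no terra=>bill, so that row is dropped.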

I would like to know how to identify the rows that will be dropped and concatenate them back on, or whether there is a better way.
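One possible way to see which rows an inner merge discards is to use a left merge with `indicator=True`, which adds a `_merge` column marking rows that have no reverse edge. The following is only a sketch under that assumption; the names `counts`, `flipped`, `checked`, and `dropped` are made up for illustration. Treating a missing reverse edge as a count of 0 then keeps bill=>terra in the result.

```python
import pandas as pd

data = pd.DataFrame({'x': ['jane', 'jane', 'jack', 'bill', 'jack', 'terra'],
                     'y': ['jack', 'jack', 'jane', 'terra', 'terra', 'jack']})

# One row per directed edge with its count.
counts = data.groupby(['x', 'y']).size().reset_index(name='frequency')

# The same counts with the direction of every edge flipped.
flipped = counts.rename(columns={'x': 'y', 'y': 'x'})

# A left merge keeps every directed edge from `counts`; indicator=True adds
# a `_merge` column that is 'left_only' when no reverse edge exists.
checked = counts.merge(flipped, on=['x', 'y'], how='left',
                       suffixes=('1', '2'), indicator=True)

# These are the rows an inner merge would have dropped.
dropped = checked.loc[checked['_merge'] == 'left_only', ['x', 'y', 'frequency1']]
print(dropped)

# Count a missing reverse edge as 0, then total both directions.
checked['frequency'] = (checked['frequency1']
                        + checked['frequency2'].fillna(0).astype(int))
result = checked[['x', 'y', 'frequency']]
print(result)
```

Note that this keeps only the directions that occur in the original data, so it does not create a terra=>bill row.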

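Nevertheless, I found a way to get the result I want. The approach uses a nested apply(). First, I create the DataFrame with the frequency column for the cases that work (already outlined above); it serves as the base frame onto which the cases that do not work are concatenated.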

## Same steps as before to calculate the frequency column
data = data.groupby(['x','y']).size().reset_index()
data.rename(columns = {0:'frequency'}, inplace = True)
## Identify which edges have no counterpart in the other direction using a
## nested apply. The inner apply returns every edge with its direction
## reversed; the outer apply checks, for each original edge, whether at
## least one reversed edge matches it. Code these as 1's and 0's, and keep
## the 0's: those are the edges that need to be created manually and
## appended to the final result.
remaining_rows = data.loc[data.apply(lambda x: 1 if sum((x['x'], x['y']) ==
    data.apply(lambda x: (x['y'], x['x']), axis = 1)) >= 1 else 0, axis = 1) == 0]
remaining_rows
      x      y  frequency
0  bill  terra          1
## Create the edges for the other direction and concat
remaining_rows2 = remaining_rows.copy()
remaining_rows2.rename(columns = {'x':'y','y':'x'}, inplace = True)
remaining_rows = pd.concat([remaining_rows, remaining_rows2])
remaining_rows
       x      y  frequency
0   bill  terra          1
0  terra   bill          1
## Yes! This is the piece that I can concat onto the other data frame so
## that I have a complete edge list with a frequency column for each edge
## A=>B and B=>A. When concatenating, remember to pass ignore_index=True
## to pd.concat (or call reset_index(drop=True) afterwards) so the index
## has no duplicates.
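
As a possible shorter alternative to the nested apply, the original (un-aggregated) edges could be stacked on top of a copy with the endpoints swapped and then counted in one pass. This is a sketch rather than the method used above; `both_directions` and `edges` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'x': ['jane', 'jane', 'jack', 'bill', 'jack', 'terra'],
                     'y': ['jack', 'jack', 'jane', 'terra', 'terra', 'jack']})

# Stack the original edges on top of a copy with x and y swapped, so every
# edge appears once in each direction. concat aligns columns by name.
both_directions = pd.concat(
    [data, data.rename(columns={'x': 'y', 'y': 'x'})],
    ignore_index=True,
)

# Counting rows per (x, y) pair now gives the undirected frequency
# for both A=>B and B=>A.
edges = (both_directions.groupby(['x', 'y'])
         .size()
         .reset_index(name='frequency'))
print(edges)
```

With this sketch a self-loop (x equal to y) would be counted twice, so such rows would need separate handling if they can occur.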
