Finding and eliminating reciprocal pairs of data in Python

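This one is a bit of a puzzle. I'm refactoring some code, so the key goal is to keep things simple.

This code should locate the reciprocal pairs that exist in the data, so that I can deal with them (in another function). It must return the positions of the reciprocals, i.e. the rows of the df where they occur. Reciprocal means (a,b) == (b,a).

To explain my scenario: I did my best to recreate the situation in the simple data below. The original dataset is in fact a list of all possible permutations of "n" elements taken in pairs. At this point in the code, (a,b) may be a different thing from (b,a), so I need to keep both there and deal with them.

I used "random" to simulate the fact that, at the point in the code where I am, some reciprocals have already been eliminated by other filters/processing.

So the function I'm working on now, "reciprocals_locator", should point me to the rows of the dataframe where reciprocals still occur and need further processing.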

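The original solution, which I'm trying to refactor, was a complicated loop that manipulated the dataframe inside the loop. I don't like having any of that in my final code. I was able to refactor that loop so that it just builds the list I need, and then refactor it into a list comprehension, which is better but still fairly unreadable.

Question (1): Is there a better method/algorithm, or even an external/imported function, that does the trick in a leaner way? Right now, the solution above is somewhat incomplete, because it returns a list that also contains the reciprocals of the positions!

As shown below, I end up knowing that there are duplicates in rows (1,5), but it also shows the reciprocal solution (5,1)! I initially thought this was a simple case of "picking half" of the results. But this is probably due to the way itertools.product works, which, frankly, I haven't looked into in depth.

The same happens with my original loop implementation. Maybe there is a simpler, itertools-based solution that I'm not aware of yet.

Hopefully all of the code below is simple enough.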

### Problem scenario recreation
import itertools as it
import pandas as pd
import random as rd

## Original dataset
sourcelst = ['a','b','c','d','e']
pairs = [list(perm) for perm in it.permutations(sourcelst,2)]
df = pd.DataFrame(pairs, columns=['y','x'])

## Dataset after some sort of processing: some reciprocals are eliminated,
## but some stay and we need to process them separately
drop_rows = [rd.randint(0, len(pairs)-1) for _ in range(2)]
df.drop(index=drop_rows, inplace=True)
df.reset_index(inplace=True, drop=True)

## Finding reciprocals
### Original LOOPS implementation, this one shows the actual pairs
### for ease of analysis
def reciprocal_pairs_orig(df):
    reciprocals = []
    row_y = 0
    while row_y < len(df.index):
        for row_x in range(len(df.index)):
            if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y]):
                reciprocals.append([[df['y'].iloc[row_x], df['x'].iloc[row_x]],[df['y'].iloc[row_y], df['x'].iloc[row_y]]])
        row_y += 1
    return reciprocals

### List comprehension refactor, showing the pairs
def reciprocal_pairs_refactor(df):
    return [[[df['y'].iloc[row_x], df['x'].iloc[row_x]], [df['y'].iloc[row_y], df['x'].iloc[row_y]]]
            for row_y, row_x in it.product(range(len(df.index)), range(len(df.index)))
            if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y])]

### This is the actual function that I want to use
def reciprocal_locator_orig(df):
    reciprocals = []
    row_y = 0
    while row_y < len(df.index):
        for row_x in range(len(df.index)):
            if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y]):
                reciprocals.append([row_y, row_x])
        row_y += 1
    return reciprocals

### List comprehension refactor
def reciprocal_locator_refactor(df):
    return [[row_y, row_x]
            for row_y, row_x in it.product(range(len(df.index)), range(len(df.index)))
            if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y])]

product(range(n), range(n)) will give you n**2 pairs, including, as you noted, both (1,5) and (5,1), so it is not a great solution. Your original code, with its nested loops, does the same.
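To make the point about product concrete, here is a small check you could run. The value n = 4 is just an illustrative assumption, not something taken from your data:

```python
import itertools as it

n = 4  # hypothetical number of rows, chosen only for illustration

# product yields every ordered index pair: (i, i) as well as both (i, j) and (j, i)
ordered = list(it.product(range(n), range(n)))

# Keep only the pairs whose mirror image is also present and differs from them
mirrored = [(i, j) for i, j in ordered if i != j and (j, i) in ordered]

print(len(ordered))   # n**2 pairs in total
print(len(mirrored))  # every off-diagonal pair shows up in both orders
```

So any match found at (1,5) will necessarily be reported again at (5,1).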

A first improvement would be

for x in range(n-1):
    for y in range(x+1, n):
        ...

This is still O(n²), but at least it checks each possible pairing only once.
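The same "each pairing once" idea can be written with itertools.combinations, which yields each index pair (a, b) with a < b exactly once. This is only a sketch: the function name reciprocal_locator_combinations and the demo frame are my own inventions, and it assumes the column names 'y' and 'x' from your question.

```python
import itertools as it
import pandas as pd

def reciprocal_locator_combinations(df):
    """Return [row_a, row_b] positions (row_a < row_b) whose (y, x) values mirror each other."""
    # Pull the columns into plain lists once, instead of calling .iloc inside the loop
    ys = df['y'].tolist()
    xs = df['x'].tolist()
    return [[a, b]
            for a, b in it.combinations(range(len(df)), 2)
            if ys[a] == xs[b] and xs[a] == ys[b]]

# Small hand-made frame (illustrative data, not the frame from the question)
demo = pd.DataFrame([['a', 'b'], ['a', 'c'], ['b', 'a'], ['c', 'd']], columns=['y', 'x'])
print(reciprocal_locator_combinations(demo))  # [[0, 2]]
```

Each reciprocal is reported once, as positions in the current row order, which matches what you get from .iloc after reset_index.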

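But another approach could be more efficient: loop over the rows only once, while maintaining a dict whose keys are frozensets (normal sets are not hashable) holding the pairs you have already encountered, and whose values are the row numbers. For each row, you just check whether the pair is already in the dict. If it is, you have a "reciprocal" between the current row and the row whose number is stored in the dict; otherwise you add the new pair, and so on. Something like: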

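reciprocals = []  # (earlier_row, current_row) position pairs
seen = {}         # frozenset of the pair -> row where it was first seen
for row in range(len(df.index)):
    fs = frozenset([df['x'].iloc[row], df['y'].iloc[row]])
    if fs in seen:
        reciprocals.append((seen[fs], row))
    else:
        seen[fs] = row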
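One caveat with frozenset keys is that an exact duplicate row, such as (a,b) appearing twice, would also be flagged, because frozensets ignore order. If that matters, a variant is to key the dict on the ordered tuple and look up the mirrored tuple instead. The sketch below is that variant, not the frozenset version above; the name reciprocal_locator_seen and the demo data are made up for illustration.

```python
import pandas as pd

def reciprocal_locator_seen(df):
    """Single pass: return (earlier_row, later_row) positions of mirrored pairs."""
    seen = {}          # (y, x) -> position of the first row holding that ordered pair
    reciprocals = []
    for row, (y, x) in enumerate(zip(df['y'], df['x'])):
        mirrored = (x, y)
        # x != y skips degenerate pairs such as (a, a), which mirror themselves
        if x != y and mirrored in seen:
            reciprocals.append((seen[mirrored], row))
        # Remember only the first occurrence of each ordered pair
        seen.setdefault((y, x), row)
    return reciprocals

# Illustrative data: row 1 duplicates row 0, rows 2 and 4 are reciprocals
demo = pd.DataFrame([['a', 'b'], ['a', 'b'], ['b', 'a'], ['c', 'd'], ['d', 'c']],
                    columns=['y', 'x'])
print(reciprocal_locator_seen(demo))  # [(0, 2), (3, 4)]
```

Note that when a pair occurs more than once, only its first occurrence is reported as the partner; whether that is what you want depends on how the downstream function handles duplicates.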
