在不使用循环的情况下在单个DataFrame中查找子行



目前我正试图从票证系统(Redmine(中提取大量数据。我的任务之一是从我感兴趣的票列表中找到所有的子票。由于孩子们和他们的父母形状相同,住在同一个DataFrame中,所以我不能使用pd.merge,而且孩子们自己也可以有孩子,我首先试着递归地找到他们,但很快就被绊倒了

最大递归深度超过错误

所以我的下一个方法是序列化该过程。不幸的是,这让我在嵌套循环中多次查找这样的票证,这对我的需求来说实在太慢了。因此,为了不让你的想象力进一步过度,这里有一个我正在研究的数据的可能例子:

id  project.id  status.id  priority.id  author.id  assigned_to.id  parent.id  category.id
0    18543          18          5            2         85            85.0    18203.0          NaN
1    18542          18          5            2         85            85.0    18538.0          NaN
2    18541          71          5            2         67            67.0    17788.0          NaN
3    18540          18          5            3        105            85.0        NaN        150.0
4    18539          17          5            2         81            81.0    18537.0          NaN
..     ...         ...        ...          ...        ...             ...        ...          ...
806  18257           4          1            2          3             NaN    16423.0          NaN
807  17738          11          1            2          3             NaN    17737.0          NaN
808  16017          65          2            2         81             NaN        NaN          NaN
809   2473          65         15            2          4             4.0        NaN          NaN
810  16423          65         18            2          3            18.0        NaN          NaN
[811 rows x 8 columns]

把它想象成一个层次树结构。正如您所看到的,从下到下遍历与其父级的id字段匹配的parent.id字段是非常容易的,但从上到下遍历并不是那么简单。

我想出的一个解决方案是:

def findChildren(issueId, issueFrame):
# clean data
safeElements = issueFrame.fillna(0)
children = safeElements[safeElements['parent.id'] == issueId]
childList = np.array(children['id'])
listLength = 0
# seek until list of children does not grow anymore
while listLength != len(childList):
listLength = len(childList)
for identifier in childList:
children = safeElements[safeElements['parent.id'] == identifier]
addToList = np.array(children['id'])
childList = np.append(childList, addToList)
childList = np.unique(childList)
return childList

它有效!但由于我不必只寻找一个问题,所以我需要几分钟的时间来创建我想要的所有孩子的列表。什么是更快的方法?结果不需要出现在子项列表中。我也很高兴有一个经过过滤的DataFrame,它把孩子们的一排排保存到他们的曾孙,等等。

代码中最大的性能瓶颈是数组匹配。搜索数组是O(n)操作。对另一个数组中的每个元素重复该操作,使操作O(n*m)。为了获得更快的结果,请查找字典,其查找时间始终为O(1)

有一种方法可以做到这一点,而不需要递归。试试这个:

def findChildren(issueId, issueFrame, cache=None):
# Create a dictionary which lists all children an `id` has
# The result is something like this:
# {
#   1: [2,3,4], # 1 has children 2, 3, 4
#   2: [5]      # 2 has child 5
# }
_cache = cache or issueFrame[['parent.id', 'id']].groupby('parent.id')['id'].apply(list).to_dict()
# We start with the supplied `issueId`
check_list = [issueId]
# The final list of children
ids = []
while len(check_list) > 0:
parent_id = check_list.pop()
# Get all children of the current `parent_id`...
children = _cache.get(parent_id, [])
# ... then check if they have an children too
check_list += children
# and add them to the list of children for the current ticket
ids += children
# Finally, extract those children from the DataFrame
return issueFrame[issueFrame['id'].isin(ids)]

最新更新