目前我正试图从票证系统(Redmine(中提取大量数据。我的任务之一是从我感兴趣的票列表中找到所有的子票。由于孩子们和他们的父母形状相同,住在同一个DataFrame中,所以我不能使用pd.merge
,而且孩子们自己也可以有孩子,我首先试着递归地找到他们,但很快就被绊倒了
最大递归深度超过错误
所以我的下一个方法是序列化该过程。不幸的是,这让我在嵌套循环中多次查找这样的票证,这对我的需求来说实在太慢了。因此,为了不让你的想象力进一步过度,这里有一个我正在研究的数据的可能例子:
id project.id status.id priority.id author.id assigned_to.id parent.id category.id
0 18543 18 5 2 85 85.0 18203.0 NaN
1 18542 18 5 2 85 85.0 18538.0 NaN
2 18541 71 5 2 67 67.0 17788.0 NaN
3 18540 18 5 3 105 85.0 NaN 150.0
4 18539 17 5 2 81 81.0 18537.0 NaN
.. ... ... ... ... ... ... ... ...
806 18257 4 1 2 3 NaN 16423.0 NaN
807 17738 11 1 2 3 NaN 17737.0 NaN
808 16017 65 2 2 81 NaN NaN NaN
809 2473 65 15 2 4 4.0 NaN NaN
810 16423 65 18 2 3 18.0 NaN NaN
[811 rows x 8 columns]
把它想象成一个层次树结构。正如您所看到的,从下到下遍历与其父级的id
字段匹配的parent.id
字段是非常容易的,但从上到下遍历并不是那么简单。
我想出的一个解决方案是:
def findChildren(issueId, issueFrame):
# clean data
safeElements = issueFrame.fillna(0)
children = safeElements[safeElements['parent.id'] == issueId]
childList = np.array(children['id'])
listLength = 0
# seek until list of children does not grow anymore
while listLength != len(childList):
listLength = len(childList)
for identifier in childList:
children = safeElements[safeElements['parent.id'] == identifier]
addToList = np.array(children['id'])
childList = np.append(childList, addToList)
childList = np.unique(childList)
return childList
它有效!但由于我不必只寻找一个问题,所以我需要几分钟的时间来创建我想要的所有孩子的列表。什么是更快的方法?结果不需要出现在子项列表中。我也很高兴有一个经过过滤的DataFrame,它把孩子们的一排排保存到他们的曾孙,等等。
代码中最大的性能瓶颈是数组匹配。搜索数组是O(n)
操作。对另一个数组中的每个元素重复该操作,使操作O(n*m)
。为了获得更快的结果,请查找字典,其查找时间始终为O(1)
。
有一种方法可以做到这一点,而不需要递归。试试这个:
def findChildren(issueId, issueFrame, cache=None):
# Create a dictionary which lists all children an `id` has
# The result is something like this:
# {
# 1: [2,3,4], # 1 has children 2, 3, 4
# 2: [5] # 2 has child 5
# }
_cache = cache or issueFrame[['parent.id', 'id']].groupby('parent.id')['id'].apply(list).to_dict()
# We start with the supplied `issueId`
check_list = [issueId]
# The final list of children
ids = []
while len(check_list) > 0:
parent_id = check_list.pop()
# Get all children of the current `parent_id`...
children = _cache.get(parent_id, [])
# ... then check if they have an children too
check_list += children
# and add them to the list of children for the current ticket
ids += children
# Finally, extract those children from the DataFrame
return issueFrame[issueFrame['id'].isin(ids)]