How to create a subindex efficiently

I want to create a subindex for my DataFrame based on its index. For example, I have a DataFrame like this:

      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01

I want to create a subindex within each ID, so that the resulting DataFrame looks like this:

               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01

To do this, I first need to build the list of subindex values. I searched the documentation, and the neatest way seems to be transform:

subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))

However, it is really slow. Looking around, I found that apply can also do the job:

subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))

Of course, subindex then needs to be flattened, because it is a list of lists. This is much faster than the transform approach. I then tested my own for loop:

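# Group sizes in order of first appearance; assumes each ID's rows are contiguous.
subindex_size = df.groupby(df.index, sort=False).size()
subindex = []
for n in subindex_size.to_numpy():  # iterate the sizes by position, not by label
    subindex.extend(np.arange(1, n + 1))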

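That is faster still. On a larger dataset (about 90k rows), the transform approach takes about 44 seconds on my machine, apply takes ~2 seconds, and the for loop only ~1 second. I need to work on much larger datasets, so even the gap between apply and the for loop matters to me. However, the for loop looks ugly, and it may not be easy to adapt if I need to create other group-based variables.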

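So my question is: why are the built-in functions, which are supposed to do the right thing, so slow? Am I missing something, or is there another reason? Is there any other way to improve this process?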
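For anyone who wants to reproduce the comparison, here is a rough sketch of a timing harness for the three approaches above. The synthetic data (500 IDs with 1 to 7 contiguous rows each) and the function names via_transform, via_apply and via_loop are assumptions made for illustration; the real dataset is not shown, and absolute timings will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (an assumption): 500 IDs, each with
# 1 to 7 contiguous rows, and a constant Date column.
rng = np.random.default_rng(0)
sizes = rng.integers(1, 8, size=500)
ids = np.repeat([f"id{i:04d}" for i in range(500)], sizes)
df = pd.DataFrame({"Date": ["2010.01.01"] * len(ids)},
                  index=pd.Index(ids, name="ID"))


def via_transform(frame):
    # One Python-level lambda call per group, aligned back to the rows.
    grouped = frame["Date"].groupby(frame.index, sort=False)
    return grouped.transform(lambda x: np.arange(1, len(x) + 1)).to_numpy()


def via_apply(frame):
    # apply returns one array per group, so the result must be flattened.
    grouped = frame["Date"].groupby(frame.index, sort=False)
    nested = grouped.apply(lambda x: np.arange(1, len(x) + 1))
    return np.concatenate(nested.tolist())


def via_loop(frame):
    # Explicit loop over group sizes; relies on contiguous rows per ID.
    group_sizes = frame.groupby(frame.index, sort=False).size()
    out = []
    for n in group_sizes.to_numpy():
        out.extend(np.arange(1, n + 1))
    return np.array(out)


for fn in (via_transform, via_apply, via_loop):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per run")
```

All three functions should return the same values on this data, so only the running time differs.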

You can do this with cumcount:

In [11]: df.groupby(level=0).cumcount()
Out[11]: 
ID
Bob    0
Bob    1
Tom    0
Tom    1
Tom    2
dtype: int64
In [12]: df['subindex'] = df.groupby(level=0).cumcount()  # possibly + 1 here.
In [13]: df.set_index('subindex', append=True)
Out[13]: 
               Content        Date
ID  subindex                      
Bob 0         birthday  2010.03.01
    1           school  2010.04.01
Tom 0         shopping  2010.02.01
    1             work  2010.09.01
    2          holiday  2010.10.01

To start from 1 (rather than 0), just add 1 to the result of cumcount.
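Putting the pieces together, a minimal self-contained sketch of the 1-based version might look like this. The helper name add_subindex is hypothetical (it is not part of pandas), and the frame is rebuilt by hand from the example in the question.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {
        "Content": ["birthday", "school", "shopping", "work", "holiday"],
        "Date": ["2010.03.01", "2010.04.01", "2010.02.01",
                 "2010.09.01", "2010.10.01"],
    },
    index=pd.Index(["Bob", "Bob", "Tom", "Tom", "Tom"], name="ID"),
)


def add_subindex(frame, name="subindex"):
    """Hypothetical helper: append a 1-based per-group counter as an index level."""
    out = frame.copy()  # leave the caller's frame untouched
    # cumcount numbers the rows within each group starting at 0, so add 1.
    out[name] = out.groupby(level=0).cumcount() + 1
    return out.set_index(name, append=True)


result = add_subindex(df)
print(result)
```

Here result has a two-level index (ID, subindex), with subindex restarting at 1 for each ID.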
