Counting rows that match a condition in a pandas DataFrame (using the sort order of the data if possible)

Dataset description:

I have a dataset that looks like this (sorted by key, which also means it is sorted by offset_time_stamp):

key       offset_time_stamp       person          7_sec_count
1         0                       A               0
2         0                       B               0
3         0                       A               1
4         1                       A               2
5         2                       B               1
6         7                       A               1
7         9                       B               0

There are about 20 million rows and about 4 million unique persons (each person has anywhere from 0 to 10k+ records within the past 7 seconds).

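For each row, I want to count the number of earlier rows in which the same person appeared within the past 7 seconds (measured using offset_time_stamp).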
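To pin down what is being counted, here is a minimal brute-force sketch that reproduces the 7_sec_count column of the sample above. The window rule (an earlier row counts when its timestamp is strictly greater than the current timestamp minus 7) and the use of key to order rows with equal timestamps are assumptions inferred from the sample values. The names `sample` and `reference_counts` are illustrative only. It is far too slow for real data, but it may be useful for checking a faster implementation on a small slice.

```python
import pandas as pd

# The seven sample rows from the question.
sample = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6, 7],
    "offset_time_stamp": [0, 0, 0, 1, 2, 7, 9],
    "person": ["A", "B", "A", "A", "B", "A", "B"],
})

def reference_counts(df, window=7):
    """Brute-force count of earlier rows for the same person inside the window."""
    counts = []
    for row in df.itertuples(index=False):
        earlier = df[
            (df["person"] == row.person)
            & (df["key"] < row.key)  # key breaks ties between equal timestamps
            & (df["offset_time_stamp"] > row.offset_time_stamp - window)
        ]
        counts.append(len(earlier))
    return counts

expected = reference_counts(sample)
```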

Here is what I tried:

def get_count(x):
  return [data[the_condition].count() for row in x.iterrows()]
counts = data.groupby(person).apply(get_count)

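This code took about 6 hours to run. I considered resampling at 1-second intervals, but that does not work: the dataset has several rows for the same person within the same second, and I have no sub-second (microsecond-level) data. When timestamps are equal, I need to use the key to break ties.

What I want to do now

I now want to repeat the same exercise with 100 million rows and increase the 7-second window to 7000 seconds. The existing code is expected to take about 5 days to run.

How can I make the computation faster? I would like it to finish in 2-3 hours so that I can analyze larger datasets. I am open to porting the solution to another language, or to using a pure NumPy solution instead of pandas.

Can I also take advantage of the sorted nature of the data? I tried taking a cumulative count after grouping by person and then subtracting the cumulative count of the last row that falls just outside the 7-second window. It took too long, probably because of the way I wrote it: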

data_cumcount = data.groupby(person).cumcount() + 1
def get_subtractor(current_ts, person, index):
    global data_cumcount
    to_consider = sorted(person_level_indices[person][:person_level_indices[person].index(index)], reverse=True)
    for index in to_consider:
        if data_dict[index] <= current_ts: # -7 was already done and passed to this function
            return data_cumcount[index]
    return 0
def get_count(x):
    global data_cumcount
    return data_cumcount[x.name] - get_subtractor(x[TIME_STAMP]-7, x[person], x.name) - 1
counts = data.apply(get_count, axis=1)

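My guess is that the for loop in get_subtractor is what makes this approach so slow.

I have tried a few other approaches, including recursion, but since my code does not seem to be very efficient, data.groupby(person).apply(get_count) is the best-performing approach so far.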

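EDIT: Using the sortedness of the timestamps, as suggested, together with the fact that searchsorted is vectorized, I was able to cut the execution time by a factor of 3.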

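import numpy as np
import pandas as pd

def get_count(dfg):
    # dfg holds one person's rows, already sorted by offset_time_stamp.
    # Current position minus the position of the first row inside the window.
    return pd.Series(
        np.arange(len(dfg)) - dfg.offset_time_stamp.searchsorted(dfg.offset_time_stamp - 6),
        index=dfg.index
    )
df['count'] = df.groupby('person').apply(get_count).reset_index(level=0, drop=True)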

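Note that searchsorted does not return index labels of the Series; it returns positions in the underlying numpy array. The count is therefore simply the difference between the current position (i.e. np.arange) and the position returned by searchsorted. Subtracting 6 rather than 7 works because the timestamps are integers and searchsorted defaults to side='left', so it finds the first row whose timestamp is at least the current timestamp minus 6, which is the same as strictly greater than the current timestamp minus 7.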
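A small worked example may make the position arithmetic easier to follow. It uses person A's four rows from the sample in the question; the variable names `ts`, `start` and `counts` are illustrative only, and the index values stand in for the key labels to show that they play no role in the result.

```python
import numpy as np
import pandas as pd

# Person A's timestamps; the index holds the key labels (1, 3, 4, 6).
ts = pd.Series([0, 0, 1, 7], index=[1, 3, 4, 6])

# For each row, the position of the first row whose timestamp is >= current - 6.
start = ts.searchsorted(ts.to_numpy() - 6)

# Rows between that position and the current position fall inside the window.
counts = np.arange(len(ts)) - start
```

Here `start` is `[0, 0, 0, 2]` and `counts` is `[0, 1, 2, 1]`, which matches the 7_sec_count values for keys 1, 3, 4 and 6.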

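Converting the result to a Series has some overhead, but I need it in order to assign the result back at the end. If you find a way to work directly with the numpy values, you could probably halve the time again.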
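One possible way to stay in numpy throughout is sketched below. This is not part of the measured solution above and has not been benchmarked; it is a sketch under stated assumptions. It assumes integer timestamps, a non-empty frame with no missing person values, and rows already in key order. The idea is to encode (person, timestamp) into a single sorted integer so that one global searchsorted call replaces the per-group calls. The function name `window_counts` and its `window` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def window_counts(df, window=7):
    """Per-row count of earlier same-person rows inside the window, numpy only."""
    codes, _ = pd.factorize(df["person"])           # integer code per person
    ts = df["offset_time_stamp"].to_numpy(dtype=np.int64)

    # Stable sort by person keeps the existing key/time order inside each person.
    order = np.argsort(codes, kind="stable")
    codes_s = codes[order].astype(np.int64)

    # Shift timestamps so every value is >= window; a query can then never
    # reach back into the previous person's block of composite values.
    ts_s = ts[order] - ts.min() + window
    span = ts_s.max() + 1
    composite = codes_s * span + ts_s               # sorted by (person, time)

    # First position with timestamp >= current - (window - 1), same person only.
    start = np.searchsorted(composite, composite - (window - 1), side="left")
    counts_sorted = np.arange(len(df)) - start

    # Scatter the results back into the original row order.
    counts = np.empty(len(df), dtype=np.int64)
    counts[order] = counts_sorted
    return counts

sample = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6, 7],
    "offset_time_stamp": [0, 0, 0, 1, 2, 7, 9],
    "person": ["A", "B", "A", "A", "B", "A", "B"],
})
result = window_counts(sample)
```

On the sample rows this gives the same values as the 7_sec_count column. With roughly 4 million persons and a 7000-second window, the composite values should still fit comfortably in int64 as long as the timestamp range is not extreme, but that is worth checking on the real data.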

Original answer:

There may be a solution along the lines of the cumcount you mentioned, but in the meantime this may already be faster:

def get_count(dfg):
    return dfg.apply(lambda row: dfg[dfg.offset_time_stamp > row['offset_time_stamp'] - 7].loc[:row.name-1].count(), axis=1)
df.groupby('person').apply(get_count)

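I use native slicing and selection instead of an iterrows loop. Note that when a function is applied row by row to a DataFrame, row.name is the row's index label in the original DataFrame.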
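The point about row.name can be checked on a toy frame. The data and the names `df`, `sub` and `names` below are made up for illustration.

```python
import pandas as pd

# A frame with a non-default index, so labels and positions differ.
df = pd.DataFrame({"v": [10, 20, 30]}, index=[5, 7, 9])

# Select a subset, then apply a function row by row.
sub = df[df.v > 10]
names = sub.apply(lambda row: row.name, axis=1).tolist()
```

Here `names` is `[7, 9]`: the labels from the original frame, not the positions 0 and 1 within the subset. That is why slicing with .loc[:row.name-1] selects the rows before the current one.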
