Faster way to filter a pandas DataFrame in a for loop on multiple conditions



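I am working with a large DataFrame (~10M rows) that contains dates and text data, and I have a list of values; I need to run some calculations for each value in that list.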

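For each value, I need to filter/subset the DataFrame on 4 conditions, do the calculation, and move on to the next value. Currently ~80% of the time is spent in the filtering block, which makes the processing time very long (several hours).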

What I currently have is this:

for val in unique_list:               # iterate on values in a list
  if val is not None or val != kip:   # as long as its an acceptable value
    for year_num in range(1, 6):      # split by years
      # filter and make intermediate df based on per value & per year calculation
      cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
      cond_2 = df[f'{kip}'].notna()
      cond_3 = df['Date'].dt.year < 2015 + year_num
      cond_4 = df['Date'].dt.year >= 2015 + year_num -1
      temp_df = df[cond_1 & cond_2 & cond_3 & cond_4].copy()

Condition 1 takes about 45% of the time, while conditions 3 and 4 take about 22% each.

Is there a better way to implement this? Is there a way to drop `.dt` and `.str` and use something faster?

Timing on 3 values (out of thousands):

Total time: 16.338 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def get_word_counts(df, kip, unique_list):
     2                                             # to hold predictors
     3         1       1929.0   1929.0      0.0    predictors_df = pd.DataFrame(index=[f'{kip}'])
     4         1          2.0      2.0      0.0    n = 0
     5
     6         3          7.0      2.3      0.0    for val in unique_list:               # iterate on values in a list
     7         3          3.0      1.0      0.0      if val is not None or val != kip:   # as long as its an acceptable value
     8        18         39.0      2.2      0.0        for year_num in range(1, 6):      # split by years
     9
    10                                                   # filter and make intermediate df based on per value & per year calculation
    11        15    7358029.0 490535.3     45.0          cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
    12        15     992250.0  66150.0      6.1          cond_2 = df[f'{kip}'].notna()
    13        15    3723789.0 248252.6     22.8          cond_3 = df['Date'].dt.year < 2015 + year_num
    14        15    3733879.0 248925.3     22.9          cond_4 = df['Date'].dt.year >= 2015 + year_num -1

The data mostly looks like this (I only use the relevant columns when doing the calculations):

    Date        Ingredient
20  2016-07-20  Magnesium
21  2020-02-18  <NA>
22  2016-01-28  Apple;Cherry;Lemon;Olives General;Peanut Butter
23  2015-07-23  <NA>
24  2018-01-11  <NA>
25  2019-05-30  Egg;Soy;Unspecified Egg;Whole Eggs
26  2020-02-20  Chocolate;Peanut;Peanut Butter
27  2016-01-21  Raisin
28  2020-05-11  <NA>
29  2020-05-15  Chocolate
30  2019-08-16  <NA>
31  2020-03-28  Chocolate
32  2015-11-04  <NA>
33  2016-08-21  <NA>
34  2015-08-25  Almond;Coconut
35  2016-12-18  Almond
36  2016-01-18  <NA>
37  2015-11-18  Peanut;Peanut Butter
38  2019-06-04  <NA>
39  2016-04-08  <NA>

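So, it looks like you really just want to split by the year of the 'Date' column and do something with each group. Also, for a large df, it is usually faster to filter once, up front, everything that can be filtered, end up with a smaller frame (one year of data in your example), and then do all the looping/extraction on that smaller df.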
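As a minimal sketch of that idea, assuming a toy frame that borrows the question's column names (the values, and the `results` dict, are made up for illustration): the row filter and the year extraction each run once on the full frame, and the per-value matching only ever sees one year's rows.

```python
import re

import pandas as pd

# Toy stand-in for the real data; only the column names come from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-07-23", "2016-01-28", "2016-07-20", "2019-05-30", "2021-01-01"]
    ),
    "Ingredient": [None, "Apple;Cherry;Lemon", "Magnesium", "Egg;Soy", "Cherry"],
})
kip = "Ingredient"
unique_list = ["Cherry", "Egg", None]

# Done once, on the full frame: drop rows that can never match.
keep = df[kip].notna() & df["Date"].dt.year.between(2015, 2019)
small = df[keep]

# Clean the value list once as well ('and', not 'or').
vals = [str(v) for v in unique_list if v is not None and v != kip]

results = {}
# One pass over the years; each group is already much smaller than df.
for y, per_year in small.groupby(small["Date"].dt.year):
    for val in vals:
        mask = per_year[kip].str.contains(re.escape(val), na=False)
        results[(int(y), val)] = per_year[mask]
```

Years with no surviving rows simply do not appear as groups, so the inner loop never runs on an empty frame.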

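Without knowing more about the data itself (C-contiguous? F-contiguous? Sorted by date?), it is hard to be sure, but I would guess that the following may be faster (and, IMHO, it also feels more natural):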

import re
import pandas as pd

# 1. do everything you can outside the loop
# 1.a prep your patterns
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
# you meant 'and', not 'or', right?
# 1.b filter and sort the data (why sort? better mem locality)
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# 2. do one groupby by year
# (pandas >= 2.2 deprecates freq='Y' in favor of freq='YE')
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year  # optional, if you need it
    # 2.b reuse each group as much as possible
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
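One further option for the `.str.contains` cost: since every value goes through `re.escape`, the pattern is really a literal substring, and `Series.str.contains` accepts `regex=False` for exactly that case. The sketch below only shows that the two calls select the same rows on a made-up Series; whether the literal form is actually faster on your data is an assumption to check by timing.

```python
import re

import pandas as pd

# Made-up values; the last one contains regex metacharacters on purpose.
s = pd.Series(["Peanut;Peanut Butter", "Almond;Coconut", None, "a.b (c)"])
val = "a.b (c)"

# Pattern used in the question: escape the value, then match it as a regex.
via_regex = s.str.contains(re.escape(val), na=False)

# Literal substring match: no escaping needed, no regex engine involved.
via_literal = s.str.contains(val, na=False, regex=False)

same = via_regex.tolist() == via_literal.tolist()
```

If you go this route, the loop would iterate over the raw string values instead of `escaped_vals`.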

Example (guessing at some data, really):

import re
from collections import defaultdict

import numpy as np
import pandas as pd

n = 10_000_000
str_examples = ['hello', 'world', 'hi', 'roger', 'kilo', 'zulu', None]
df = pd.DataFrame({
    'Date': [pd.Timestamp('2010-01-01') + k*pd.Timedelta('1 day') for k in np.random.randint(0, 3650, size=n)],
    'x': np.random.randint(0, 1200, size=n),
    'foo': np.random.choice(str_examples, size=n),
    'bar': np.random.choice(str_examples, size=n),
})
unique_list = ['rld', 'oger']
kip = 'foo'
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]

%%time
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# CPU times: user 1.67 s, sys: 124 ms, total: 1.79 s

%%time
out = defaultdict(dict)
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        out[year].update({escval: temp_df})
# CPU times: user 2.64 s, sys: 0 ns, total: 2.64 s

Quick sniff test:

>>> out.keys()
dict_keys([2015, 2016, 2017, 2018, 2019])
>>> out[2015].keys()
dict_keys(['rld', 'oger'])
>>> out[2015]['oger'].shape
(142572, 4)
>>> out[2015]['oger'].tail()
              Date    x    foo    bar
3354886 2015-12-31  409  roger  hello
8792739 2015-12-31  474  roger   zulu
3944171 2015-12-31  310  roger     hi
7578485 2015-12-31  125  roger   None
2963220 2015-12-31  809  roger     hi
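A stricter check than eyeballing shapes is to confirm, on a small frame, that the prefilter-then-group route selects exactly the rows the original four-condition filter does. This is only a sketch under assumptions: the data is random (seeded), the frame is much smaller than the real one, the names `expected`, `got` and `by_year` are made up here, and the grouping uses `dt.year` directly rather than `pd.Grouper`.

```python
import re

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000
labels = np.array(["hello", "world", "roger", None], dtype=object)
df = pd.DataFrame({
    "Date": pd.Timestamp("2010-01-01")
            + pd.to_timedelta(rng.integers(0, 3650, size=n), unit="D"),
    "foo": rng.choice(labels, size=n),
})
kip, val, year = "foo", "oger", 2016

# Question style: four boolean conditions evaluated on the full frame.
expected = df[
    df[kip].str.contains(re.escape(val), na=False)
    & df[kip].notna()
    & (df["Date"].dt.year < year + 1)
    & (df["Date"].dt.year >= year)
]

# Answer style: prefilter and sort once, then work on a single year's group.
in_range = (df["Date"] >= pd.Timestamp("2015-01-01")) & (df["Date"] < pd.Timestamp("2021-01-01"))
z = df[df[kip].notna() & in_range].sort_values("Date")
by_year = {int(y): g for y, g in z.groupby(z["Date"].dt.year)}
dfy = by_year[year]
got = dfy[dfy[kip].str.contains(re.escape(val), na=False)]

# Same rows, possibly in a different order, so compare after sorting by index.
assert got.sort_index().equals(expected.sort_index())
```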
