Pandas DataFrame is inefficient when looping over columns



I have precipitation data for each cell and date (1800 rows, 15k columns).

                         486335  486336  486337
2019-07-03 13:35:54.445       0       2      22
2019-07-04 13:35:54.445       0       1       1
2019-07-05 13:35:54.445      16       8      22
2019-07-06 13:35:54.445       0       0       0
2019-07-07 13:35:54.445       0      11       0

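I want to find the dates on which a specific amount of rain (> 15 mm) was reached, and count the days after that event on which little rain (< 1.1 mm) fell. The result should be stored in a new DataFrame together with the amount of rain, the start and end of the period, the cell, and other information.

I wrote a for loop that does this job, but it takes several days to finish ;(. I am a Python beginner, so hints for other approaches are welcome.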

import pandas as pd

# Existing data
index_dates = pd.date_range(pd.Timestamp.today(), periods=10).tolist()
df = pd.DataFrame({'486335': [0, 0, 16, 0, 0, 0, 2, 1, 8, 2],
                   '486336': [2, 1, 8, 0, 11, 16, 0, 1, 6, 8],
                   '486337': [22, 1, 22, 0, 0, 0, 5, 3, 6, 1]}, index=index_dates)
columns = df.columns
counter_columns = 0
iteration = -1  # iteration steps
counter = 10  # 10 precipitation values per column
duration = 0  # days with no or less than pp_max_1 rain
count = False
index_list = df.index  # index for updating df
period_range = 0  # number of days after the event without much rain (integer)
period_amount = 0  # amount of precipitation on dry days, excluding the event (integer)
event_amount = 0.0  # amount of heavy rainfall on the event date (float)
pp = 0  # actual precipitation
pp_sum = 0.0  # mm
pp_min = 15.0  # mm, minimum precipitation that starts the count of dry days
pp_max_1 = 0.11  # maximum precipitation for one day while counting dry days
dry_days = 0  # dry days after the event
# dry_periods is expected to exist already as a DataFrame (its definition is not shown).
# DataFrame.append, used below, requires pandas < 2.0.
for x in df:
    for y in df[x]:
        iteration = iteration + 1
        if iteration == counter:
            iteration = 0
            counter_columns = counter_columns + 1
            print("column :", counter_columns, "finished")
        if y >= pp_min and count == False:
            duration = duration + 1
            count = True
            start_period = index_list[iteration]
            event_amount = y
            index = iteration
            pp_sum = pp_sum + y
        elif y >= pp_min and count == True or y >= pp_max_1 and count == True:
            end_period = index_list[iteration]
            dry_periods = dry_periods.append({"start_period": start_period, "end_period": end_period, "period_range": duration, "period_amount": pp_sum, "event_amount": event_amount, "cell": columns[counter_columns]}, ignore_index=True).sort_values('period_range', ascending=False)
            duration = 0
            count = False
            pp_sum = 0
        # pp is never reassigned (always 0), so this branch runs whenever count is True
        elif pp <= pp_max_1 and count == True:
            duration = duration + 1
            pp_sum = pp_sum + y
        else:
            continue
print(dry_periods)

The output looks like this:

              start_period              end_period  period_range  period_amount  event_amount    cell
0  2019-07-05 13:15:05.545 2019-07-09 13:15:05.545             4           16.0            16  486335
1  2019-07-05 13:15:05.545 2019-07-09 13:15:05.545             4           22.0            22  486337
2  2019-07-05 13:15:36.569 2019-07-09 13:15:36.569             4           16.0            16  486335
3  2019-07-05 13:15:36.569 2019-07-09 13:15:36.569             4           22.0            22  486337
4  2019-07-05 13:16:16.372 2019-07-09 13:16:16.372             4           16.0            16  486335
5  2019-07-05 13:16:16.372 2019-07-09 13:16:16.372             4           22.0            22  486337

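You could avoid iterating over the rows, because that does not scale well to large DataFrames.

Here is a different approach. I am not sure whether it will be more efficient on your full DataFrame: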

import numpy as np

periods = []
for cell in df.columns:
    sub = pd.DataFrame({'amount': df[cell].values}, index=df.index)
    # (0.11, 15] -> 0 (ordinary rain), (15, inf] -> 1 (heavy rain); drier days -> NaN
    sub['flag'] = pd.cut(sub['amount'], [0.11, 15, np.inf],
                         labels=[0, 1]).astype(float)
    # number the heavy-rain events 1, 2, 3, ...
    sub.loc[sub.flag > 0, 'flag'] = sub.loc[sub.flag > 0, 'flag'].cumsum()
    # dry days inherit the label of the preceding rainy day
    sub['flag'] = sub['flag'].ffill()
    # one row per event: first day, last day and total precipitation
    x = sub[sub.flag > 0].reset_index().groupby('flag').agg(
        {'index': ['min', 'max'], 'amount': 'sum'})
    x.columns = ['start', 'end', 'amount']
    x['period_range'] = (x.end - x.start).dt.days + 1
    x['cell'] = cell
    x = x.reindex(columns=['start', 'end', 'period_range', 'amount', 'cell'])
    periods.append(x)
result = pd.concat(periods).reset_index(drop=True)
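
To make the labelling step of this approach easier to follow, here is a minimal sketch that applies it to a single column. The values are those of column '486335' from the question; the fixed start date and the names `amount`, `flag` and `is_event` are only assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# Sample values of column '486335'; the start date is an arbitrary assumption.
idx = pd.date_range("2019-07-03", periods=10, freq="D")
amount = pd.Series([0, 0, 16, 0, 0, 0, 2, 1, 8, 2], index=idx, dtype=float)

# (0.11, 15] -> 0 (ordinary rain), (15, inf] -> 1 (heavy rain event).
# Values at or below 0.11 fall outside every bin and become NaN (dry day).
flag = pd.cut(amount, [0.11, 15, np.inf], labels=[0, 1]).astype(float)

# Give each heavy-rain event its own running number: 1, 2, 3, ...
is_event = flag > 0
flag.loc[is_event] = flag.loc[is_event].cumsum()

# Dry days (NaN) inherit the label of the most recent rainy day.
flag = flag.ffill()

# Rows with flag > 0 now belong to an event and its following dry days.
print(flag[flag > 0])
```

After these steps the event day and the three dry days that follow it share the label 1.0, while the days that follow ordinary rain carry the label 0.0 and are filtered out by `flag > 0`.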

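Since I don't have your whole dataset, I can't really say what consumes the time, but I guess it is the index access when you fetch the periods and the sort operation you perform inside the loop. Maybe you would like to try the following code. It should be logically equivalent to your code, apart from a few changes: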

duration = 0  # days with no or less than pp_max_1 rain
count = False
index_list = df.index  # index for updating df
period_range = 0  # number of days after the event without much rain (integer)
period_amount = 0  # amount of precipitation on dry days, excluding the event (integer)
event_amount = 0.0  # amount of heavy rainfall on the event date (float)
pp = 0  # actual precipitation
pp_sum = 0.0  # mm
pp_min = 15.0  # mm, minimum precipitation that starts the count of dry days
pp_max_1 = 0.11  # maximum precipitation for one day while counting dry days
dry_days = 0  # dry days after the event
dry_periods = list()
for counter_columns, column in enumerate(df.columns, 1):
    for period, y in df[column].items():
        if not count and y >= pp_min:
            duration += 1
            count = True
            start_period = period
            event_amount = y
            pp_sum += y
        elif count and (y >= pp_min or y >= pp_max_1):
            end_period = period
            dry_periods.append({
                "start_period":  start_period,
                "end_period":    end_period,
                "period_range":  duration,
                "period_amount": pp_sum,
                "event_amount":  event_amount,
                "cell":          column})
            duration = 0
            count = False
            pp_sum = 0
        # pp is never reassigned (always 0), so this branch runs whenever count is True
        elif count and pp <= pp_max_1:
            duration += 1
            pp_sum += y
    print("column :", counter_columns, "finished")
# sort once, longest period first, like sort_values(ascending=False) in the question
dry_periods.sort(key=lambda record: record['period_range'], reverse=True)
print(dry_periods)

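The changes are:

  • Removed the index_list[iteration] access, which I think might cost some time
  • Removed the whole iteration-counter logic, because what depends on it can be placed outside the inner loop; this makes the inner loop smaller, although it probably won't really improve performance
  • The comparison count == True is not necessary; you can write count directly in the if clause
  • Changed the increment and sum logic from var = var + num to var += num (this is probably a matter of taste; you can skip it if you like, it has no real impact on performance)
  • Moved the sorting of dry_periods outside the loop, because as far as I can see your loop logic does not depend on the collection being sorted; this may well be the change with the biggest impact on performance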
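
If the result is wanted as a DataFrame rather than a list, the collected dicts can be converted once, after the loops have finished. The sketch below assumes hypothetical records shaped like the ones built in the loop; the values and the name `dry_periods_df` are made up for illustration.

```python
import pandas as pd

# Hypothetical records with the same keys as the dicts collected in the loop.
records = [
    {"start_period": pd.Timestamp("2019-07-05"), "end_period": pd.Timestamp("2019-07-09"),
     "period_range": 4, "period_amount": 16.0, "event_amount": 16, "cell": "486335"},
    {"start_period": pd.Timestamp("2019-07-03"), "end_period": pd.Timestamp("2019-07-04"),
     "period_range": 1, "period_amount": 22.0, "event_amount": 22, "cell": "486337"},
    {"start_period": pd.Timestamp("2019-07-05"), "end_period": pd.Timestamp("2019-07-09"),
     "period_range": 4, "period_amount": 22.0, "event_amount": 22, "cell": "486337"},
]

# Build the frame once and sort it once, instead of appending and
# re-sorting a growing DataFrame on every detected period.
dry_periods_df = (
    pd.DataFrame(records)
    .sort_values("period_range", ascending=False)
    .reset_index(drop=True)
)
print(dry_periods_df)
```

Appending to a plain list is cheap, whereas appending to a DataFrame copies the existing data each time, so deferring the conversion and the sort keeps the work per detected period small.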

By the way, since I don't know how dry_periods is defined, I simply treated it as a list. Please also have a look at the condition

elif count and (y >= pp_min or y >= pp_max_1):

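above. It looks suspicious to me, but it is just the rewritten condition from your program. If it is what you intend, you can drop one of the comparisons: with the values in your code, pp_max_1 (0.11) is smaller than pp_min (15.0), so y >= pp_min already implies y >= pp_max_1 and the condition reduces to count and y >= pp_max_1.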
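
As a quick check of that condition, the following sketch compares it with a version that keeps only the `y >= pp_max_1` comparison, using the thresholds from the code (pp_min = 15.0, pp_max_1 = 0.11). The function names and sample values are assumptions made for this illustration.

```python
pp_min = 15.0    # threshold that starts an event
pp_max_1 = 0.11  # maximum precipitation that still counts as a dry day

def original_condition(count, y):
    # The condition as written in the loop.
    return count and (y >= pp_min or y >= pp_max_1)

def simplified_condition(count, y):
    # Because pp_max_1 < pp_min, y >= pp_min can never be true
    # while y >= pp_max_1 is false, so one comparison is enough.
    return count and y >= pp_max_1

# Sample precipitation values around both thresholds.
samples = [0.0, 0.05, 0.11, 0.5, 14.9, 15.0, 22.0]
mismatches = [
    (count, y)
    for count in (True, False)
    for y in samples
    if original_condition(count, y) != simplified_condition(count, y)
]
print(mismatches)
```

An empty list of mismatches for these samples is consistent with the two conditions being interchangeable as long as pp_max_1 stays below pp_min.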
