数据框架pandas中跨列移动数据的更快方式



我有一个数据帧分为变量(a,b)和时间值(1,5)。数据框的列是变量和时间值(&;a_1&;)的组合。然而,我需要把这些绝对的时间值转换成相对值。为此,我有另一个带有参考指示符的数据帧,指示要移动多少时间值。

因此,我想根据引用指示符移动值的位置,该指示符由另一个数据帧表示,并且根据索引变化。

EX:如果对于一个特定的索引,参考指标是3,我希望该索引中的数据向左移动,直到位置3变为a_1(所以它移动了2(3-1)个位置),例如:原:

a_1       a_2       a_3      a_4       a_5
0  0.854592  0.677819  0.071725  0.29312  0.948375

转移:

a_1      a_2       a_3  a_4  a_5
0  0.071725  0.29312  0.948375  NaN  NaN

我已经创建了下面的代码,成功地实现了预期的结果,但是它需要很多时间来计算(我实际上是用100k索引进行测试)。

我将非常感谢任何优化代码的帮助。

复制代码:

import numpy as np
import pandas as pd
# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(1000,10),index=range(0,1000))
df_example.columns = [var_name +"_"+str(j) for var_name in var_names for j in range(1, 6)]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (1000,1)),index=range(0,1000), columns = ['moving_indicator'])
list_vars_shifted = []
for var in var_names:
df_vars_shifted = pd.concat([df_ref.loc[:,'moving_indicator'],
df_example.filter(like=var)], axis = 1)

# Shift accoording to month indicator (hence +1) - SLOW
df_vars_shifted = (df_vars_shifted.apply(lambda x : x.shift(-(int(x['moving_indicator']))+1) , axis=1)
.drop(columns=['moving_indicator']))

list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)

这个怎么样?我没有进行计时测试,因为我没时间了。我输出了一些循环数据帧的打印结果来显示正在发生的事情。我将移动指示器改为0表示不移动,周期=可以是0,所以它不会移位。根据数据的不同,.replace可能是危险的,所以它有点粗糙。

import numpy as np
import pandas as pd
# main data to be shifted
df_a = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
df_a.columns = [
var_name +"_"+str(j) for var_name in ['a'] for j in range(1, 6)
]
df_b = pd.DataFrame(np.random.rand(1000,5),index=range(0,1000))
df_b.columns = [
var_name +"_"+str(j) for var_name in ['b'] for j in range(1, 6)
]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(
np.random.randint(0, 4, size=(1000,1)),
index=range(0,1000),
columns=['moving_indicator']
)
df_a = df_a.merge(df_ref, how='inner', left_index=True, right_index=True)
df_grp = df_a.groupby('moving_indicator')
new_df_a = pd.DataFrame([])
for indicator, gdf in df_grp:
indicator
indicator = indicator * -1
gdf.shift(periods=indicator, axis=1)
gdf = gdf.shift(periods=indicator, axis=1)
new_df_a = pd.concat([new_df_a, gdf])

new_df_a = new_df_a.sort_index()
new_df_a = (
new_df_a.replace({3: np.nan, 2: np.nan, 1: np.nan})
.drop('moving_indicator', axis=1)
)
df_b = df_b.merge(df_ref, how='inner', left_index=True, right_index=True)
df_grp = df_b.groupby('moving_indicator')
new_df_b = pd.DataFrame([])
for indicator, gdf in df_grp:
indicator
indicator = indicator * -1
gdf.shift(periods=indicator, axis=1)
gdf = gdf.shift(periods=indicator, axis=1)
new_df_b = pd.concat([new_df_b, gdf])

new_df_b = new_df_b.sort_index()
new_df_b = (
new_df_b.replace({3: np.nan, 2: np.nan, 1: np.nan})
.drop('moving_indicator', axis=1)
)
final_df = new_df_a.merge(
new_df_b, how='inner', left_index=True, right_index=True
)

编辑:以下是时间安排。版本问题:

>>> print(timeit.repeat(dummy, repeat=5, number=1))
[0.1520585000034771, 0.1450397999942652, 0.1416596999988542,
0.14743759999691974, 0.14560850000270875]

我的版本:

>>> print(timeit.repeat(my_func, repeat=5, number=1))
[0.022981900001468603, 0.0159782000046107, 0.01633900000160793,
0.015842399996472523, 0.01663669999834383]

我尝试了不同的方法,最好的方法是使用列表推导+ shift+ dataframe.where():

var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name +"_"+str(j) for var_name in var_names for j in range(1, 11)]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
list_shifted_variables = [df_vars.shift(-(indicator)+1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)

所有不同方法的完整代码:

import numpy as np
import pandas as pd
import swifter
import time
t1 = time.process_time()
# main data to be shifted
var_names = ['a','b']
df_example = pd.DataFrame(np.random.rand(10000,20),index=range(0,10000))
df_example.columns = [var_name +"_"+str(j) for var_name in var_names for j in range(1, 11)]
# reference index to determine how many places to be shifted
df_ref = pd.DataFrame(np.random.randint(1,5, size = (10000,1)),index=range(0,10000), columns = ['moving_indicator'])
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
df_example.filter(like=var)], axis = 1)

# Shift accoording to month indicator (hence +1) - SLOW
df_vars_shifted = (df_vars.apply(lambda x : x.shift(-(int(x['moving_indicator']))+1) , axis=1)
.drop(columns=['moving_indicator']))

list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted = pd.concat(list_vars_shifted, axis=1)

elapsed_time1 = time.process_time() - t1
print(elapsed_time1)


t2 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'],
df_example.filter(like=var)], axis = 1)

# Shift accoording to month indicator (hence +1) - SLOW
df_vars_shifted = (df_vars.swifter.apply(lambda x : x.shift(-(int(x['moving_indicator']))+1) , axis=1)
.drop(columns=['moving_indicator']))

list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted_2 = pd.concat(list_vars_shifted, axis=1)

elapsed_time2 = time.process_time() - t2
print(elapsed_time2)

t3 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

# Shift accoording to month indicator (hence +1) - SLOW
df_vars_shifted = pd.DataFrame([df_vars.iloc[i].shift(-(int(df_vars.iloc[i,0]))+1) for i in range(len(df_vars))]).drop(columns=['moving_indicator'])

list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted_3 = pd.concat(list_vars_shifted, axis=1)

elapsed_time3 = time.process_time() - t3
print(elapsed_time3)

t4 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)

# Shift accoording to month indicator (hence +1) - SLOW
df_vars_shifted = pd.DataFrame(row[1].shift(-(int(row[1]['moving_indicator']))+1) for row in df_vars.iterrows()).drop(columns=['moving_indicator'])

list_vars_shifted.append(df_vars_shifted)
# Convert to dataframe
df_all_vars_shifted_4 = pd.concat(list_vars_shifted, axis=1)

elapsed_time4 = time.process_time() - t4
print(elapsed_time4)

t5 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
list_test = []
for indicator in np.unique(df_vars['moving_indicator']):
df_test10 = df_vars.shift(-(indicator)+1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all')
list_test.append(df_test10)

df_vars_shifted = pd.concat(list_test).sort_index().drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
df_all_vars_shifted_5 = pd.concat(list_vars_shifted, axis=1)
elapsed_time5 = time.process_time() - t5
print(elapsed_time5)

t6 = time.process_time()
list_vars_shifted = []
for var in var_names:
df_vars = pd.concat([df_ref.loc[:,'moving_indicator'], df_example.filter( like = var )], axis = 1)
list_shifted_variables = [df_vars.shift(-(indicator)+1, axis = 1).where(indicator == df_vars['moving_indicator']).dropna( how = 'all') for indicator in np.unique(df_vars['moving_indicator'])]
df_vars_shifted = pd.concat(list_shifted_variables).sort_index().drop(columns=['moving_indicator'])
list_vars_shifted.append(df_vars_shifted)
df_all_vars_shifted_6 = pd.concat(list_vars_shifted, axis=1)
elapsed_time6 = time.process_time() - t6
print(elapsed_time6)

相关内容

  • 没有找到相关文章

最新更新