Pandas: compare a value with the previous row under a filter condition

I have a DataFrame containing information about employee salaries. It has more than 900,000 rows.

Sample:

+----+-------------+---------------+----------+
|    |   table_num | name          |   salary |
|----+-------------+---------------+----------|
|  0 |      001234 | John Johnson  |     1200 |
|  1 |      001234 | John Johnson  |     1000 |
|  2 |      001235 | John Johnson  |     1000 |
|  3 |      001235 | John Johnson  |     1200 |
|  4 |      001235 | John Johnson  |     1000 |
|  5 |      001235 | Steve Stevens |     1000 |
|  6 |      001236 | Steve Stevens |     1200 |
|  7 |      001236 | Steve Stevens |     1200 |
|  8 |      001236 | Steve Stevens |     1200 |
+----+-------------+---------------+----------+

dtypes:

table_num: string
name: string
salary: float

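I need to add a column that indicates whether the salary increased or decreased compared with the previous row. I am using the shift() function to compare values between rows.

The main problem is filtering the whole dataset and iterating over every unique employee.

My script takes about three and a half hours.

How can I make it faster?

My script: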

# keep only the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can occur for different 'name' values
# and the same name sometimes appears with a different 'table_num'
names_df = df[['table_num', 'name']].drop_duplicates()
# then extract each particular table_num and name
for i in range(len(names_df)):    ### Bottleneck of the whole script ###
    t = names_df.iloc[i, 0]
    n = names_df.iloc[i, 1]
    # use shift() and a lambda to check whether two consecutive rows differ
    diff_sal = (df[(df['table_num']==t)
                   & (df['name']==n)]['salary'] - df[(df['table_num']==t)
                   & (df['name']==n)]['salary'].shift(1)).apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values

Sample input data:

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'], 
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'], 
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

Expected output:

+----+-------------+---------------+----------+-------+
|    |   table_num | name          |   salary |   inc |
|----+-------------+---------------+----------+-------|
|  0 |      001234 | John Johnson  |     1200 |     0 |
|  1 |      001234 | John Johnson  |     1000 |    -1 |
|  2 |      001235 | John Johnson  |     1000 |     0 |
|  3 |      001235 | John Johnson  |     1200 |     1 |
|  4 |      001235 | John Johnson  |     1000 |    -1 |
|  5 |      001235 | Steve Stevens |     1000 |     0 |
|  6 |      001236 | Steve Stevens |     1200 |     0 |
|  7 |      001236 | Steve Stevens |     1200 |     0 |
|  8 |      001236 | Steve Stevens |     1200 |     0 |
+----+-------------+---------------+----------+-------+

Use groupby together with diff:

df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
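To see why the fillna(0.0) step is needed, it may help to look at the raw per-group differences first. The sketch below reuses the sample data from the question; the variable names raw and group_starts are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

# diff() is computed separately inside each (table_num, name) group,
# so the first row of every group has no predecessor and comes back as NaN.
raw = df.groupby(['table_num', 'name'])['salary'].diff()
group_starts = raw.isna()

print(raw.tolist())
# [nan, -200.0, nan, 200.0, -200.0, nan, nan, 0.0, 0.0]
print(group_starts.tolist())
# [True, False, True, False, False, True, True, False, False]
```

Filling those NaN values with 0 is what marks the first row of each employee as "no change".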

Use DataFrameGroupBy.diff with numpy.sign, then cast the result to integers:

import numpy as np

df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print(df)
   table_num           name  salary  new
0       1234   John Johnson    1200    0
1       1234   John Johnson    1000   -1
2       1235   John Johnson    1000    0
3       1235   John Johnson    1200    1
4       1235   John Johnson    1000   -1
5       1235  Steve Stevens    1000    0
6       1236  Steve Stevens    1200    0
7       1236  Steve Stevens    1200    0
8       1236  Steve Stevens    1200    0
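Before replacing the loop on the full 900,000-row frame, one option is to check on a small synthetic frame that both versions agree. This is only a sketch: the helper names inc_loop and inc_vectorized and the random test data are made up for illustration, and the loop is a simplified rewrite of the one in the question.

```python
import numpy as np
import pandas as pd

def inc_loop(df):
    """Reference version: one boolean filter per unique (table_num, name) pair."""
    out = pd.Series(0, index=df.index, dtype=int)
    pairs = df[['table_num', 'name']].drop_duplicates()
    for t, n in pairs.itertuples(index=False):
        sal = df.loc[(df['table_num'] == t) & (df['name'] == n), 'salary']
        out.loc[sal.index] = np.sign(sal.diff().fillna(0)).astype(int)
    return out

def inc_vectorized(df):
    """Single groupby pass over the whole frame."""
    d = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)
    return np.sign(d).astype(int)

# Hypothetical test data: a few keys and salary levels, 2,000 rows.
rng = np.random.default_rng(0)
n_rows = 2_000
test_df = pd.DataFrame({
    'table_num': rng.integers(0, 50, n_rows).astype(str),
    'name': rng.choice(['A', 'B', 'C', 'D'], n_rows),
    'salary': rng.choice([1000., 1200., 1500.], n_rows),
})

same = bool((inc_loop(test_df).to_numpy() == inc_vectorized(test_df).to_numpy()).all())
print(same)
```

The loop scans the whole frame once per unique pair, while the groupby version makes a single pass, which is where the speed-up on large data comes from.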

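shift() is the way to go, but you should avoid loops wherever possible. Here we can take advantage of groupby() and transform(). See the pandas documentation.

In your case, you can do it by writing the following: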

df.assign(inc=lambda x: x.groupby(['table_num', 'name'])
                         .salary
                         .transform(lambda y: y - y.shift(1))
                         .apply(lambda v: 1 if v>0 else (-1 if v<0 else 0))
          )

This yields:

  table_num           name  salary  inc
0    001234   John Johnson  1200.0    0
1    001234   John Johnson  1000.0   -1
2    001235   John Johnson  1000.0    0
3    001235   John Johnson  1200.0    1
4    001235   John Johnson  1000.0   -1
5    001235  Steve Stevens  1000.0    0
6    001236  Steve Stevens  1200.0    0
7    001236  Steve Stevens  1200.0    0
8    001236  Steve Stevens  1200.0    0
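The same shift-based idea can also be written with the built-in per-group shift, which avoids calling a Python lambda for every group and every row. A minimal sketch, assuming the grouping should use both table_num and name as in the question's loop (the names keys, prev and delta are illustrative):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

keys = ['table_num', 'name']

# Previous salary of the same employee; NaN on the first row of each group.
prev = df.groupby(keys)['salary'].shift(1)

# Treat "no previous row" as "no change", then keep only the sign.
delta = (df['salary'] - prev).fillna(0)
result = df.assign(inc=np.sign(delta).astype(int))

print(result['inc'].tolist())
# [0, -1, 0, 1, -1, 0, 0, 0, 0]
```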

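I think you could search for the term "pandas vectorization" to find ways of speeding up DataFrame operations. For your problem, could you try the following: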

import pandas as pd
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
df['temp'] = df['name'] + df['table_num']
df.sort_values('temp', inplace=True)
df['diff'] = df.groupby('temp')['salary'].diff()
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)
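One thing to keep in mind with a concatenated key such as temp is that two different (name, table_num) pairs can in principle produce the same string. That is unlikely with the sample data, where names contain no digits, but passing both columns to groupby rules it out. The rows below are hypothetical and chosen only to trigger the collision:

```python
import pandas as pd

# Hypothetical rows: two different employees whose concatenated keys coincide.
demo = pd.DataFrame({
    'table_num': ['1234', '234'],
    'name': ['Ann', 'Ann1'],
    'salary': [1000., 1200.],
})
demo['temp'] = demo['name'] + demo['table_num']  # both rows become 'Ann1234'

n_concat = demo.groupby('temp').ngroups                 # groups by the merged string
n_pairs = demo.groupby(['name', 'table_num']).ngroups   # groups by the real key pair
print(n_concat, n_pairs)
# 1 2

# With the merged key the second row is compared with an unrelated employee.
diff_concat = demo.groupby('temp')['salary'].diff()
diff_pairs = demo.groupby(['name', 'table_num'])['salary'].diff()
print(diff_concat.tolist())
# [nan, 200.0]
print(diff_pairs.tolist())
# [nan, nan]
```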
