I have a DataFrame containing employee salary information. It has roughly 900,000+ rows.
Sample:
+----+-------------+---------------+----------+
| | table_num | name | salary |
|----+-------------+---------------+----------|
| 0 | 001234 | John Johnson | 1200 |
| 1 | 001234 | John Johnson | 1000 |
| 2 | 001235 | John Johnson | 1000 |
| 3 | 001235 | John Johnson | 1200 |
| 4 | 001235 | John Johnson | 1000 |
| 5 | 001235 | Steve Stevens | 1000 |
| 6 | 001236 | Steve Stevens | 1200 |
| 7 | 001236 | Steve Stevens | 1200 |
| 8 | 001236 | Steve Stevens | 1200 |
+----+-------------+---------------+----------+
dtypes:
table_num: string
name: string
salary: float
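I need to add a column that indicates whether the salary increased or decreased compared to the previous row. I am using the shift() function to compare values between rows.
The main problem is filtering and iterating over all unique employees across the whole dataset.
My script takes about three and a half hours to run.
How can I make it faster?
My script: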
# keep only the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can occur with different names
# and the same name sometimes appears with different 'table_num'
names_df = df[['table_num', 'name']].drop_duplicates()
# then extract each table_num and name pair
for i in range(len(names_df)):  ### Bottleneck of whole script ###
    t = names_df.iloc[i, 0]
    n = names_df.iloc[i, 1]
    # use shift() and a lambda to check for a difference between consecutive rows
    mask = (df['table_num'] == t) & (df['name'] == n)
    diff_sal = (df.loc[mask, 'salary'] - df.loc[mask, 'salary'].shift(1)).apply(
        lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values
Example input data:
import pandas as pd
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
                   'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
                   'salary': [1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
Example output:
+----+-------------+---------------+----------+-------+
| | table_num | name | salary | inc |
|----+-------------+---------------+----------+-------|
| 0 | 001234 | John Johnson | 1200 | 0 |
| 1 | 001234 | John Johnson | 1000 | -1 |
| 2 | 001235 | John Johnson | 1000 | 0 |
| 3 | 001235 | John Johnson | 1200 | 1 |
| 4 | 001235 | John Johnson | 1000 | -1 |
| 5 | 001235 | Steve Stevens | 1000 | 0 |
| 6 | 001236 | Steve Stevens | 1200 | 0 |
| 7 | 001236 | Steve Stevens | 1200 | 0 |
| 8 | 001236 | Steve Stevens | 1200 | 0 |
+----+-------------+---------------+----------+-------+
Use groupby together with diff:
df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
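As a sanity check, here is a minimal sketch that compares the per-employee loop from the question with the groupby/diff version on the sample data. The function names inc_loop and inc_groupby are only illustrative (they are not pandas API), and the loop is a cleaned-up rendering of the question's script rather than a verbatim copy.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})


def inc_loop(frame):
    """Per-employee loop in the spirit of the question (slow on large data)."""
    out = pd.Series(0.0, index=frame.index)
    pairs = frame[['table_num', 'name']].drop_duplicates()
    for t, n in pairs.itertuples(index=False):
        sal = frame.loc[(frame['table_num'] == t) & (frame['name'] == n), 'salary']
        d = sal - sal.shift(1)
        out.loc[d.index] = d.apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    return out


def inc_groupby(frame):
    """Same result with a single grouped diff, no Python-level loop."""
    # First row of each group has no predecessor, so its diff is NaN -> 0.
    inc = frame.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
    inc[inc > 0.0] = 1.0
    inc[inc < 0.0] = -1.0
    return inc


same = inc_loop(df).tolist() == inc_groupby(df).tolist()
print(same)
```

On the sample data both functions produce the inc column shown in the question's expected output.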
Use DataFrameGroupBy.diff with numpy.sign, then cast the result to integers:
import numpy as np
df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print(df)
table_num name salary new
0 1234 John Johnson 1200 0
1 1234 John Johnson 1000 -1
2 1235 John Johnson 1000 0
3 1235 John Johnson 1200 1
4 1235 John Johnson 1000 -1
5 1235 Steve Stevens 1000 0
6 1236 Steve Stevens 1200 0
7 1236 Steve Stevens 1200 0
8 1236 Steve Stevens 1200 0
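shift() is the way to go, but you should avoid loops wherever possible. Here we can take advantage of the power of groupby() and transform(). See the pandas documentation.
In your case, you could write the following: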
# group by both keys, since the same name can appear under different table_num
df.assign(inc=lambda x: x.groupby(['table_num', 'name'])
                         .salary
                         .transform(lambda y: y - y.shift(1))
                         .apply(lambda d: 1 if d > 0 else (-1 if d < 0 else 0))
          )
Yields:
  table_num           name  salary  inc
0    001234   John Johnson  1200.0    0
1    001234   John Johnson  1000.0   -1
2    001235   John Johnson  1000.0    0
3    001235   John Johnson  1200.0    1
4    001235   John Johnson  1000.0   -1
5    001235  Steve Stevens  1000.0    0
6    001236  Steve Stevens  1200.0    0
7    001236  Steve Stevens  1200.0    0
8    001236  Steve Stevens  1200.0    0
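Which columns go into groupby() matters here. The question notes that the same name can appear under different table_num values, so grouping by name alone compares salaries across table_num boundaries. A small sketch on the sample data makes this visible; sign_of_change is a hypothetical helper name used only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})


def sign_of_change(frame, keys):
    # Sign of the salary change relative to the previous row in the same group.
    return np.sign(frame.groupby(keys)['salary'].diff().fillna(0)).astype(int)


by_name = sign_of_change(df, ['name'])
by_both = sign_of_change(df, ['table_num', 'name'])

# Rows where the two groupings disagree.
differing_rows = df.index[by_name != by_both].tolist()
print(differing_rows)  # → [6]
```

Row 6 is the first Steve Stevens row under table_num 001236. Grouped by name only, it is compared with row 5 (1000 under 001235) and marked as an increase; grouped by both keys it starts a new group and gets 0, matching the expected output in the question.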
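I think you can search for the term "pandas vectorization" to find ways to speed up DataFrame operations. For your problem, could you try the following: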
import pandas as pd
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
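# combined key per employee: name and table_num together identify a group
df['temp'] = df['name'] + df['table_num']
# use a stable sort so the original row order within each group is preserved
df.sort_values('temp', kind='mergesort', inplace=True)
df['diff'] = df.groupby('temp')['salary'].diff()
# diff / |diff| gives 1 or -1; 0/0 and the first row of each group are NaN, filled with 0
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)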
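The concatenated temp key and the sort are probably not required: groupby() accepts a list of columns, and a grouped diff() returns values aligned to the original index. The following sketch, run on the sample data, checks that both routes give the same diff column. The variable names a, b and same are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

# Route 1: concatenated key plus a stable sort, as in the answer above.
a = df.copy()
a['temp'] = a['name'] + a['table_num']
a.sort_values('temp', kind='mergesort', inplace=True)
d = a.groupby('temp')['salary'].diff()
a['diff'] = (d / d.abs()).fillna(0)

# Route 2: group on both columns directly, without a helper column or sort.
b = df.copy()
d = b.groupby(['table_num', 'name'], sort=False)['salary'].diff()
b['diff'] = (d / d.abs()).fillna(0)

# Restore the original row order of route 1 before comparing.
same = a.sort_index()['diff'].equals(b['diff'])
print(same)
```

Grouping on the two columns also avoids the (unlikely) case where two different name and table_num pairs concatenate to the same string.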