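I'm trying to assign values in a pandas df. Specifically, for the df below, I want to use Column['On'] to determine how many values are currently occurring, and then assign those values to groups of 3. So the values: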
1-3 = 1
4-6 = 2
7-9 = 3 etc
This could go up to 20-30 values. I've considered np.where, but it isn't very efficient and I'm getting an error.
import pandas as pd
import numpy as np
d = ({
'On' : [1,2,3,4,5,6,7,7,6,5,4,3,2,1],
})
df = pd.DataFrame(data=d)
This call works:
df['P'] = np.where(df['On'] == 1, df['On'],1)
But if I try to apply it to other values, I get an error:
df = df['P'] = np.where(df['On'] == 1, df['On'],1)
df = df['P'] = np.where(df['On'] == 2, df['On'],1)
df = df['P'] = np.where(df['On'] == 3, df['On'],1)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
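The IndexError most likely comes from the chained assignment rather than from np.where itself. In Python, df = df['P'] = value assigns to the targets from left to right, so df is rebound to the NumPy array first, and that array is then indexed with 'P'. A minimal sketch reproducing this (the names result, arr, and error_message are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"On": [1, 2, 3, 4, 5, 6, 7, 7, 6, 5, 4, 3, 2, 1]})
result = np.where(df["On"] == 1, df["On"], 1)  # a plain NumPy array

# `df = df['P'] = result` behaves like the two steps below:
arr = result              # step 1: the name is rebound to the array
try:
    arr["P"] = result     # step 2: a string index on an ndarray
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

Dropping the leading "df =" so that each line reads df['P'] = np.where(...) should avoid the error, although each call would then overwrite the previous contents of P.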
You can use a Series mask and loc:
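df['P'] = float('nan')
# Pass the row mask and the column label to a single .loc call;
# df['P'].loc[mask] = ... is chained assignment and may not update df
df.loc[(df['On'] >= 1) & (df['On'] <= 3), 'P'] = 1
df.loc[(df['On'] >= 4) & (df['On'] <= 6), 'P'] = 2
# ...etc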
This is easy to extend with a loop:
j = 1
for i in range(1, 20):
    # rows with j <= On <= j + 2 get group number i
    df.loc[(df['On'] >= j) & (df['On'] <= (j + 2)), 'P'] = i
    j += 3
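If the ranges are easier to read as explicit bin edges, pd.cut is one possible alternative to the loop. This is only a sketch and not part of the answer above; the bin edges 0, 3, 6, ..., 30 are an assumption based on the 20-30 values mentioned in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"On": [1, 2, 3, 4, 5, 6, 7, 7, 6, 5, 4, 3, 2, 1]})

# Edges 0, 3, 6, ..., 30 give the right-closed bins (0, 3], (3, 6], ...
edges = np.arange(0, 31, 3)

# labels=False returns the zero-based bin number, so add 1
df["P"] = pd.cut(df["On"], bins=edges, labels=False) + 1

print(df["P"].tolist())
```

Values outside the outermost edges would come back as NaN, so the edges need to cover the full range of On.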
With some basic math and vectorization you can get much better performance.
import pandas as pd
import numpy as np
n = 1000
df = pd.DataFrame({"On":np.random.randint(1,20, n)})
Alexg's solution
%%time
j = 1
df["P"] = np.nan
for i in range(1, 20):
    df['P'].loc[(df['On'] >= j) & (df['On'] <= (j+2))] = i
    j += 3
CPU times: user 2.11 s, sys: 0 ns, total: 2.11 s
Wall time: 2.11 s
Proposed solution
%%time
df["P"] = np.ceil(df["On"]/3)
CPU times: user 2.48 ms, sys: 0 ns, total: 2.48 ms
Wall time: 2.15 ms
Speedup: ~1000x
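For reference, here is a small sketch applying the same formula to the sample data from the question. np.ceil returns floats, so an integer variant, (On + 2) // 3, is shown next to it on the assumption that integer group numbers may be preferred; the column name P_int is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"On": [1, 2, 3, 4, 5, 6, 7, 7, 6, 5, 4, 3, 2, 1]})

# Ceiling of On / 3 maps 1-3 to 1, 4-6 to 2, 7-9 to 3, and so on
df["P"] = np.ceil(df["On"] / 3)      # float64 result

# Same grouping using integer arithmetic only (positive integers assumed)
df["P_int"] = (df["On"] + 2) // 3    # integer result

print(df)
```

Both columns hold the same group numbers; only the dtype differs.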