Computing a new column on a subset of a pandas DataFrame raises a "column not found" error



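I want to sum the values of two columns on a subset of a DataFrame that meets certain conditions, creating a new column in the process. I'm not sure how to handle this sum, because I get an error when I try to access the new column created along the way: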

import pandas as pd
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3]
}
df=pd.DataFrame(d1)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno #Selects the rows matching RUNno
    df[df1]['NewColumn']=df[df1]['X']+df[df1]['Y'] #For the selected dataset, calculates the sum of two columns and creates a new column
    print(df[df1].NewColumn) #Print the contents of the new column

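I can't get the contents of df[df1].NewColumn, because the key NewColumn is not recognized. I'm fairly sure this way of creating a new column works on the plain DataFrame df, but I don't understand why it doesn't work on df[df1]. For example,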

df['NewColumn']=df['X']+df['Y'] 
df.NewColumn 

works without any problem.

Update: the entries that are added together to form the new column come from two different DataFrames.

import pandas as pd
from scipy.interpolate import interp1d
interpolating_functions=dict()
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] }
d2={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] }
df=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno
    df3=df.RUN==RUNno
    interpolating_functions[RUNno]=interp1d(df2[df3].X,df2[df3].Y)
    df[df1]['NewColumn']=df[df1]['X']+interpolating_functions[RUNno](df2[df3]['X'])
    print(df[df1].NewColumn)

Use GroupBy.apply with a custom function that creates the new column and then returns each group (here x):

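def func(x):
    #check groups
    print (x)
    #working with groups DataFrame x
    x['NewColumn']=x['X']+x['Y']
    return x
#group_keys=False keeps the original index (newer pandas versions add the group key to the index by default)
df = df.groupby('RUN', group_keys=False).apply(func)
print (df)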
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
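The same per-group idea can also be sketched for the interpolation case from the updated question. This is only an illustration under a few assumptions: it uses a shortened copy of the sample data, it swaps scipy's interp1d for numpy.interp (both interpolate linearly by default, and numpy.interp needs X to be increasing within each run), and the name new_column is made up for this example.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data from the question (two runs only)
data = {
    'X': [1, 10, 100, 1000, 1, 10, 100, 1000],
    'Y': [0.2, 0.5, 0.4, 1.2, 0.1, 0.25, 0.2, 0.6],
    'RUN': [1, 1, 1, 1, 2, 2, 2, 2],
}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)  # second frame with the same layout, as in the question

# Collect results in a separate Series so df is not modified while iterating
new_column = pd.Series(np.nan, index=df.index, dtype=float)

for run, group in df.groupby('RUN'):
    # Reference points for this run come from the second frame
    ref = df2[df2['RUN'] == run]
    # Linear interpolation (assumption: stands in for interp1d)
    interpolated = np.interp(group['X'].to_numpy(),
                             ref['X'].to_numpy(),
                             ref['Y'].to_numpy())
    # Write back using the group's original index labels
    new_column.loc[group.index] = group['X'].to_numpy() + interpolated

df['NewColumn'] = new_column
print(df)
```

Because the interpolation is evaluated at the same X values it was built from, NewColumn here equals X + Y, which makes the result easy to check by eye.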

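It seems you need loc to select the column by the mask; the only requirement is that the index has the same length in both DataFrames: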

for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno
    df3=df.RUN==RUNno
    interpolating_functions[RUNno]=interp1d(df2.loc[df3, 'X'], df2.loc[df3,'Y'])
    df.loc[df1, 'NewColumn'] = df.loc[df1, 'X'] + interpolating_functions[RUNno](df2.loc[df3, 'X'])
print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
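To see why the original attempt fails while loc works, here is a minimal sketch on a small made-up frame. df[mask] with a boolean mask returns a new object, so assigning a column to it never reaches df. The warning filter is there only because pandas warns about this kind of chained assignment, and the variable names are chosen for this illustration.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    'X': [1, 10, 1, 10],
    'Y': [0.2, 0.5, 0.1, 0.25],
    'RUN': [1, 1, 2, 2],
})
mask = df['RUN'] == 1

# Chained indexing: the assignment lands on a temporary copy of the rows
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about chained assignment
    df[mask]['NewColumn'] = df[mask]['X'] + df[mask]['Y']

created_by_chained = 'NewColumn' in df.columns
print(created_by_chained)  # df itself was never changed

# Single .loc call: rows and column are addressed together on df itself
df.loc[mask, 'NewColumn'] = df.loc[mask, 'X'] + df.loc[mask, 'Y']
print(df)
```

Rows outside the mask hold NaN in NewColumn until a later assignment fills them, which is why the loop above ends up with a complete column once every RUN value has been processed.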
