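I want to compute the sum of values in columns over subsets of a dataframe that satisfy some condition, creating a new column in the process. I'm not sure how to handle the sum for this new column, because I get an error when I try to access the column created along the way: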
import pandas as pd
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
'RUN':[1,1,1,1,2,2,2,2,3,3,3,3]
}
df=pd.DataFrame(d1)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno #Selects the rows matching RUNno
    df[df1]['NewColumn']=df[df1]['X']+df[df1]['Y'] #For the selected dataset, calculates the sum of two columns and creates a new column
    print(df[df1].NewColumn) #Print the contents of the new column
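I can't get the contents of df[df1].NewColumn, because the key NewColumn is not recognized. I'm fairly sure this way of creating a new column works on the plain dataframe df, but I don't understand why it doesn't work on df[df1]. For example,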
df['NewColumn']=df['X']+df['Y']
df.NewColumn
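works without any problem.
To update the question: the column entries that are added to form the new column come from two different dataframes.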
import pandas as pd
from scipy.interpolate import interp1d
interpolating_functions=dict()
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] }
d2={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] }
df=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno
    df3=df.RUN==RUNno
    interpolating_functions[RUNno]=interp1d(df2[df3].X,df2[df3].Y)
    df[df1]['NewColumn']=df[df1]['X']+interpolating_functions[RUNno](df2[df3]['X'])
    print(df[df1].NewColumn)
Use a custom function with GroupBy.apply to create the new column, then return each group (here x):
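def func(x):
    #check groups
    print (x)
    #working with groups DataFrame x
    x['NewColumn']=x['X']+x['Y']
    return x

#group_keys=False keeps the original index on newer pandas versions,
#where apply would otherwise prepend RUN as an extra index level
df = df.groupby('RUN', group_keys=False).apply(func)
print (df)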
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
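The same group-wise idea can also be written as an explicit loop over the groups, which may be easier to adapt to the updated question where the added values come from a second dataframe. The following is only a sketch under assumptions: the two small frames are made-up illustration data, and numpy.interp (plain linear interpolation) stands in for scipy's interp1d so that the example needs nothing beyond numpy and pandas.

```python
import numpy as np
import pandas as pd

# Made-up illustration data (not the frames from the question).
df = pd.DataFrame({'X': [1, 10, 100, 1, 10, 100],
                   'RUN': [1, 1, 1, 2, 2, 2]})
df2 = pd.DataFrame({'X': [1, 100, 1, 100],
                    'Y': [0.2, 0.4, 0.1, 0.2],
                    'RUN': [1, 1, 2, 2]})

# Iterating over a groupby yields (key, sub-DataFrame) pairs.
for run, group in df.groupby('RUN'):
    # Reference points for this run, taken from the second dataframe.
    ref = df2[df2.RUN == run]
    # Linear interpolation of Y at the X values of this group
    # (numpy.interp is used here as a stand-in for interp1d).
    interpolated = np.interp(group['X'].to_numpy(),
                             ref['X'].to_numpy(),
                             ref['Y'].to_numpy())
    # group.index holds the original row labels, so the result can be
    # written back into df for exactly these rows.
    df.loc[group.index, 'NewColumn'] = group['X'] + interpolated

print(df)
```

Because the rows are written back through their original index labels, df2 does not need to have the same number of rows per run as df in this variant.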
It seems you need loc to select the column by a boolean mask; the index only needs to have the same length in both dataframes:
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno
    df3=df.RUN==RUNno
    interpolating_functions[RUNno]=interp1d(df2.loc[df3, 'X'], df2.loc[df3,'Y'])
    df.loc[df1, 'NewColumn'] = df.loc[df1, 'X'] + interpolating_functions[RUNno](df2.loc[df3, 'X'])
print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
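As for why the original df[df1]['NewColumn'] = ... had no effect: boolean indexing such as df[df1] gives back a separate object, so a column added to it never reaches df, while df.loc[mask, column] assigns into df itself. A minimal sketch of the difference, using a small made-up frame; the names mask and subset are only for illustration:

```python
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame({
    'X': [1, 10, 1, 10],
    'Y': [0.2, 0.5, 0.1, 0.25],
    'RUN': [1, 1, 2, 2],
})
mask = df.RUN == 1

# df[mask] is a separate object; the explicit .copy() only makes that
# visible and avoids pandas' SettingWithCopyWarning.
subset = df[mask].copy()
subset['NewColumn'] = subset['X'] + subset['Y']
has_column_after_subset = 'NewColumn' in df.columns  # df is unchanged

# A single .loc call with a row mask and a column label writes into df.
df.loc[mask, 'NewColumn'] = df.loc[mask, 'X'] + df.loc[mask, 'Y']
has_column_after_loc = 'NewColumn' in df.columns

print(has_column_after_subset, has_column_after_loc)
print(df)
```

Rows outside the mask are left as NaN in NewColumn until a later iteration of the loop fills them in.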