How to conditionally sum in pandas



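I am trying to populate a DataFrame with the results of calculations performed on a different DataFrame. The calculations should run on one Series whenever a condition is met in two other Series.

Here is what I have tried. I built a DataFrame, rswcapacity, on which the calculations should run, and then created another DataFrame, annualcapacity, in which I want to store the conditional results.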

import numpy as np
import pandas as pd

#First DataFrame
d = {'technology': ['EAF', 'EAF', 'EAF', 'BOF', 'BOF', 'BOF'], 'equip_detail1': [150, 130, 100, 200, 200, 150], 'equip_number': [1, 2, 3, 1, 2, 3], 'capacity_actual': [2400, 2080, 1600, 3200, 3200, 2400], 'start_year': [1992, 1993, 1994, 1989, 1990, 1991], 'closure_year': ['', 2002, '', '', 2001, 2011]}
rswcapacity = pd.DataFrame(data=d)
# '' marks equipment that has not closed; store those entries as NaN
rswcapacity['closure_year'] = rswcapacity['closure_year'].replace('', np.nan)
#Second DataFrame
annualcapacity = pd.DataFrame(columns=['years', 'capacity'])
annualcapacity['years'] = [1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
#Neither of the attempts below yields the desired results:
for y in years:
    annualcapacity['capacity'].append(rswcapacity['capacity_actual'].apply(lambda x : x['capacity_actual'].sum() (x['start_year'] >= y & (x['closure_year'] <= y | x['closure_year'].isnull()))).sum())
annualcapacity
#other attempt:
for y in years:
    if (rswcapacity['start_year'] >= y).any() & ((rswcapacity['closure_year'].isnull()).any() | (rswcapacity['closure_year'] <= y).any()):
        annualcapacity['capacity'].append(rswcapacity['capacity_actual'].sum())
annualcapacity

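The result I would like is a sum computed for each year. For example: 1985 should return NaN, because 1985 is smaller than any start_year; 1992 should return 14880, because 1992 is bigger than any start_year and smaller than any closure_year; 2001 should return 7200, because it is bigger than all start_year values and bigger than all closure_year values. Instead, all three of my attempts return only NaN for every year in the list.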

Something is wrong with how I set up the conditions, but I have not managed to figure out what.

Any insight is much appreciated!

You can do it as follows:

# use your list of years (1980 through 2020)
years = list(range(1980, 2021))
# collect one row per year in a plain list
# (DataFrame.append was removed in pandas 2.0, so the frame is built once at the end)
rows = []
for y in years:
    # a unit counts for year y if it has started and has not closed before y
    indexer = (rswcapacity['start_year'] <= y) & ((rswcapacity['closure_year'].isnull()) | (rswcapacity['closure_year'] >= y))
    # create a sum for each year
    capa = rswcapacity.loc[indexer, 'capacity_actual'].sum()
    rows.append({'years': y, 'capacity': float(capa)})
# build the result frame from the collected rows
annualcapacity = pd.DataFrame(rows)
annualcapacity

The result looks like this:

    years  capacity
0    1980       0.0
1    1981       0.0
2    1982       0.0
3    1983       0.0
4    1984       0.0
5    1985       0.0
6    1986       0.0
7    1987       0.0
8    1988       0.0
9    1989    3200.0
10   1990    6400.0
11   1991    8800.0
12   1992   11200.0
13   1993   13280.0
14   1994   14880.0
15   1995   14880.0
16   1996   14880.0
17   1997   14880.0
18   1998   14880.0
19   1999   14880.0
20   2000   14880.0
21   2001   14880.0
22   2002   11680.0
23   2003    9600.0
24   2004    9600.0
25   2005    9600.0
26   2006    9600.0
27   2007    9600.0
28   2008    9600.0
29   2009    9600.0
30   2010    9600.0
31   2011    9600.0
32   2012    7200.0
33   2013    7200.0
34   2014    7200.0
35   2015    7200.0
36   2016    7200.0
37   2017    7200.0
38   2018    7200.0
39   2019    7200.0
40   2020    7200.0
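
The per-year loop above can also be written without an explicit loop. The following is only a sketch under some assumptions: it rebuilds a reduced copy of the question's sample data, treats a missing closure_year as "never closes" by filling it with infinity, and uses NumPy broadcasting to compare every year against every unit at once. The names start, close, capacity and active are introduced here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's sample data; NaN marks units that never closed.
rswcapacity = pd.DataFrame({
    'capacity_actual': [2400, 2080, 1600, 3200, 3200, 2400],
    'start_year': [1992, 1993, 1994, 1989, 1990, 1991],
    'closure_year': [np.nan, 2002, np.nan, np.nan, 2001, 2011],
})

years = np.arange(1980, 2021)

start = rswcapacity['start_year'].to_numpy()
# Assumption: a unit without a closure year stays open indefinitely.
close = rswcapacity['closure_year'].fillna(np.inf).to_numpy()
capacity = rswcapacity['capacity_actual'].to_numpy()

# active[i, j] is True when unit j is running in years[i].
active = (start <= years[:, None]) & (close >= years[:, None])

# Zero out inactive units, then sum across units for each year.
annualcapacity = pd.DataFrame({
    'years': years,
    'capacity': (active * capacity).sum(axis=1).astype(float),
})
```

For 41 years and 6 units the loop is perfectly adequate; the broadcast form mainly pays off when both the number of years and the number of rows grow large.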

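Note: a sum is always a number, so a year with no capacity gets 0.0 rather than NaN. If you need NaN for some reason, you can replace the zeros with the line shown at the end of this answer. The second point is that I changed your condition,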

(rswcapacity['start_year'] >= y) & ((rswcapacity['closure_year'].isnull()) | (rswcapacity['closure_year'] <= y))

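so that >= became <= and <= became >=, because I assume you want to sum up all the capacity that is available in the given year, right?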
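
To make the effect of flipping the comparisons concrete, here is a small sketch that evaluates both versions of the condition for a single year. It assumes a reduced copy of the question's sample data and picks 2002 as an arbitrary example year; the variable names (started, not_closed, active, original) are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's sample data; NaN marks units that never closed.
rswcapacity = pd.DataFrame({
    'technology': ['EAF', 'EAF', 'EAF', 'BOF', 'BOF', 'BOF'],
    'capacity_actual': [2400, 2080, 1600, 3200, 3200, 2400],
    'start_year': [1992, 1993, 1994, 1989, 1990, 1991],
    'closure_year': [np.nan, 2002, np.nan, np.nan, 2001, 2011],
})

y = 2002

# Flipped condition: the unit has started and has not closed before year y.
# Each comparison sits in its own parentheses because & and | bind
# more tightly than <= and >= in Python.
started = rswcapacity['start_year'] <= y
not_closed = rswcapacity['closure_year'].isnull() | (rswcapacity['closure_year'] >= y)
active = started & not_closed

# Condition as written in the question: only units starting in or after year y.
original = (rswcapacity['start_year'] >= y) & (
    rswcapacity['closure_year'].isnull() | (rswcapacity['closure_year'] <= y)
)

print(rswcapacity.loc[active])  # every unit except the BOF that closed in 2001
total_active = rswcapacity.loc[active, 'capacity_actual'].sum()
total_original = rswcapacity.loc[original, 'capacity_actual'].sum()
```

With the question's direction no unit qualifies for 2002, since none starts that late, so the sum over the selection is 0; with the flipped direction five of the six units are selected.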

So if you need NaN entries instead of 0.0 (for years in which no capacity is available at all), you can do the following:

annualcapacity.loc[annualcapacity['capacity'] == 0, 'capacity'] = np.nan

For this to work, you need to add import numpy as np at the top of your script.
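
As an alternative to replacing 0.0 after the fact, pandas can return NaN directly when nothing is selected: Series.sum accepts a min_count argument, and with min_count=1 the sum of an empty selection is NaN instead of 0. The sketch below assumes a reduced copy of the question's sample data; capacity_in_year is a hypothetical helper introduced only for this example.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's sample data; NaN marks units that never closed.
rswcapacity = pd.DataFrame({
    'capacity_actual': [2400, 2080, 1600, 3200, 3200, 2400],
    'start_year': [1992, 1993, 1994, 1989, 1990, 1991],
    'closure_year': [np.nan, 2002, np.nan, np.nan, 2001, 2011],
})


def capacity_in_year(y):
    """Hypothetical helper: total capacity running in year y, NaN if none."""
    indexer = (rswcapacity['start_year'] <= y) & (
        rswcapacity['closure_year'].isnull() | (rswcapacity['closure_year'] >= y)
    )
    # min_count=1 makes sum() return NaN unless at least one value is summed.
    return rswcapacity.loc[indexer, 'capacity_actual'].sum(min_count=1)


cap_1985 = capacity_in_year(1985)  # no unit has started yet
cap_1992 = capacity_in_year(1992)  # four units are running
```

This keeps a genuine zero (units selected but summing to 0) distinct from "no units selected", which the == 0 replacement cannot do.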
