Difference between the pandas aggregators .first() and .last()

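I'm curious what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but my understanding is that if you pass an argument to first or last, e.g. 3, it returns the first 3 months or the first 3 years.

In this case, since I'm not passing any arguments to first() and last(), what is it actually doing when I resample like this? I know that if I resample by chaining .mean(), I resample into years using the mean of all the months, but what happens when I use last()?

More importantly, why do first() and last() give me different answers in this context? I can see that they are numerically not equal.

i.e.: post2008.resample().first() != post2008.resample().last()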

TLDR:

What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?

This is the code before the aggregation:

import pandas as pd

# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))

This is what print(post2008.tail(8)) outputs:

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5

Here is the code that resamples and aggregates with last():

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)

This is what yearly looks like after post2008.resample('A').last():

              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5

Here is the code that resamples and aggregates with first():

# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)

This is what yearly looks like after post2008.resample('A').first():

              VALUE
DATE               
2008-12-31  14668.4
2009-12-31  14383.9
2010-12-31  14681.1
2011-12-31  15238.4
2012-12-31  15973.9
2013-12-31  16475.4
2014-12-31  17025.2
2015-12-31  17783.6
2016-12-31  18281.6

First, let's create a DataFrame with some sample data:

import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)

The output will be:

            VALUE
2014-07-01   1000
2014-10-01   2000
2015-01-01   3000
2015-04-01   4000
2015-07-01   5000
2015-07-01   6000
2016-01-01   7000
2016-04-01   8000

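If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we select the first six months of data, which in the example above consists of just two rows: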

print(df.first('6M'))

            VALUE
2014-07-01   1000
2014-10-01   2000

Similarly, last returns only the rows that belong to the last six months of data:

print(df.last('6M'))

            VALUE
2016-01-01   7000
2016-04-01   8000

In this context, not passing the required argument results in an error:

print(df.first())

TypeError: first() missing 1 required positional argument: 'offset'
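As far as I know, recent pandas releases deprecate the offset forms of DataFrame.first and DataFrame.last, so it can help to see the same selection written as an explicit mask. The following is only a sketch that approximates what '6M' selects on this sample data; the names first_six and last_six are made up for illustration, and the exact boundary handling of the built-in methods may differ.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)

# Rows dated less than six calendar months after the first timestamp
start = df.index[0]
first_six = df.loc[df.index < start + pd.DateOffset(months=6)]

# Rows dated less than six calendar months before the last timestamp
end = df.index[-1]
last_six = df.loc[df.index > end - pd.DateOffset(months=6)]

print(first_six)  # rows for 2014-07-01 and 2014-10-01
print(last_six)   # rows for 2016-01-01 and 2016-04-01
```

Either way, this is row selection on the whole frame, which is a different operation from the per-group aggregation that a resampler performs.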

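On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods such as first, last and mean. In this case, they keep only the first (respectively, last) value of each year (instead of e.g. averaging all values, or some other kind of aggregation):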

print(df.resample('Y').first())

            VALUE
2014-12-31   1000
2015-12-31   3000  # This is the first of the 4 values from 2015
2016-12-31   7000

print(df.resample('Y').last())

            VALUE
2014-12-31   2000
2015-12-31   6000  # This is the last of the 4 values from 2015
2016-12-31   8000
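To connect this back to the question, here is a small sketch that rebuilds the eight rows shown in post2008.tail(8) and picks the first and last value per calendar year. It groups on the index year with groupby as a stand-in for resample, purely to sidestep frequency-alias differences between pandas versions; gdp_tail and summary are illustrative names, not part of the original code.

```python
import pandas as pd

# The eight quarterly rows printed by post2008.tail(8) in the question
dates = pd.to_datetime(['2014-07-01', '2014-10-01', '2015-01-01', '2015-04-01',
                        '2015-07-01', '2015-10-01', '2016-01-01', '2016-04-01'])
gdp_tail = pd.DataFrame({'VALUE': [17569.4, 17692.2, 17783.6, 17998.3,
                                   18141.9, 18222.8, 18281.6, 18436.5]},
                        index=dates)

# Group rows by calendar year, then keep the first and last value of each group
by_year = gdp_tail.groupby(gdp_tail.index.year)['VALUE']
summary = pd.DataFrame({'first': by_year.first(), 'last': by_year.last()})
print(summary)
```

For 2015 this gives 17783.6 (the January row) as first and 18222.8 (the October row) as last, matching the 2015 rows of the two yearly tables in the question. The 2014 first value differs from the question's table only because this sketch starts from the tail of the data and lacks the earlier 2014 quarters. Since GDP changes from quarter to quarter, the first and last value of a year are different numbers, which is why the two resampled frames are not equal.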

As an additional example, consider also the case of grouping by a smaller period:

print(df.resample('M').last().head())

             VALUE
2014-07-31  1000.0  # This is the last (and only) value from July 2014
2014-08-31     NaN  # No data for August 2014
2014-09-30     NaN  # No data for September 2014
2014-10-31  2000.0
2014-11-30     NaN  # No data for November 2014

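In this case, any periods for which there is no value are filled with NaN. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.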
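Both claims can be checked with a short sketch. It assumes the 'MS' (month start) alias instead of 'M', so the bin labels are the first day of each month rather than the last, and it uses a shortened three-row frame; the variable names are illustrative.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01'])
df = pd.DataFrame({'VALUE': [1000, 2000, 3000]}, index=dates)

# One bin per calendar month from July 2014 through January 2015
monthly = df.resample('MS')
first_m = monthly.first()
last_m = monthly.last()

# Months without any rows come back as NaN
print(first_m)

# With at most one row per month, first and last pick the same row
print(first_m.equals(last_m))
```

Of the seven monthly bins, four have no rows and are therefore NaN, and the two results are identical because no bin contains more than one value.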
