Python Pandas: aggregate past rows when a time series condition is met



I have a time series problem where I want to aggregate some data based on the values that appear in a certain column. To illustrate, consider the table below:

Date        colA  colB  colC
2019-01-01     1   -10  null
2019-01-02     2    -5  null
2019-01-03     3     0   101
2019-01-04     4     5   101
2019-01-05     5    10   101
2019-01-06     6    15  null
2019-01-07     7    20   101

Try with groupby:

# convert Date column to datetime if needed
df["Date"] = pd.to_datetime(df["Date"])

# keep only rows where there aren't consecutive non-null values
df2 = df[~(df["colC"].notnull() & df["colC"].shift().notnull())]

# group blocks of consecutive rows and aggregate
output = (df2.groupby(df2["colC"].notnull().shift().cumsum().fillna(0))
             .agg({"colA": "mean",
                   "colB": "mean",
                   "colC": "first",
                   "Date": lambda x: (x.max() - x.min()).days})
             .rename_axis(None)
             .rename(columns={"Date": "Delta"}))
>>> output
colA  colB   colC  Delta
0.0   2.0  -5.0  101.0      2
1.0   6.5  17.5  101.0      1
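To see which rows the filter keeps, the mask can be evaluated on the question's sample data (a minimal sketch; the DataFrame is rebuilt here from the table above):

```python
import pandas as pd
import numpy as np

# Sample frame from the question (NaN marks the null cells)
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03",
                            "2019-01-04", "2019-01-05", "2019-01-06",
                            "2019-01-07"]),
    "colA": [1, 2, 3, 4, 5, 6, 7],
    "colB": [-10, -5, 0, 5, 10, 15, 20],
    "colC": [np.nan, np.nan, 101.0, 101.0, 101.0, np.nan, 101.0],
})

# True where the current AND previous colC are both non-null;
# those are the "repeat" non-null rows that get dropped
mask = df["colC"].notnull() & df["colC"].shift().notnull()
df2 = df[~mask]
print(df2.index.tolist())  # [0, 1, 2, 5, 6] -- rows 3 and 4 are dropped
```

Only the first non-null row of each run survives, which is why `"first"` on colC then picks up the value that ends each block of NaNs.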

You can set up groups for colC with cumsum() and then group with .groupby(), as follows

(assuming the null values are NaN or None):

# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# drop a row when both its colC value and the previous row's colC value are non-null
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]
# set grouping for `colC`
group = (df2['colC'].shift(1).notna() & df2['colC'].isna()).cumsum()
df_out = (df2.groupby(group, as_index=False)
.agg(**{'agg(colC)':('colC', 'last'), 
'avg(colA)':('colA', 'mean'),
'avg(colB)':('colB', 'mean'), 
'delta(Date)':('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
)

Data input:

import pandas as pd
import numpy as np

data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07'],
'colA': [1, 2, 3, 4, 5, 6, 7],
'colB': [-10, -5, 0, 5, 10, 15, 20],
'colC': [np.nan, np.nan, 101.0, 101.0, 101.0, np.nan, 101.0]}
df = pd.DataFrame(data)
Date  colA  colB   colC
0  2019-01-01     1   -10    NaN
1  2019-01-02     2    -5    NaN
2  2019-01-03     3     0  101.0
3  2019-01-04     4     5  101.0
4  2019-01-05     5    10  101.0
5  2019-01-06     6    15    NaN
6  2019-01-07     7    20  101.0

Result:

print(df_out)
agg(colC)  avg(colA)  avg(colB)  delta(Date)
0      101.0        2.0       -5.0            2
1      101.0        6.5       17.5            1

Explanation

Let's add more data to illustrate better:

data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12', '2019-01-13', '2019-01-14', '2019-01-15'],
'colA': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 
'colB': [-10, -5, 0, 5, 10, 15, 15, 15, 20, 20, 15, 15, 15, 20, 20],
'colC': [np.nan, np.nan, 101, 101, 101, np.nan, np.nan, 102, 103, 104, np.nan, np.nan, np.nan, 112, 113]}
df = pd.DataFrame(data)

Date  colA  colB   colC
0   2019-01-01     1   -10    NaN
1   2019-01-02     2    -5    NaN
2   2019-01-03     3     0  101.0
3   2019-01-04     4     5  101.0
4   2019-01-05     5    10  101.0
5   2019-01-06     6    15    NaN
6   2019-01-07     7    15    NaN
7   2019-01-08     8    15  102.0
8   2019-01-09     9    20  103.0
9   2019-01-10    10    20  104.0
10  2019-01-11    11    15    NaN
11  2019-01-12    12    15    NaN
12  2019-01-13    13    15    NaN
13  2019-01-14    14    20  112.0
14  2019-01-15    15    20  113.0

After converting the dates, we discard the unwanted rows by keeping only the rows where colC is NaN, or where the current row's colC is not NaN but the previous row's colC is NaN.

# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]

Result:

print(df2)
Date  colA  colB   colC
0  2019-01-01     1   -10    NaN
1  2019-01-02     2    -5    NaN
2  2019-01-03     3     0  101.0
5  2019-01-06     6    15    NaN
6  2019-01-07     7    15    NaN
7  2019-01-08     8    15  102.0
10 2019-01-11    11    15    NaN
11 2019-01-12    12    15    NaN
12 2019-01-13    13    15    NaN
13 2019-01-14    14    20  112.0

Now, to set up the groups, let's create a helper column so the intermediate values are visible in the dataframe for illustration:

df2['Group_indicator'] = df2['colC'].shift(1).notna() & df2['colC'].isna()

We build the boolean mask so that the first entry of each group is True and the others are False (the very first group is an exception, since its first entry is False, but that causes no problem when we generate sequence numbers for the groups).

Result:

Date  colA  colB   colC  Group_indicator
0  2019-01-01     1   -10    NaN            False
1  2019-01-02     2    -5    NaN            False
2  2019-01-03     3     0  101.0            False
5  2019-01-06     6    15    NaN             True
6  2019-01-07     7    15    NaN            False
7  2019-01-08     8    15  102.0            False
10 2019-01-11    11    15    NaN             True
11 2019-01-12    12    15    NaN            False
12 2019-01-13    13    15    NaN            False
13 2019-01-14    14    20  112.0            False
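The exception for the first row can be checked on a tiny hand-made series (a sketch, separate from the answer's data):

```python
import pandas as pd
import numpy as np

# The very first group's indicator is False (there is no previous row),
# so cumsum() numbers it 0 instead of starting at 1 -- the grouping is
# still consistent, the labels just start at 0.
colC = pd.Series([np.nan, 101.0, np.nan, 102.0])
indicator = colC.shift(1).notna() & colC.isna()
print(indicator.cumsum().tolist())  # [0, 0, 1, 1]
```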

Then we generate the group numbers:

df2['Group'] = df2['Group_indicator'].cumsum()

Result:

Date  colA  colB   colC  Group_indicator  Group
0  2019-01-01     1   -10    NaN            False      0
1  2019-01-02     2    -5    NaN            False      0
2  2019-01-03     3     0  101.0            False      0
5  2019-01-06     6    15    NaN             True      1
6  2019-01-07     7    15    NaN            False      1
7  2019-01-08     8    15  102.0            False      1
10 2019-01-11    11    15    NaN             True      2
11 2019-01-12    12    15    NaN            False      2
12 2019-01-13    13    15    NaN            False      2
13 2019-01-14    14    20  112.0            False      2

You can see that a sequence number is generated for each group, and all entries of the same group share the same group number.

After that, we group by this group number and aggregate the rows:

df_out = (df2.groupby('Group', as_index=False)
.agg(**{'agg(colC)':('colC', 'last'), 
'avg(colA)':('colA', 'mean'),
'avg(colB)':('colB', 'mean'), 
'delta(Date)':('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
)
Here, the colC entry we are interested in is the last one within each group, so we aggregate colC with 'last'.

For the aggregation on Date that yields delta(Date), we take the last and first Date entries of the group via x.iloc[-1] and x.iloc[0] (i.e. the last and first entries of the pandas Series of Date within the group) and compute their difference by subtraction. Since this is a time series, the last/first entries correspond to the maximum/minimum Date within the group.
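The lambda can be tried in isolation on a small hand-made Date series (a sketch with made-up dates):

```python
import pandas as pd

# One group's Date values, as they would arrive inside the lambda
dates = pd.Series(pd.to_datetime(["2019-01-06", "2019-01-07", "2019-01-08"]))

# Last minus first entry gives a Timedelta; .days extracts the day count
delta = (dates.iloc[-1] - dates.iloc[0]).days
print(delta)  # 2
```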

Hence, the result:

Group  agg(colC)  avg(colA)  avg(colB)  delta(Date)
0      0      101.0        2.0      -5.00            2
1      1      102.0        7.0      15.00            2
2      2      112.0       12.5      16.25            3

Here, the result has a Group column, which the output of our original code version does not. That is because there we defined a separate pandas Series group for the grouping, rather than a helper column Group as in this walkthrough.
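That groupby accepts a standalone, index-aligned Series as the key can be seen in a minimal sketch (toy data, not the answer's frame):

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 4]})
key = pd.Series([0, 0, 1, 1])  # a standalone Series, not a column of df

# The key ends up in the index, not as an output column
out = df.groupby(key).agg(total=("val", "sum"))
print(out.columns.tolist())  # ['total']
print(out["total"].tolist())  # [3, 7]
```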
