Python: maximum occurrence of consecutive days



I have an input file:

ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300

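I am trying to find, for each ROLL_NO, the maximum count of consecutive days on which FEES is 0. If a ROLL_NO never has FEES equal to zero, the maximum count for that ROLL_NO should be zero.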

Expected output:

ID,ROLL_NO,MAX_CNT --  First occurrence of ID for a particular ROLL_NO should come as ID in output 
1,12345,3
3,987654,0

This is what I have come up with so far:

import pandas as pd
df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLL_NO','ADM_DATE'])['ROLL_NO'].transform(pd.Series.value_counts)
print(df)

However, I don't think this is the right way to solve the problem.

Can someone please help a newbie here?

You can use:

#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
0     1
1     1
2     1
3     2
4     2
5     3
6     4
7     4
8     4
9     5
10    5
dtype: int32
#filter 0 FEES, count, get max per first level and last add missing roll no by reindex
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .groupby(level=0).max()
               .reindex(df['ROLL_NO'].unique(), fill_value=0)
               .reset_index(name='MAX_CNT'))
print (df1)
   ROLL_NO  MAX_CNT
0    12345        3
1   987654        0
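
For anyone who wants to try this without the input file, here is a minimal self-contained sketch of the same steps. The data is inlined from the question via io.StringIO, and the function name max_zero_fee_run is only a hypothetical name chosen for illustration. It assumes a pandas version where level-based aggregation is written as groupby(level=0).max(), and, like the code above, it treats adjacent rows as consecutive days without checking ADM_DATE.

```python
import io
import pandas as pd

# Sample data copied from the question, inlined so the sketch runs on its own.
CSV = """ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
"""


def max_zero_fee_run(df):
    """Hypothetical helper: longest run of adjacent zero-FEES rows per ROLL_NO."""
    mask = df['FEES'].eq(0)
    # Zero-FEES rows keep their ROLL_NO, every other row becomes 0
    r = df['ROLL_NO'] * mask
    # A new label starts whenever the value differs from the previous row
    a = r.ne(r.shift()).cumsum()
    return (df[mask].groupby(['ROLL_NO', a[mask]])
                    .size()                      # length of each zero-FEES run
                    .groupby(level=0).max()      # longest run per ROLL_NO
                    .reindex(df['ROLL_NO'].unique(), fill_value=0)
                    .reset_index(name='MAX_CNT'))


df = pd.read_csv(io.StringIO(CSV))
result = max_zero_fee_run(df)
print(result)
```

Wrapping the steps in a function also avoids overwriting the original df, so the intermediate steps below can still be run against it.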

Explanation:

First compare the FEES column with 0 (eq is the same as ==), then multiply the resulting mask by the ROLL_NO column:

mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0         0
1         0
2         0
3     12345
4     12345
5         0
6     12345
7     12345
8     12345
9         0
10        0
dtype: int64

Get the consecutive groups by comparing the Series r with its shifted version and taking the cumsum:

a = r.ne(r.shift()).cumsum()
print (a)
0     1
1     1
2     1
3     2
4     2
5     3
6     4
7     4
8     4
9     5
10    5
dtype: int32
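
The shift and cumsum trick may be easier to see on a small toy Series. The following is only an illustrative sketch with made-up values (not the question's data): comparing each value with the previous one flags the start of every run, and the running total of those flags gives each run its own label.

```python
import pandas as pd

# Toy values for illustration only, not taken from the question's data
s = pd.Series([0, 0, 7, 7, 0, 7, 7, 7])

# True wherever a value differs from the one before it; the first row is
# always True because shift() leaves NaN there
starts = s.ne(s.shift())

# The running total of the True flags labels each run of equal values
labels = starts.cumsum()

# Length of each run, keyed by its label
run_lengths = s.groupby(labels).size()

print(labels.tolist())       # [1, 1, 2, 2, 3, 4, 4, 4]
print(run_lengths.tolist())  # [2, 2, 1, 3]
```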

Filter only the rows where FEES is 0 and group by with size, filtering a with the same mask so that the indexes match:

print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO   
12345    2    2
         4    3
dtype: int64

Get the max value per first level of the MultiIndex:

print (df[mask].groupby(['ROLL_NO',a[mask]]).size().groupby(level=0).max())
ROLL_NO
12345    3
dtype: int64

Last, add the missing ROLL_NO values by reindex:

print (df[mask].groupby(['ROLL_NO',a[mask]])
              .size()
              .groupby(level=0).max()
              .reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345     3
987654    0
dtype: int64
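
As a small sketch of why fill_value matters here: a ROLL_NO with no zero-FEES rows never appears in the grouped result, so reindex has to add it back. The Series below is built by hand to mimic that intermediate result; the names counts, all_rolls, full and without_fill are only illustrative.

```python
import pandas as pd

# Hand-built stand-in for the per-ROLL_NO maxima: 987654 is absent because
# it has no rows with FEES equal to 0
counts = pd.Series([3], index=pd.Index([12345], name='ROLL_NO'))

# Every ROLL_NO that should appear in the output, in order of first appearance
all_rolls = [12345, 987654]

# Missing labels get 0 instead of NaN, so the counts stay integers
full = counts.reindex(all_rolls, fill_value=0)

# Without fill_value the missing label becomes NaN and the dtype turns to float
without_fill = counts.reindex(all_rolls)

print(full.tolist())  # [3, 0]
```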

Finally, use reset_index to turn the index into a column.

EDIT:

To get the first ID for each ROLL_NO, use drop_duplicates together with insert and map:

r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .groupby(level=0).max()
               .reindex(df['ROLL_NO'].unique(), fill_value=0)
               .reset_index(name='MAX_CNT'))
df1.insert(0, 'ID', df1['ROLL_NO'].map(s)) 
print (df1)
   ID  ROLL_NO  MAX_CNT
0   1    12345        3
1   3   987654        0
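
If you want to double-check the result independently of pandas, the same logic can be sketched in plain Python with itertools.groupby. This is only a cross-check under the same assumption as above (adjacent rows in file order count as consecutive days); the variable names are illustrative.

```python
from itertools import groupby

# (ID, ROLL_NO, FEES) rows copied from the question, in file order
rows = [
    (1, 12345, 500), (2, 12345, 200), (3, 987654, 1000), (4, 12345, 0),
    (5, 12345, 0), (6, 12345, 100), (7, 12345, 0), (8, 12345, 0),
    (9, 12345, 0), (10, 987654, 150), (11, 987654, 300),
]

first_id = {}  # first ID seen for each ROLL_NO
max_cnt = {}   # longest run of adjacent zero-FEES rows per ROLL_NO
for row_id, roll, _ in rows:
    first_id.setdefault(roll, row_id)
    max_cnt.setdefault(roll, 0)

# Adjacent rows sharing the same (ROLL_NO, FEES == 0) pair form one run
for (roll, is_zero), run in groupby(rows, key=lambda r: (r[1], r[2] == 0)):
    if is_zero:
        max_cnt[roll] = max(max_cnt[roll], len(list(run)))

result = [(first_id[roll], roll, max_cnt[roll]) for roll in first_id]
print(result)  # [(1, 12345, 3), (3, 987654, 0)]
```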
