I have the following DataFrame.
+-----------+----------+-----+
| InvoiceNo | ItemCode | Qty |
+-----------+----------+-----+
| Inv-001 | A | 2 |
+-----------+----------+-----+
| Inv-001 | B | 3 |
+-----------+----------+-----+
| Inv-001 | C | 1 |
+-----------+----------+-----+
| Inv-002 | B | 3 |
+-----------+----------+-----+
| Inv-002 | D | 4 |
+-----------+----------+-----+
| Inv-003 | C | 3 |
+-----------+----------+-----+
| Inv-003 | D | 9 |
+-----------+----------+-----+
| Inv-004 | D | 5 |
+-----------+----------+-----+
| Inv-004 | E | 8 |
+-----------+----------+-----+
| Inv-005 | X | 2 |
+-----------+----------+-----+
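My task is to create an additional column `Type` based on the priority of the items that appear in each invoice.
For example, `ItemCode` `A` has 1st priority, `B` has 2nd priority, and `C` has 3rd priority. All remaining items have the least priority and are classified as `Other`.
So, if an invoice contains item `A`, its type should be `Type - A` regardless of which other items are present. Of the remaining invoices, those that contain item `B` should be `Type - B`, and likewise for `C`. If an invoice contains none of `A`, `B` or `C`, its type should be `Type - Other`.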
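To make the rule concrete, here is a minimal sketch of it for a single invoice in plain Python. The function name `invoice_type` and its `priority` parameter are only illustrative; they are not part of my actual code.

```python
def invoice_type(item_codes, priority=("A", "B", "C")):
    """Return the type label for one invoice, given the item codes it contains."""
    present = set(item_codes)
    # Walk the priority list in order; the first code found decides the type.
    for code in priority:
        if code in present:
            return f"Type - {code}"
    # None of the priority items is present.
    return "Type - Other"


print(invoice_type(["B", "D"]))  # Type - B
print(invoice_type(["D", "E"]))  # Type - Other
```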
Below is my desired output.
+-----------+----------+-----+--------------+
| InvoiceNo | ItemCode | Qty | Type |
+-----------+----------+-----+--------------+
| Inv-001 | A | 2 | Type - A |
+-----------+----------+-----+--------------+
| Inv-001 | B | 3 | Type - A |
+-----------+----------+-----+--------------+
| Inv-001 | C | 1 | Type - A |
+-----------+----------+-----+--------------+
| Inv-002 | B | 3 | Type - B |
+-----------+----------+-----+--------------+
| Inv-002 | D | 4 | Type - B |
+-----------+----------+-----+--------------+
| Inv-003 | C | 3 | Type - C |
+-----------+----------+-----+--------------+
| Inv-003 | D | 9 | Type - C |
+-----------+----------+-----+--------------+
| Inv-004 | D | 5 | Type - Other |
+-----------+----------+-----+--------------+
| Inv-004 | E | 8 | Type - Other |
+-----------+----------+-----+--------------+
| Inv-005 | X | 2 | Type - Other |
+-----------+----------+-----+--------------+
Below is my code, which works. However, it is cumbersome and not at all pythonic.
import pandas as pd

# load DataFrame (file path omitted in the original)
df = pd.read_excel()
# filter invoices containing `A`
mask_A = (df['ItemCode'] == 'A').groupby(df['InvoiceNo']).transform('any')
df_A = df[mask_A].copy()  # copy to avoid SettingWithCopyWarning on assignment
df_A['Type'] = 'Type - A'
# from the rest of the data, filter invoices containing `B`
df = df[~mask_A]
mask_B = (df['ItemCode'] == 'B').groupby(df['InvoiceNo']).transform('any')
df_B = df[mask_B].copy()
df_B['Type'] = 'Type - B'
# from the rest of the data, filter invoices containing `C`
df = df[~mask_B]
mask_C = (df['ItemCode'] == 'C').groupby(df['InvoiceNo']).transform('any')
df_C = df[mask_C].copy()
df_C['Type'] = 'Type - C'
# from the rest of the data, keep invoices that contain none of `A`, `B` or `C`
df_Other = df[~mask_C].copy()
df_Other['Type'] = 'Type - Other'
# concatenate all the DataFrames
df = pd.concat([df_A, df_B, df_C, df_Other], axis=0, sort=False)
Now, what is the most efficient and most pythonic way to do this?
I think we can use `Categorical` and then `transform`:
df['Type']=pd.Categorical(df.ItemCode,['A','B','C'],ordered=True)
df['Type']='Type_'+df.groupby('InvoiceNo')['Type'].transform('min').fillna('other')
Update:
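# codes outside ['A', 'B', 'C'] become NaN in the ordered categorical
df['Type'] = pd.Categorical(df.ItemCode, ['A', 'B', 'C'], ordered=True)
# sort so the highest-priority item comes first within each invoice (NaN sorts last)
df = df.sort_values('Type')
# 'first' skips NaN; cast to object so fillna can use a label outside the categories
df['Type'] = 'Type_' + df.groupby('InvoiceNo')['Type'].transform('first').astype(object).fillna('other')
# restore the original row order
df = df.sort_index()
df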
Out[32]:
InvoiceNo ItemCode Qty Type
0 Inv-001 A 2 Type_A
1 Inv-001 B 3 Type_A
2 Inv-001 C 1 Type_A
3 Inv-002 B 3 Type_B
4 Inv-002 D 4 Type_B
5 Inv-003 C 3 Type_C
6 Inv-003 D 9 Type_C
7 Inv-004 D 5 Type_other
8 Inv-004 E 8 Type_other
9 Inv-005 X 2 Type_other
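If the `Categorical` route misbehaves on your pandas version, the same idea can be sketched with plain integer ranks: map each `ItemCode` to its position in the priority list, take the per-invoice minimum with `transform('min')`, and map the result back to a label. This is only a sketch under the assumption that the data matches the sample in the question; the names `priority`, `rank`, `item_rank`, `best` and `labels` are illustrative, and the label format follows the `Type - X` style of the desired output.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "InvoiceNo": ["Inv-001"] * 3 + ["Inv-002"] * 2 + ["Inv-003"] * 2
                 + ["Inv-004"] * 2 + ["Inv-005"],
    "ItemCode": list("ABCBDCDDEX"),
    "Qty": [2, 3, 1, 3, 4, 3, 9, 5, 8, 2],
})

priority = ["A", "B", "C"]                      # highest priority first
rank = {code: i for i, code in enumerate(priority)}
labels = priority + ["Other"]                   # the last slot is the fallback

# Items not in the priority list get the lowest rank, len(priority).
item_rank = df["ItemCode"].map(rank).fillna(len(priority)).astype(int)

# Best (smallest) rank per invoice, broadcast back to every row of that invoice.
best = item_rank.groupby(df["InvoiceNo"]).transform("min")

df["Type"] = "Type - " + best.map(dict(enumerate(labels)))
print(df)
```

Because nothing is sorted or concatenated, the original row order is kept, and extending the rule only requires adding codes to `priority`.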