我有以下处理表(表1):
Fuse Ident Grade
A1 BLU123 skyline
A1 RED235 blue
A1 RED345 ortho
B1 RED160 linx
B1 BLU760 milli
B2 BLU222 moli
B2 RED201 straw
C1 RED201 straw
C2 BLU222 moli
D1 RED235 blue
D1 BLU123 skyline
E1 NA NA NA
E2 NA NA NA
F1 BLU999 monte
F1 BLU23 nome
我正在尝试处理上面的表以获得一个最终表(表2),如下所示:
Option1:
Grade1 Grade2 row#
blue skyline 2
ortho skyline 1
linx milli 1
straw moli 2
选项2:
Grade1 Grade2 row#
blue;ortho skyline 1
linx milli 1
straw moli 2
下表解释了如何得到上面的最终表:
RED BLU Grade1 Grade2
RED235 BLU123 blue skyline
RED345 BLU123 ortho skyline
RED160 BLU760 linx milli
RED201 BLU222 blue skyline
专栏"Fuse"表1中有多个特定值的条目(例如,A1)。在"A1"对于列"Fuse"的表子集,对应的"Ident"列必须至少有一个以"BLU"开头的条目。和";RED"为记录Grade1和Grade2列的值创建一个完整的集合。两套完全可以从"A1"子集(RED235 BLU123;RED345 BLU123)。最后,如表2所示,记录一、二级合并列的出现次数。其他条件包括:列的空行"融合"one_answers";Ident",当没有以"BLU"开头的字符串的互补集时,将被忽略。和";RED"在"标识"列中,行数为
这是我的尝试:
grp=df.groupby(["Fuse","Ident"])
[grp.get_group(x) for x in grp.groups]
我的代码不工作。有人能帮我写个更好的代码吗?由于
count聚合次数
out = (df
# extract BLU/RED as col (other methods are possible)
.assign(group=df['Ident'].str[:3])
.groupby(['Fuse', 'group'])['Grade'].agg(';'.join) # aggregate multiple occurrences
.unstack().value_counts() # count values
.reset_index(name='count')
)
输出:
BLU RED count
0 milli linx 1
1 moli straw 1
2 skyline blue 1
3 skyline blue;ortho 1
老回答h5> 可以使用pivot
作为主步骤:
out = (df
# extract BLU/RED as col (other methods are possible)
.assign(col=df['Ident'].str[:3],
# get position of each BLU/RED per group
idx=lambda d: d.groupby(['Fuse', 'col']).cumcount()
)
.pivot(index=['Fuse', 'idx'], columns='col') # pivot using new columns
.dropna(how='all', axis=1) # drop empty columns
.dropna(how='any', axis=0) # drop rows with missing values
.droplevel('idx') # drop intermediate level
)
out = (df
# extract BLU/RED as col (other methods are possible)
.assign(col=df['Ident'].str[:3],
# get position of each BLU/RED per group
idx=lambda d: d.groupby(['Fuse', 'col']).cumcount()
)
.pivot(index=['Fuse', 'idx'], columns='col') # pivot using new columns
.dropna(how='all', axis=1) # drop empty columns
.dropna(how='any', axis=0) # drop rows with missing values
.droplevel('idx') # drop intermediate level
)
输出:
Ident Grade
col BLU RED BLU RED
Fuse
A1 BLU123 RED235 skyline blue
B1 BLU760 RED160 milli linx
B2 BLU222 RED201 moli straw
D1 BLU123 RED235 skyline blue
然后你可以计算这些值:
out['Grade'].value_counts()
输出:
BLU RED
skyline blue 2
milli linx 1
moli straw 1