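First of all, apologies for the long description, but I want everyone to understand the problem I am facing.
I am working on a detection model that predicts 14 different pathologies, and I have written an inference script that can predict on any new test image. The test set contains a little over 25k images. I have computed predictions for all of them and saved them in a file that looks like the DataFrame below.
Each row of this DataFrame contains the following (some information to help you understand my scenario):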
image_name______00000003_000.png
label_____[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box_____True/False
Atelectasis _____0.172639399766922
Cardiomegaly _____0.064461663365364
Consolidation _____0.436323910951614
Edema _____0.152604594826698
Effusion _____0.077432356774807
Emphysema _____0.569778263568878
Fibrosis _____0.333310723304749
Hernia _____0.219542726874351
Infiltration _____0.240452200174332
Mass _____0.291741400957108
Nodule _____0.076222963631153
Pleural_Thickening_____ 0.294208467006683
Pneumonia _____0.281939893960953
Pneumothorax _____0.386653006076813
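What I want: I think there are two ways to approach this. For example, take the rows for each class separately: first find all rows that contain `Cardiomegaly`, whether as a single label or as one of several labels, then apply the steps below, or whatever your expertise suggests, to find the TPs.
What I am after is this: each image has a ground truth such as `['Cardiomegaly', 'Edema', 'Infiltration']` and probabilities for all 14 pathologies. If one of the actual labels has the highest probability, I want to count it as a True Positive. For example, if `Cardiomegaly` has the highest probability, create a new column and set it to `True`. I do not know what to do for multiple labels: after the first label is found, what should I do for the second `label`, and how should I proceed if it has the highest probability? With help from @tlentali, this is my latest attempt. Thank you for your help. Here is what I did: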
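import pandas as pd
df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
# Name of the highest-scoring pathology column in each row
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
# True when the top-scoring class name appears in the ground-truth label
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
# Fraction of correct top-1 predictions, grouped by predicted class
df.groupby('best_score')['evaluation'].mean()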
This gives me:
best_score
Atelectasis 0.452465
Cardiomegaly 0.250000
Consolidation 0.123164
Edema 0.029520
Effusion 0.555459
Emphysema 0.068618
Fibrosis 0.066116
Hernia 0.032258
Infiltration 0.400000
Mass 0.177524
Nodule 0.604167
Pleural_Thickening 0.188482
Pneumonia 0.049133
Pneumothorax 0.108156
Name: evaluation, dtype: float64
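This is not what I want: it only handles a single label, not multiple labels. Please help me. Again, sorry for the long description, I just wanted everyone to understand what I need. Thank you.
Starting from the DataFrame: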
>>> import pandas as pd
>>> df
file set label bbx Atelectasis Cardiomegaly Consolidation Edema Effusion Emphysema Fibrosis Hernia Infiltration Mass Nodule Pleural_Thickening Pneumonia Pneumothorax
0 00000003_000.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.145712 0.028958 0.205006 0.055228 0.115680 0.376638 0.349124 0.357694 0.122496 0.202218 0.075018 0.118994 0.195345 0.215577
1 00000003_001.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.132639 0.046136 0.169713 0.092743 0.285383 0.614464 0.311035 0.344040 0.117032 0.447748 0.152327 0.094364 0.174125 0.316022
2 00000003_002.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.233026 0.042541 0.227911 0.047988 0.116835 0.595102 0.330304 0.367272 0.117985 0.298624 0.109354 0.133473 0.185444 0.379627
3 00000003_003.png Test [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.... False 0.298693 0.022646 0.237977 0.035348 0.143645 0.487804 0.384509 0.379062 0.083205 0.625744 0.102377 0.207353 0.184517 0.354402
4 00000003_004.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.522152 0.052897 0.237475 0.082139 0.200029 0.473421 0.377468 0.336104 0.106339 0.488078 0.088047 0.146686 0.200919 0.313684
First, we `eval` the `label` column to extract the classes we expect the model to predict:
>>> df['label'] = df['label'].apply(eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df['class']
0 [Hernia]
1 [Hernia]
2 [Hernia]
3 [Hernia, Infiltration]
4 [Hernia]
5 [Hernia]
6 [Hernia]
7 [Hernia]
8 [No Finding]
9 [Emphysema, Pneumothorax]
10 [Emphysema, Pneumothorax]
11 [Pleural_Thickening]
12 [Effusion, Emphysema, Infiltration, Pneumothorax]
13 [Emphysema, Infiltration, Pleural_Thickening, ...
14 [Effusion, Infiltration]
15 [Infiltration]
Name: class, dtype: object
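Since `eval` executes whatever the string contains, a safer way to parse the `label` column may be `ast.literal_eval`, which only accepts Python literals. The sketch below is a minimal illustration on a made-up two-row frame (`df_toy` is a hypothetical name, not part of the original data), assuming the labels are stored as the string form of `[boxes, class_names]`:

```python
import ast
import pandas as pd

# Hypothetical toy frame: `label` holds the string form of [boxes, class_names].
df_toy = pd.DataFrame({
    "file": ["00000003_000.png", "00000003_003.png"],
    "label": [
        "[[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]",
        "[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0]], "
        "['Hernia', 'Infiltration']]",
    ],
})

# literal_eval parses lists, strings and numbers but never runs arbitrary code.
parsed = df_toy["label"].apply(ast.literal_eval)

# The second element of each parsed label is the list of class names.
df_toy["class"] = parsed.apply(lambda x: x[1])
print(df_toy["class"].tolist())
```

The rest of the steps should work unchanged, because the resulting `class` column has the same structure as the one produced with `eval`.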
Then we `explode` the `class` column so that each expected class gets its own row:
>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0 Hernia
1 Hernia
2 Hernia
3 Hernia
4 Infiltration
5 Hernia
6 Hernia
7 Hernia
8 Hernia
9 No Finding
10 Emphysema
11 Pneumothorax
12 Emphysema
13 Pneumothorax
14 Pleural_Thickening
15 Effusion
16 Emphysema
17 Infiltration
18 Pneumothorax
19 Emphysema
20 Infiltration
21 Pleural_Thickening
22 Pneumothorax
23 Effusion
24 Infiltration
25 Infiltration
Name: class, dtype: object
Then we convert the data to dummies (one-hot) format:
>>> classes = ['Atelectasis',
... 'Cardiomegaly',
... 'Consolidation',
... 'Edema',
... 'Effusion',
... 'Emphysema',
... 'Fibrosis',
... 'Hernia',
... 'Infiltration',
... 'Mass',
... 'Nodule',
... 'Pleural_Thickening',
... 'Pneumonia',
... 'Pneumothorax',
... 'No Finding']
>>> s = df['class']
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
>>> df_classes.head()
Effusion Emphysema Hernia Infiltration No Finding Pleural_Thickening Pneumothorax
0 0 0 1 0 0 0 0
1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 0 1 0 0 0
Because we are working with a toy dataset here, a few adjustments are needed so that every class we want is available in the dummies format:
>>> df_classes['Atelectasis'] = 0
>>> df_classes['Cardiomegaly'] = 0
>>> df_classes['Consolidation'] = 0
>>> df_classes['Edema'] = 0
>>> df_classes['Fibrosis'] = 0
>>> df_classes['Mass'] = 0
>>> df_classes['Nodule'] = 0
>>> df_classes['Pneumonia'] = 0
>>> df['No Finding'] = 0
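As an alternative to adding the missing columns by hand, one option is scikit-learn's `MultiLabelBinarizer` with an explicit `classes` list, applied to the label lists before exploding. This is only a sketch under that assumption, with a hypothetical `labels` series standing in for the un-exploded `class` column. It keeps one row per image, so a multi-label image gets several 1s in the same row:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

classes = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
           'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration',
           'Mass', 'Nodule', 'Pleural_Thickening', 'Pneumonia',
           'Pneumothorax', 'No Finding']

# Hypothetical stand-in for the un-exploded `class` column (one list per image).
labels = pd.Series([["Hernia"], ["Hernia", "Infiltration"], ["No Finding"]])

# Passing `classes` fixes the column order and creates all-zero columns
# for classes that never occur in this sample.
mlb = MultiLabelBinarizer(classes=classes)
y_true = pd.DataFrame(mlb.fit_transform(labels),
                      columns=mlb.classes_, index=labels.index)
print(y_true[["Hernia", "Infiltration", "No Finding", "Mass"]])
```

With this layout, `y_true[classes].to_numpy()` lines up row for row with the score columns of the original, un-exploded DataFrame.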
Now we can use `sklearn` to get the TPR and, finally, the AUC:
from sklearn.metrics import roc_curve, auc
n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
Now we can look at the values in `roc_auc`. The `nan` entries occur because not every class appears in the labels of the toy dataset:
>>> roc_auc
{0: nan,
1: nan,
2: nan,
3: nan,
4: 0.3125,
5: 0.7613636363636364,
6: nan,
7: 0.9479166666666666,
8: 0.6190476190476191,
9: nan,
10: nan,
11: 0.30208333333333337,
12: nan,
13: 0.7840909090909091,
14: 0.5,
'micro': 0.66562764158918}
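The ROC AUC is only defined for a class when the ground truth contains at least one positive and one negative sample. One way to make this explicit is to check each column before scoring it. The helper below, `per_class_auc`, is a hypothetical name, and the arrays are made-up values for illustration only:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_auc(y_true, y_score, class_names):
    """Return {class_name: AUC}, with NaN where the AUC is undefined."""
    result = {}
    for i, name in enumerate(class_names):
        column = y_true[:, i]
        # AUC needs both a positive and a negative example for this class.
        if np.unique(column).size < 2:
            result[name] = float("nan")
        else:
            result[name] = roc_auc_score(column, y_score[:, i])
    return result

# Made-up data: "Hernia" has positives, "Mass" has none.
y_true = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.4], [0.8, 0.3], [0.3, 0.1]])
aucs = per_class_auc(y_true, y_score, ["Hernia", "Mass"])
print(aucs)
```

Keying the result by class name rather than by index also makes it easier to see which pathologies could not be evaluated.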
Now we can plot the ROC curve for a class from its TPR and FPR (note the `classe` variable: because we are working with a toy dataset, some classes are empty):
import matplotlib.pyplot as plt
plt.figure()
lw = 2
classe = 7
plt.plot(fpr[classe], tpr[classe], color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()
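Coming back to the true-positive column asked about in the question, one possible reading for multi-label rows is: for an image with k ground-truth labels, count a label as found when it is among the k highest-scoring classes. This is an assumption about what is wanted, not an established definition. The shortened class list, the scores, and the names `df_toy`, `tp_labels` and `all_found` below are all made up for illustration:

```python
import pandas as pd

# Shortened class list and made-up scores, for illustration only.
classes = ["Cardiomegaly", "Edema", "Infiltration", "Mass"]
df_toy = pd.DataFrame({
    "class": [["Cardiomegaly", "Edema"], ["Mass"]],
    "Cardiomegaly": [0.70, 0.10],
    "Edema": [0.20, 0.60],
    "Infiltration": [0.50, 0.20],
    "Mass": [0.10, 0.90],
})

def true_positive_labels(row, class_names):
    """Ground-truth labels that fall inside the row's top-k scores,
    where k is the number of scored ground-truth labels of that row."""
    true_labels = [c for c in row["class"] if c in class_names]
    scores = row[class_names].astype(float)
    top_k = set(scores.nlargest(len(true_labels)).index)
    return [label for label in true_labels if label in top_k]

tp = [true_positive_labels(row, classes) for _, row in df_toy.iterrows()]
df_toy["tp_labels"] = tp
# True only when every ground-truth label of the row was found.
df_toy["all_found"] = [
    len(found) == len(labels) for found, labels in zip(tp, df_toy["class"])
]
print(df_toy[["class", "tp_labels", "all_found"]])
```

In the first row only `Cardiomegaly` is found, because `Infiltration` outscores `Edema`. In the second row `Mass` has the top score, so the row is fully matched. A fixed probability threshold per class would be another reasonable criterion.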