How to find only the true positives in a DataFrame when the ground truth is available

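First of all, sorry for the long description, but I want everyone to understand the problem I am facing.

I am working on a detection model that predicts 14 different pathologies, and I have written an inference script that can predict on any new test image. The dataset has roughly 25k+ test images. I have computed predictions for all of them and saved the results in a file similar to the DataFrame below.

In this DataFrame I have (some information to help you understand my scenario):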

image_name______00000003_000.png
label_____[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box_____True/False
Atelectasis _____0.172639399766922
Cardiomegaly _____0.064461663365364
Consolidation _____0.436323910951614
Edema _____0.152604594826698
Effusion _____0.077432356774807
Emphysema _____0.569778263568878
Fibrosis _____0.333310723304749
Hernia _____0.219542726874351
Infiltration _____0.240452200174332
Mass _____0.291741400957108
Nodule _____0.076222963631153
Pleural_Thickening_____ 0.294208467006683
Pneumonia _____0.281939893960953
Pneumothorax _____0.386653006076813

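What I want: there are two ways we could approach this. For example, get the rows for each individual class, i.e. first find all rows whose label contains Cardiomegaly, whether single-label or multi-label.

Then apply the operation below, or whatever your expertise suggests, to find the TPs.

What I want is this: for an image whose ground truth is something like ['Cardiomegaly', 'Edema', 'Infiltration'] and which has 14 pathology probabilities, I want to mark a True Positive if those actual labels have the highest probability values.

For example, for Cardiomegaly, if it has the highest probability, create a new column and put True in it. I don't know what to do for multiple labels: after checking the first label, what should I do for the second one, and how do I handle it if it has the highest probability? With @tlentali's help I made this latest attempt. Thanks for the help. Here is what I did: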

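import pandas as pd
df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
# Name of the class with the highest predicted probability in each row
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
# True when the top-scoring class name appears in the ground-truth label
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
# Fraction of correct top-1 predictions, grouped by predicted class
df.groupby('best_score')['evaluation'].mean()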

This gives me:

best_score
Atelectasis           0.452465
Cardiomegaly          0.250000
Consolidation         0.123164
Edema                 0.029520
Effusion              0.555459
Emphysema             0.068618
Fibrosis              0.066116
Hernia                0.032258
Infiltration          0.400000
Mass                  0.177524
Nodule                0.604167
Pleural_Thickening    0.188482
Pneumonia             0.049133
Pneumothorax          0.108156
Name: evaluation, dtype: float64

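This is not what I want: it only handles a single label, not multiple labels. Please help me, and sorry again for the long description; I just want everyone to understand what I need. Thanks.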

Starting from this DataFrame:

>>> import pandas as pd
>>> df
file    set     label                                        bbx    Atelectasis Cardiomegaly    Consolidation   Edema   Effusion    Emphysema   Fibrosis    Hernia  Infiltration    Mass    Nodule  Pleural_Thickening  Pneumonia   Pneumothorax
0   00000003_000.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.145712    0.028958    0.205006    0.055228    0.115680    0.376638    0.349124    0.357694    0.122496    0.202218    0.075018    0.118994    0.195345    0.215577
1   00000003_001.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.132639    0.046136    0.169713    0.092743    0.285383    0.614464    0.311035    0.344040    0.117032    0.447748    0.152327    0.094364    0.174125    0.316022
2   00000003_002.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.233026    0.042541    0.227911    0.047988    0.116835    0.595102    0.330304    0.367272    0.117985    0.298624    0.109354    0.133473    0.185444    0.379627
3   00000003_003.png    Test    [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024....   False   0.298693    0.022646    0.237977    0.035348    0.143645    0.487804    0.384509    0.379062    0.083205    0.625744    0.102377    0.207353    0.184517    0.354402
4   00000003_004.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.522152    0.052897    0.237475    0.082139    0.200029    0.473421    0.377468    0.336104    0.106339    0.488078    0.088047    0.146686    0.200919    0.313684

First, we eval the label column in order to extract the classes we expect to be predicted:

>>> df['label'] = df['label'].apply(eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df['class']
0                                              [Hernia]
1                                              [Hernia]
2                                              [Hernia]
3                                [Hernia, Infiltration]
4                                              [Hernia]
5                                              [Hernia]
6                                              [Hernia]
7                                              [Hernia]
8                                          [No Finding]
9                             [Emphysema, Pneumothorax]
10                            [Emphysema, Pneumothorax]
11                                 [Pleural_Thickening]
12    [Effusion, Emphysema, Infiltration, Pneumothorax]
13    [Emphysema, Infiltration, Pleural_Thickening, ...
14                             [Effusion, Infiltration]
15                                       [Infiltration]
Name: class, dtype: object
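
Since `eval` executes whatever expression the string contains, a more cautious way to parse the label strings is `ast.literal_eval`, which only accepts Python literals. The following is a minimal sketch on a hypothetical two-row frame; the data is made up to mirror the label format shown above, and `toy` is just an illustrative name:

```python
import ast
import pandas as pd

# Hypothetical toy rows that mimic the string format of the `label` column.
toy = pd.DataFrame({
    "label": [
        "[[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]",
        "[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0]], "
        "['Hernia', 'Infiltration']]",
    ]
})

# literal_eval parses lists, strings and numbers but refuses arbitrary expressions.
toy["label"] = toy["label"].apply(ast.literal_eval)
# The second element of each parsed label is the list of class names.
toy["class"] = toy["label"].apply(lambda x: x[1])
print(toy["class"].tolist())
```

The rest of the steps work the same way regardless of which parser is used.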

Then we explode the class column so that each row holds a single expected class:

>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0                 Hernia
1                 Hernia
2                 Hernia
3                 Hernia
4           Infiltration
5                 Hernia
6                 Hernia
7                 Hernia
8                 Hernia
9             No Finding
10             Emphysema
11          Pneumothorax
12             Emphysema
13          Pneumothorax
14    Pleural_Thickening
15              Effusion
16             Emphysema
17          Infiltration
18          Pneumothorax
19             Emphysema
20          Infiltration
21    Pleural_Thickening
22          Pneumothorax
23              Effusion
24          Infiltration
25          Infiltration
Name: class, dtype: object
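
It may help to see what `explode` does to the other columns: an image with k true labels becomes k rows, each carrying the same file name and the same scores. A small sketch on a hypothetical frame (the file names and scores are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame: one row per image, a list of true classes, one score column.
toy = pd.DataFrame({
    "file": ["a.png", "b.png"],
    "class": [["Hernia"], ["Hernia", "Infiltration"]],
    "Hernia": [0.36, 0.38],
})

# Each list element gets its own row; the other columns are repeated.
exploded = toy.explode("class").reset_index(drop=True)
print(exploded)
```

So after this step the evaluation is done per (image, label) pair rather than per image.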

Then we convert the data to dummies (one-hot) format:

>>> classes = ['Atelectasis', 
...            'Cardiomegaly',
...            'Consolidation', 
...            'Edema', 
...            'Effusion', 
...            'Emphysema', 
...            'Fibrosis', 
...            'Hernia',
...            'Infiltration', 
...            'Mass', 
...            'Nodule', 
...            'Pleural_Thickening', 
...            'Pneumonia',
...            'Pneumothorax',
...            'No Finding']
>>> s = df['class']
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
>>> df_classes.head()
Effusion    Emphysema   Hernia  Infiltration    No Finding  Pleural_Thickening  Pneumothorax
0   0           0           1       0               0           0                   0
1   0           0           1       0               0           0                   0
2   0           0           1       0               0           0                   0
3   0           0           1       0               0           0                   0
4   0           0           0       1               0           0                   0

Since we are currently working with a toy dataset, we need a few adjustments so that every class we want is available in dummy format:

>>> df_classes['Atelectasis'] = 0 
>>> df_classes['Cardiomegaly'] = 0 
>>> df_classes['Consolidation'] = 0 
>>> df_classes['Edema'] = 0 
>>> df_classes['Fibrosis'] = 0 
>>> df_classes['Mass'] = 0 
>>> df_classes['Nodule'] = 0 
>>> df_classes['Pneumonia'] = 0 
>>> df['No Finding'] = 0
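
As an alternative to adding the missing columns one by one, `reindex` can fill in every class that is absent from the toy data in a single call. This is only a sketch under the assumption that the labels have already been exploded to one class per row; the three-row series `s` below is hypothetical:

```python
import pandas as pd

classes = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
           'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration',
           'Mass', 'Nodule', 'Pleural_Thickening', 'Pneumonia',
           'Pneumothorax', 'No Finding']

# Hypothetical exploded labels, one class per row.
s = pd.Series(["Hernia", "Infiltration", "No Finding"], name="class")

# One-hot encode, then force the full class list as columns;
# classes that never occur become all-zero columns.
df_classes = pd.get_dummies(s, dtype=int).reindex(columns=classes, fill_value=0)
print(df_classes)
```

This also fixes the column order to match `classes`, which keeps the truth matrix aligned with the score matrix.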

Now we can use sklearn to get the TPR and, finally, the AUC:

from sklearn.metrics import roc_curve, auc

n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

Now we can look at the values in roc_auc. The nan values appear because not every class is present in the toy dataset:

>>> roc_auc
{0: nan,
1: nan,
2: nan,
3: nan,
4: 0.3125,
5: 0.7613636363636364,
6: nan,
7: 0.9479166666666666,
8: 0.6190476190476191,
9: nan,
10: nan,
11: 0.30208333333333337,
12: nan,
13: 0.7840909090909091,
14: 0.5,
'micro': 0.66562764158918}
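
The nan entries come from classes with no positive samples, for which the ROC AUC is undefined. One way to avoid them, sketched here on hypothetical three-class arrays rather than on the data above, is to skip any class whose ground-truth column contains only one distinct value:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: 3 classes, class index 2 never occurs as a positive.
y_test = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.2],
                    [0.6, 0.4, 0.3],
                    [0.7, 0.5, 0.1]])

roc_auc = {}
for i in range(y_test.shape[1]):
    # ROC AUC needs both positives and negatives in the ground truth.
    if len(np.unique(y_test[:, i])) < 2:
        continue
    roc_auc[i] = roc_auc_score(y_test[:, i], y_score[:, i])
print(roc_auc)
```

Classes that are skipped simply do not appear in the resulting dictionary, so any later averaging only covers classes that can actually be evaluated.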

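Now we can plot the ROC curve of each class from its TPR and FPR (note the classe variable here: because we are working with a toy dataset, some classes are empty):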

import matplotlib.pyplot as plt

plt.figure()
lw = 2
classe = 7
plt.plot(fpr[classe], tpr[classe], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()
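
Coming back to the multi-label part of the question, one possible reading of "the actual labels have the highest probability" is a top-k rule: for an image with k ground-truth labels, a label counts as a true positive when it is among the k highest-scoring classes. That rule is an assumption made for illustration and is not confirmed by the question; the helper name `true_positive_labels`, the three score columns and the toy values are hypothetical as well.

```python
import pandas as pd

# Hypothetical toy frame: per-image true labels plus scores for three classes.
score_cols = ["Cardiomegaly", "Edema", "Infiltration"]
toy = pd.DataFrame({
    "class": [["Cardiomegaly"], ["Cardiomegaly", "Edema"], ["Edema", "Infiltration"]],
    "Cardiomegaly": [0.7, 0.6, 0.5],
    "Edema": [0.2, 0.1, 0.4],
    "Infiltration": [0.1, 0.3, 0.1],
})

def true_positive_labels(row):
    """Return the true labels that rank among the top-k scores (k = number of true labels)."""
    # Ignore labels without a score column, e.g. 'No Finding'.
    labels = [c for c in row["class"] if c in score_cols]
    if not labels:
        return []
    top_k = row[score_cols].astype(float).nlargest(len(labels)).index
    return [c for c in labels if c in top_k]

toy["tp_labels"] = [true_positive_labels(row) for _, row in toy.iterrows()]
# An image is fully correct only if every true label made it into the top-k.
toy["all_tp"] = [len(tp) == len(c) for tp, c in zip(toy["tp_labels"], toy["class"])]
print(toy[["class", "tp_labels", "all_tp"]])
```

With this rule, a single-label image reduces to the top-1 check from the question, while a multi-label image can be partially correct, which is why the sketch keeps both the per-label list and the per-image flag.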
