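I am trying to compare the data in two DataFrames to find rows whose value under the "A-Score" header is missing or different. However, my script gives somewhat unexpected results. How can I fix this?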
import pandas as pd
print(df1)
Ensembl_ID length score identity p_value R-Site
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 110
1 ENSG00000000457 208 42.98 92.857 4.390000e-34 133
2 ENSG00000000457 208 42.98 92.857 4.390000e-34 149
3 ENSG00000000457 208 42.98 92.857 4.390000e-34 164
4 ENSG00000000460 349 56.10 100.000 0.000000e+00 90
5 ENSG00000000460 349 56.10 100.000 0.000000e+00 168
6 ENSG00000000460 349 56.10 100.000 0.000000e+00 187
7 ENSG00000000460 349 56.10 100.000 0.000000e+00 297
8 ENSG00000000460 349 56.10 100.000 0.000000e+00 317
9 ENSG00000000460 349 56.10 100.000 0.000000e+00 336
10 ENSG00000004399 656 130.45 100.000 0.000000e+00 134
11 ENSG00000004399 656 130.45 100.000 0.000000e+00 151
12 ENSG00000004399 656 130.45 100.000 0.000000e+00 153
13 ENSG00000004399 656 130.45 100.000 0.000000e+00 204
14 ENSG00000004399 656 130.45 100.000 0.000000e+00 290
15 ENSG00000004399 656 130.45 100.000 0.000000e+00 298
16 ENSG00000004399 656 130.45 100.000 0.000000e+00 342
17 ENSG00000004399 656 130.45 100.000 0.000000e+00 362
18 ENSG00000004399 656 130.45 100.000 0.000000e+00 431
19 ENSG00000004399 656 130.45 100.000 0.000000e+00 434
20 ENSG00000004399 656 130.45 100.000 0.000000e+00 514
21 ENSG00000004399 656 130.45 100.000 0.000000e+00 516
22 ENSG00000004399 656 130.45 100.000 0.000000e+00 556
23 ENSG00000004399 656 130.45 100.000 0.000000e+00 576
R-PercentPosition R-Score
0 52.884615 0.147
1 63.942308 0.040
2 71.634615 0.105
3 78.846154 0.063
4 25.787966 0.711
5 48.137536 0.094
6 53.581662 0.252
7 85.100287 0.726
8 90.830946 0.024
9 96.275072 0.001
10 20.426829 0.015
11 23.018293 0.017
12 23.323171 0.528
13 31.097561 0.044
14 44.207317 0.008
15 45.426829 0.111
16 52.134146 0.382
17 55.182927 0.042
18 65.701220 0.002
19 66.158537 0.001
20 78.353659 0.014
21 78.658537 0.872
22 84.756098 0.243
23 87.347561 0.115
print(df2)
Ensembl_ID length score identity p_value A-Site
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 133
1 ENSG00000000457 208 42.98 92.857 4.390000e-34 149
2 ENSG00000000457 208 42.98 92.857 4.390000e-34 164
3 ENSG00000000460 349 56.61 100.000 0.000000e+00 90
4 ENSG00000000460 349 56.61 100.000 0.000000e+00 168
5 ENSG00000000460 349 56.61 100.000 0.000000e+00 187
6 ENSG00000000460 349 56.61 100.000 0.000000e+00 297
7 ENSG00000000460 349 56.61 100.000 0.000000e+00 317
8 ENSG00000000460 349 56.61 100.000 0.000000e+00 336
9 ENSG00000004399 656 131.30 100.000 0.000000e+00 134
10 ENSG00000004399 656 131.30 100.000 0.000000e+00 151
11 ENSG00000004399 656 131.30 100.000 0.000000e+00 153
12 ENSG00000004399 656 131.30 100.000 0.000000e+00 204
13 ENSG00000004399 656 131.30 100.000 0.000000e+00 290
14 ENSG00000004399 656 131.30 100.000 0.000000e+00 298
15 ENSG00000004399 656 131.30 100.000 0.000000e+00 342
16 ENSG00000004399 656 131.30 100.000 0.000000e+00 362
17 ENSG00000004399 656 131.30 100.000 0.000000e+00 431
18 ENSG00000004399 656 131.30 100.000 0.000000e+00 434
19 ENSG00000004399 656 131.30 100.000 0.000000e+00 514
20 ENSG00000004399 656 131.30 100.000 0.000000e+00 516
21 ENSG00000004399 656 131.30 100.000 0.000000e+00 556
22 ENSG00000004399 656 131.30 100.000 0.000000e+00 573
A-PercentPosition A-Score
0 63.942308 0.040
1 71.634615 0.105
2 78.846154 0.063
3 25.787966 0.711
4 48.137536 0.094
5 53.581662 0.252
6 85.100287 0.726
7 90.830946 0.024
8 96.275072 0.001
9 20.426829 0.251
10 23.018293 0.148
11 23.323171 0.021
12 31.097561 0.099
13 44.207317 0.070
14 45.426829 0.065
15 52.134146 0.115
16 55.182927 0.024
17 65.701220 0.425
18 66.158537 0.413
19 78.353659 0.469
20 78.658537 0.519
21 84.756098 0.506
22 87.347561 0.169
df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])
print(df1)
Ensembl_ID length score identity p_value R-Site
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 110
1 ENSG00000000457 208 42.98 92.857 4.390000e-34 133
2 ENSG00000000457 208 42.98 92.857 4.390000e-34 149
3 ENSG00000000457 208 42.98 92.857 4.390000e-34 164
4 ENSG00000000460 349 56.10 100.000 0.000000e+00 90
5 ENSG00000000460 349 56.10 100.000 0.000000e+00 168
6 ENSG00000000460 349 56.10 100.000 0.000000e+00 187
7 ENSG00000000460 349 56.10 100.000 0.000000e+00 297
8 ENSG00000000460 349 56.10 100.000 0.000000e+00 317
9 ENSG00000000460 349 56.10 100.000 0.000000e+00 336
10 ENSG00000004399 656 130.45 100.000 0.000000e+00 134
11 ENSG00000004399 656 130.45 100.000 0.000000e+00 151
12 ENSG00000004399 656 130.45 100.000 0.000000e+00 153
13 ENSG00000004399 656 130.45 100.000 0.000000e+00 204
14 ENSG00000004399 656 130.45 100.000 0.000000e+00 290
15 ENSG00000004399 656 130.45 100.000 0.000000e+00 298
16 ENSG00000004399 656 130.45 100.000 0.000000e+00 342
17 ENSG00000004399 656 130.45 100.000 0.000000e+00 362
18 ENSG00000004399 656 130.45 100.000 0.000000e+00 431
19 ENSG00000004399 656 130.45 100.000 0.000000e+00 434
20 ENSG00000004399 656 130.45 100.000 0.000000e+00 514
21 ENSG00000004399 656 130.45 100.000 0.000000e+00 516
22 ENSG00000004399 656 130.45 100.000 0.000000e+00 556
23 ENSG00000004399 656 130.45 100.000 0.000000e+00 573
R-PercentPosition R-Score compare_Scores
0 52.884615 0.147 False
1 63.942308 0.040 True
2 71.634615 0.105 True
3 78.846154 0.063 True
4 25.787966 0.711 True
5 48.137536 0.094 True
6 53.581662 0.252 True
7 85.100287 0.726 True
8 90.830946 0.024 True
9 96.275072 0.001 True
10 20.426829 0.015 False
11 23.018293 0.017 False
12 23.323171 0.528 False
13 31.097561 0.044 False
14 44.207317 0.008 False
15 45.426829 0.111 False
16 52.134146 0.382 False
17 55.182927 0.042 False
18 65.701220 0.002 False
19 66.158537 0.001 True
20 78.353659 0.014 False
21 78.658537 0.872 False
22 84.756098 0.243 False
23 87.347561 0.115 True
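In the result, as expected, row 0 shows False, because df2 has no R-Site value of 110.
But in rows 19 and 23 the R-Score values differ between df1 and df2, and yet the result shows True.
Is there a better way to find the differences between df1 and df2 based on the values in the "R-Score" column?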
I don't think you are doing what you think you are doing.
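By writing df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score']), you take each value in the R-Score column of df1 (in your example you talk about the R-Site value 110, but you are using R-Score, not R-Site) and check whether that value exists anywhere in the A-Score column of df2, not necessarily at the same row index. So for row 19, the R-Score is 0.001, and that value appears in the A-Score column of df2 at row 8, so the answer is True.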
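The difference between a membership test and a row-by-row test can be seen on a few toy values. This is only a minimal sketch: the numbers and the variable names `r_score` and `a_score` are made up for illustration and are not the full data from the question.

```python
import pandas as pd

# Toy values (hypothetical, not the full data from the question).
r_score = pd.Series([0.147, 0.040, 0.001, 0.015], name="R-Score")
a_score = pd.Series([0.040, 0.105, 0.001, 0.147], name="A-Score")

# isin asks: "does this value occur anywhere in a_score?"
membership = r_score.isin(a_score)

# A positional test asks: "is the value in the same row equal?"
positional = pd.Series(r_score.to_numpy() == a_score.to_numpy())

print(pd.DataFrame({"isin": membership, "same_row": positional}))
```

The first row is True under isin because 0.147 occurs somewhere in `a_score`, even though the value in the same row is different.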
If what you want is False when row x of df1['R-Score'] differs from the same row x of df2['A-Score'], and True otherwise, then you can do something like df1['compare_Scores'] = df1['R-Score'] == df2['A-Score'].
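Note that for this to work, the indexes of df1 and df2 need to be aligned, which is not the case in your example (df1 has 24 rows indexed 0 to 23, while df2 has 23 rows indexed 0 to 22). With indexes that differ like this, == raises a ValueError instead of comparing the rows.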
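Since the two frames have different lengths, one possible way to line the rows up is to join them on key columns rather than on the index. The sketch below assumes that a row in df1 corresponds to the row in df2 with the same Ensembl_ID and the same site (R-Site matched to A-Site); that pairing is an assumption about the intent, not something the question states. It uses a small subset of the question's data, and the `status` column is a name introduced here for illustration.

```python
import pandas as pd

# Small subset of the question's data; only the columns needed here.
df1 = pd.DataFrame({
    "Ensembl_ID": ["ENSG00000000457"] * 2 + ["ENSG00000004399"] * 2,
    "R-Site": [110, 133, 134, 434],
    "R-Score": [0.147, 0.040, 0.015, 0.001],
})
df2 = pd.DataFrame({
    "Ensembl_ID": ["ENSG00000000457"] + ["ENSG00000004399"] * 2,
    "A-Site": [133, 134, 434],
    "A-Score": [0.040, 0.251, 0.413],
})

# Left join on gene ID and site so each df1 row meets its df2 counterpart.
# indicator=True adds a "_merge" column telling where each row came from.
merged = df1.merge(
    df2,
    how="left",
    left_on=["Ensembl_ID", "R-Site"],
    right_on=["Ensembl_ID", "A-Site"],
    indicator=True,
)

# A row is flagged when it has no partner in df2 or when the scores differ.
missing = merged["_merge"] == "left_only"
different = ~missing & (merged["R-Score"] != merged["A-Score"])

merged["status"] = "same"
merged.loc[different, "status"] = "different"
merged.loc[missing, "status"] = "missing"

print(merged[["Ensembl_ID", "R-Site", "R-Score", "A-Score", "status"]])
```

With this pairing, site 110 is reported as missing, while site 434 is reported as different even though its R-Score of 0.001 occurs elsewhere in df2.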
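The problem with your logic is that isin checks whether the column value is found at any index and, if so, returns True. In your sample data, the value 0.001 at index 19 of df1 appears at index 8 of df2.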
df1['compare_Scores'] = df1['R-Score'].isin(df2['A-Score'])
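If you want an index-wise comparison, the logic below will work for you, provided both frames share the same index. Bracket notation is needed because the column names contain a hyphen, which attribute access would parse as a subtraction.
df1['compare_Scores'] = df1['R-Score'] == df2['A-Score']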
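The index-wise comparison only works when both Series carry identical labels. A minimal sketch on made-up values (the variable names are hypothetical) shows both the working case and what happens when the lengths differ, as they do in the question:

```python
import pandas as pd

r = pd.Series([0.040, 0.105, 0.063], name="R-Score")
a_same_index = pd.Series([0.040, 0.251, 0.063], name="A-Score")
a_shorter = pd.Series([0.040, 0.251], name="A-Score")

# Identical labels: the comparison is done row by row.
print((r == a_same_index).tolist())

# Different labels: == raises rather than guessing how to align the rows.
try:
    r == a_shorter
except ValueError as exc:
    print("ValueError:", exc)
```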
Solution:
df1['compare_Sites'] = df1['R-Score'].isin(df2['A-Score'])
#print(df1)
# Print the rows whose R-Score value does not occur anywhere in df2['A-Score'].
if not df1['compare_Sites'].all():
    print(df1[~df1['compare_Sites']])
Ensembl_ID length score identity p_value R-Site
0 ENSG00000000457 208 42.98 92.857 4.390000e-34 110
10 ENSG00000004399 656 130.45 100.000 0.000000e+00 134
11 ENSG00000004399 656 130.45 100.000 0.000000e+00 151
12 ENSG00000004399 656 130.45 100.000 0.000000e+00 153
13 ENSG00000004399 656 130.45 100.000 0.000000e+00 204
14 ENSG00000004399 656 130.45 100.000 0.000000e+00 290
15 ENSG00000004399 656 130.45 100.000 0.000000e+00 298
16 ENSG00000004399 656 130.45 100.000 0.000000e+00 342
17 ENSG00000004399 656 130.45 100.000 0.000000e+00 362
18 ENSG00000004399 656 130.45 100.000 0.000000e+00 431
20 ENSG00000004399 656 130.45 100.000 0.000000e+00 514
21 ENSG00000004399 656 130.45 100.000 0.000000e+00 516
22 ENSG00000004399 656 130.45 100.000 0.000000e+00 556
R-PercentPosition R-Score compare_Sites
0 52.884615 0.147 False
10 20.426829 0.015 False
11 23.018293 0.017 False
12 23.323171 0.528 False
13 31.097561 0.044 False
14 44.207317 0.008 False
15 45.426829 0.111 False
16 52.134146 0.382 False
17 55.182927 0.042 False
18 65.701220 0.002 False
20 78.353659 0.014 False
21 78.658537 0.872 False
22 84.756098 0.243 False