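I have an SDF file containing thousands of molecules, and several text files of IDs that are grouped by certain features. Right now I have a script that loads a CSV database of molecular features and generates the ID text files by classifying the molecules on those features. I want to use these text files to parse the SDF file and get new SDF files containing the corresponding molecules. I also want to do this in MATLAB.
For example, here are some molecules from the original SDF file: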
NCGC00178831-03
Marvin 07111412562D
34 37 0 0 0 0 999 V2000
4.8814 -2.7443 0.0000 Cl 0 5 0 0 0 0 0 0 0 0 0 0
2.8647 -2.4751 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8647 -1.6501 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
3.5808 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2970 -1.6501 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0017 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7179 -1.6501 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.0017 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2970 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5808 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8647 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1485 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1485 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4324 -1.6501 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7162 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.6501 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.7162 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4324 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8761 -3.5407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.5923 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3084 -3.5407 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0132 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7293 -3.5407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.0132 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3084 -5.1908 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5923 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8761 -5.1908 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1599 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1599 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4438 -3.5407 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7276 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0115 -3.5407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.7276 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4438 -5.1908 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
3 13 1 0 0 0 0
4 5 1 0 0 0 0
4 10 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
6 8 1 0 0 0 0
8 9 2 0 0 0 0
9 10 1 0 0 0 0
10 11 2 0 0 0 0
11 12 1 0 0 0 0
12 13 2 0 0 0 0
12 18 1 0 0 0 0
13 14 1 0 0 0 0
14 15 2 0 0 0 0
15 16 1 0 0 0 0
15 17 1 0 0 0 0
17 18 2 0 0 0 0
19 20 2 0 0 0 0
19 29 1 0 0 0 0
20 21 1 0 0 0 0
20 26 1 0 0 0 0
21 22 2 0 0 0 0
22 23 1 0 0 0 0
22 24 1 0 0 0 0
24 25 2 0 0 0 0
25 26 1 0 0 0 0
26 27 2 0 0 0 0
27 28 1 0 0 0 0
28 29 2 0 0 0 0
28 34 1 0 0 0 0
29 30 1 0 0 0 0
30 31 2 0 0 0 0
31 32 1 0 0 0 0
31 33 1 0 0 0 0
33 34 2 0 0 0 0
M CHG 2 1 -1 3 1
M END
> <Formula>
C27H25ClN6
> <FW>
468.9806 (35.4535+224.2805+209.2465)
> <DSSTox_CID>
25848
> <SR-HSE>
0
$$$$
NCGC00166114-03
Marvin 07111412562D
31 32 0 0 0 0 999 V2000
4.9884 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9884 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -2.4764 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -3.7038 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9884 -4.1178 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 -3.7038 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4157 -4.1178 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
5.7021 -2.8760 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.9884 -4.9385 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -5.3524 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -4.9385 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -4.1178 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -0.8279 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.8403 -0.8279 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1267 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1267 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8403 -2.4764 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4202 -2.4764 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
1.4202 -0.8279 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
2.8403 0.0000 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 -2.4764 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4229 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4229 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 -0.8279 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 0.0000 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
7.1366 -0.8279 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.1366 -2.4764 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
7.0866 -4.1963 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
0.0000 -0.7708 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 15 1 0 0 0 0
1 26 2 0 0 0 0
2 3 2 0 0 0 0
2 23 1 0 0 0 0
3 4 1 0 0 0 0
3 13 1 0 0 0 0
4 5 2 0 0 0 0
4 12 1 0 0 0 0
5 6 1 0 0 0 0
5 9 1 0 0 0 0
6 7 1 0 0 0 0
6 8 2 0 0 0 0
9 10 2 0 0 0 0
10 11 1 0 0 0 0
11 12 2 0 0 0 0
13 14 2 0 0 0 0
13 19 1 0 0 0 0
14 15 1 0 0 0 0
14 16 1 0 0 0 0
16 17 2 0 0 0 0
16 22 1 0 0 0 0
17 18 1 0 0 0 0
17 21 1 0 0 0 0
18 19 2 0 0 0 0
18 20 1 0 0 0 0
23 24 2 0 0 0 0
24 25 1 0 0 0 0
24 29 1 0 0 0 0
25 26 1 0 0 0 0
25 28 2 0 0 0 0
26 27 1 0 0 0 0
M CHG 4 7 -1 21 -1 30 1 31 1
M END
> <Formula>
C20H6Br4Na2O5
> <FW>
691.8542 (645.8757+22.9892+22.9892)
> <DSSTox_CID>
5234
> <SR-HSE>
0
$$$$
NCGC00263563-01
Marvin 07111412562D
71 76 0 0 1 0 999 V2000
2.1953 -4.9878 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
3.6803 -4.9878 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
2.9701 -5.4074 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.5858 -4.9878 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
5.1008 -4.9878 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
2.1953 -4.1484 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
11.8157 -5.6335 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.1239 -5.8755 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
11.0893 -5.1008 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
3.6803 -4.1484 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
10.2015 -5.1008 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
12.5905 -5.1653 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
14.9633 -5.8755 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
4.3905 -5.4074 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
5.8755 -5.4074 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9701 -3.6803 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
11.4606 -4.3905 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
13.6558 -5.1653 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
9.5559 -5.5043 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.2476 -5.5043 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.1008 -4.1484 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
1.4850 -5.4074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.8157 -2.4858 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.9578 -4.9878 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
6.5858 -4.1484 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
12.5905 -2.9055 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
12.3483 -4.3905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.8157 -1.6626 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.8755 -3.6803 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
13.3008 -1.6626 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
12.5905 -1.2429 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
13.3008 -2.4858 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
8.8457 -4.9878 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
11.4606 -3.1961 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.1239 -4.5035 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
0.7748 -4.9878 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
15.4314 -5.2137 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
14.9633 -4.5035 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.9756 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -5.4074 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
7.6673 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1953 -5.7464 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.8764 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.0877 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7748 -4.1484 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.5437 -6.4567 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.6803 -3.3736 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9701 -2.9055 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8755 -2.9055 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.0110 -1.2429 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
12.5905 -0.4197 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.4850 -3.6803 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
15.5444 -6.4082 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
10.5566 -4.3905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3905 -6.1177 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5035 -3.7933 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.1838 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
14.0110 -2.9055 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.6558 -3.7449 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
16.1416 -5.2137 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2130 -2.9701 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1953 -2.3729 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
14.7858 -1.6626 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.3008 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.0893 -5.8755 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
12.5905 -5.9885 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
8.8941 -5.7464 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.6803 -5.7464 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.1008 -5.7464 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
13.6558 -5.9885 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.4681 -6.7634 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
1 3 1 0 0 0 0
1 6 1 0 0 0 0
1 22 1 6 0 0 0
1 42 1 1 0 0 0
2 3 1 0 0 0 0
2 14 1 0 0 0 0
2 68 1 1 0 0 0
2 10 1 0 0 0 0
4 15 1 0 0 0 0
4 20 1 1 0 0 0
4 43 1 0 0 0 0
4 25 1 0 0 0 0
5 14 1 0 0 0 0
5 15 1 0 0 0 0
5 21 1 0 0 0 0
5 69 1 1 0 0 0
6 16 1 0 0 0 0
6 52 1 1 0 0 0
7 9 1 0 0 0 0
7 12 1 0 0 0 0
8 18 1 0 0 0 0
8 13 1 0 0 0 0
9 11 1 0 0 0 0
9 17 1 0 0 0 0
9 65 1 6 0 0 0
10 16 1 0 0 0 0
10 47 1 1 0 0 0
11 19 1 0 0 0 0
11 54 1 6 0 0 0
11 39 1 0 0 0 0
12 18 1 0 0 0 0
12 66 1 1 0 0 0
12 27 1 0 0 0 0
13 46 1 1 0 0 0
13 53 1 6 0 0 0
13 37 1 0 0 0 0
14 55 1 1 0 0 0
16 48 1 6 0 0 0
17 27 1 0 0 0 0
17 34 1 1 0 0 0
18 35 1 0 0 0 0
18 70 1 1 0 0 0
19 33 1 0 0 0 0
20 24 1 0 0 0 0
21 29 1 0 0 0 0
21 56 1 6 0 0 0
22 36 1 0 0 0 0
23 34 1 0 0 0 0
23 26 1 0 0 0 0
23 28 1 0 0 0 0
24 33 1 0 0 0 0
24 57 1 6 0 0 0
24 41 1 0 0 0 0
25 29 1 0 0 0 0
26 32 1 0 0 0 0
28 31 1 0 0 0 0
29 49 1 1 0 0 0
30 31 1 0 0 0 0
30 50 1 1 0 0 0
30 32 1 0 0 0 0
31 51 1 6 0 0 0
32 58 1 6 0 0 0
33 44 1 0 0 0 0
33 67 1 6 0 0 0
35 38 1 0 0 0 0
35 59 1 1 0 0 0
36 40 1 0 0 0 0
36 45 2 0 0 0 0
37 38 1 0 0 0 0
37 60 1 1 0 0 0
39 44 1 0 0 0 0
41 43 1 0 0 0 0
47 61 1 0 0 0 0
48 62 1 0 0 0 0
50 63 1 0 0 0 0
51 64 1 0 0 0 0
M CHG 2 40 -1 71 1
M END
> <Formula>
C47H83NO17
> <FW>
934.1584 (916.1205+18.0379)
> <DSSTox_CID>
28909
> <SR-HSE>
0
$$$$
Here are some IDs from one of the text files:
NCGC00015959-03
NCGC00168261-01
NCGC00257010-01
NCGC00254654-01
NCGC00254471-01
The resulting SDF file should start like this:
NCGC00015959-03
Marvin 07111412562D
25 30 0 0 0 0 999 V2000
3.4098 -1.3130 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
4.8329 -1.3130 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4098 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1248 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6948 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8329 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1248 -0.8937 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5547 -0.8937 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9799 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6948 -3.3548 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2718 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2718 -3.3548 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1248 -3.3548 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9799 -3.7741 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5547 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2765 -1.3130 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7128 -0.0894 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.4881 -2.2755 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.4881 -3.6160 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.8746 -0.7562 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.5378 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.9423 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4098 -3.7741 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2765 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6948 -0.8937 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 3 1 0 0 0 0
1 7 2 0 0 0 0
1 25 1 0 0 0 0
2 7 1 0 0 0 0
2 6 2 0 0 0 0
2 8 1 0 0 0 0
3 4 2 0 0 0 0
3 5 1 0 0 0 0
4 13 1 0 0 0 0
4 6 1 0 0 0 0
5 9 1 0 0 0 0
5 10 2 0 0 0 0
6 15 1 0 0 0 0
8 16 2 0 0 0 0
8 17 1 0 0 0 0
9 11 2 0 0 0 0
10 14 1 0 0 0 0
10 23 1 0 0 0 0
11 18 1 0 0 0 0
11 12 1 0 0 0 0
12 14 2 0 0 0 0
12 19 1 0 0 0 0
13 23 2 0 0 0 0
15 24 2 0 0 0 0
16 20 1 0 0 0 0
16 24 1 0 0 0 0
17 21 1 0 0 0 0
18 22 1 0 0 0 0
19 22 1 0 0 0 0
20 21 1 0 0 0 0
M CHG 1 1 1
M END
> <Formula>
C20H14NO4
> <FW>
332.3289
> <DSSTox_CID>
25204
> <NR-AR>
0
> <NR-ER-LBD>
1
> <NR-AhR>
1
$$$$
NCGC00168261-01
Marvin 07111412562D
23 25 0 0 0 0 999 V2000
2.1236 -2.4895 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4205 -2.0662 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1236 -3.3074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4205 -3.7235 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -2.4895 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -3.3074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8554 -2.0662 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.0662 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4205 -1.2412 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8554 -3.7235 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5656 -2.4895 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5656 -3.3074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8554 -1.2412 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -0.8251 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.2412 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0430 -2.8984 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -4.1324 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2902 -3.7378 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0292 -3.3145 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4569 -3.3360 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7538 -3.7378 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.1743 -3.7378 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 2 0 0 0 0
1 7 1 0 0 0 0
2 5 2 0 0 0 0
2 9 1 0 0 0 0
3 4 1 0 0 0 0
3 10 1 0 0 0 0
4 6 1 0 0 0 0
5 8 1 0 0 0 0
5 6 1 0 0 0 0
6 16 1 0 0 0 0
6 17 1 0 0 0 0
7 11 2 0 0 0 0
7 13 1 0 0 0 0
8 15 2 0 0 0 0
9 14 2 0 0 0 0
10 12 2 0 0 0 0
11 12 1 0 0 0 0
12 18 1 0 0 0 0
14 15 1 0 0 0 0
14 19 1 0 0 0 0
18 20 1 0 0 0 0
20 22 1 0 0 0 0
21 22 1 0 0 0 0
21 23 1 0 0 0 0
M END
> <Formula>
C21H26O2
> <FW>
310.4299
> <DSSTox_CID>
28922
> <NR-AR>
0
> <NR-AhR>
1
> <SR-MMP>
1
$$$$
NCGC00257010-01
Marvin 07111412562D
35 37 0 0 0 0 999 V2000
2.0286 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -7.8578 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -0.7019 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8589 -3.5779 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.6092 -2.8589 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.6092 -4.2799 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.2784 -4.2799 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -7.1217 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -1.4381 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3681 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5024 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5024 -4.9989 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0915 -4.2799 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -4.9989 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7704 -4.2799 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7704 -2.8589 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.7294 -1.1385 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
6.2829 -0.2996 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.7294 -7.4213 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.4384 -8.5597 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
6.2829 -8.2601 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.4384 0.0000 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -2.1485 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -6.4112 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -1.4381 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -7.1217 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -5.7008 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -2.8589 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -5.7008 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -2.8589 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -6.4112 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -2.1485 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.9103 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0086 -4.2542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 4 2 0 0 0 0
1 5 1 0 0 0 0
1 6 1 0 0 0 0
2 8 1 0 0 0 0
2 20 1 0 0 0 0
2 21 1 0 0 0 0
2 22 1 0 0 0 0
3 9 1 0 0 0 0
3 18 1 0 0 0 0
3 19 1 0 0 0 0
3 23 1 0 0 0 0
4 7 1 0 0 0 0
5 17 1 0 0 0 0
6 16 1 0 0 0 0
7 13 2 0 0 0 0
8 27 1 0 0 0 0
8 25 2 0 0 0 0
9 26 2 0 0 0 0
9 24 1 0 0 0 0
10 16 1 0 0 0 0
10 34 1 0 0 0 0
10 35 1 0 0 0 0
10 17 1 0 0 0 0
11 13 1 0 0 0 0
11 14 2 0 0 0 0
12 13 1 0 0 0 0
12 15 2 0 0 0 0
14 29 1 0 0 0 0
15 28 1 0 0 0 0
24 31 2 0 0 0 0
25 30 1 0 0 0 0
26 33 1 0 0 0 0
27 32 2 0 0 0 0
28 30 2 0 0 0 0
28 32 1 0 0 0 0
29 31 1 0 0 0 0
29 33 2 0 0 0 0
M END
> <Formula>
C25H24F6N4
> <FW>
494.4753
> <DSSTox_CID>
3868
> <NR-AR>
0
> <NR-ER>
1
> <NR-AhR>
1
$$$$
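I have seen this post about extracting molecules from an SDF file, in order, according to IDs given in another file. It offers a solution for Unix. I used that workaround on the command line:
awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1]=$0;next}$1 in a' ids.txt RS="$" molecules.sdf > molecules_by_ids.sdf
and was able to get most of what I wanted. However, even with this command I could not extract 100% of the molecules from the SDF file. For example, one of the features has 981 positive molecules, so the text file contains 981 IDs, but the command gives me only 950 molecules in the output SDF file.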
What I really want is a MATLAB solution that does not miss any molecules in the generated file. I appreciate any effort to help solve this. Thanks.
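One way to narrow down the gap between 981 IDs and 950 extracted molecules is to list the IDs that never appear as a record title in the SDF file. The sketch below is only a diagnostic under stated assumptions: `find_missing_ids` is a hypothetical helper name, it assumes each record's ID is the first line of the record (as in the samples above), and it treats CRLF line endings, stray whitespace and duplicate IDs as possible causes rather than confirmed ones.

```matlab
function missing = find_missing_ids(id_file, sdf_file)
% Hypothetical helper: return the IDs in id_file that do not occur as a
% record title (first line of a record) in sdf_file.
    id_text  = fileread(id_file);
    sdf_text = fileread(sdf_file);

    % Carriage returns hint at CRLF line endings, which can break exact matching.
    if any(id_text == char(13)) || any(sdf_text == char(13))
        warning('Carriage returns found; line endings may be CRLF.');
    end

    % One ID per line: trim whitespace, drop blank lines and duplicates.
    ids = strtrim(regexp(id_text, '\r?\n', 'split'));
    ids = unique(ids(~cellfun(@isempty, ids)), 'stable');

    % A title is the first line of the file or the line after a "$$$$" line.
    tok = regexp(sdf_text, '(?:^|\n\$\$\$\$[ \t]*\r?\n)([^\r\n]*)', 'tokens');
    titles = strtrim(cellfun(@(t) t{1}, tok, 'UniformOutput', false));

    % IDs requested but absent from the SDF titles, in ID-file order.
    missing = setdiff(ids, titles, 'stable');
end
```

If `missing` comes back empty, every requested ID exists in the SDF file and the shortfall is more likely due to how the extraction command splits or matches records.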
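A workaround I found in MATLAB is the function below, where "id" is the name of the ID text file, "sdfs" is the SDF database, and "sdf_name" is the name of the new SDF file holding the molecules extracted by ID: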
function write_sdf(id, sdfs, sdf_name)
    % Open the text file of IDs; close it automatically on exit or error.
    fid = fopen(id);
    assert(fid ~= -1, 'Could not open ID file %s', id);
    closer = onCleanup(@() fclose(fid));
    % Read the whole SDF file into a character array.
    data = fileread(sdfs);
    % For each ID, get the portion of the SDF file corresponding
    % to the molecule ID.
    mol_id = fgetl(fid);
    % fgetl returns -1 (not a char) at the end of the file.
    while ischar(mol_id)
        mol_id = strtrim(mol_id);
        % Skip blank lines and IDs that do not occur in the SDF file.
        if ~isempty(mol_id) && contains(data, mol_id)
            mol_after = extractAfter(data, mol_id);
            mol_between = extractBefore(mol_after, '$$$$');
            mol_full = [mol_id mol_between '$$$$'];
            % Append the molecule to the output SDF file.
            writelines(mol_full, sdf_name, WriteMode='append');
        end
        mol_id = fgetl(fid);
    end
end
The problem with this solution is that it is very slow. If anyone knows a faster way, please let me know! For now, I will use this one.
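Much of the slowness probably comes from rescanning the whole SDF text and reopening the output file once per ID. A possible faster variant, sketched below, splits the SDF file into records once, matches all IDs against the record titles in a single `ismember` call, and writes the output in one pass. The name `write_sdf_fast` and its return value are hypothetical, and the sketch assumes that each record's first line is exactly the ID and that records end with a line containing only `$$$$`.

```matlab
function n_written = write_sdf_fast(id_file, sdf_file, out_file)
% Hypothetical faster variant of write_sdf: index the SDF once, then
% write the records for all IDs in the order they appear in the ID file.
    crlf = sprintf('\r\n');
    data = strrep(fileread(sdf_file), crlf, newline);   % normalise CRLF to LF
    ids  = strtrim(strsplit(strrep(fileread(id_file), crlf, newline), newline));
    ids  = ids(~cellfun(@isempty, ids));                 % drop blank lines

    % Split on lines that contain only "$$$$"; drop empty leftovers.
    records = regexp(data, '\n\$\$\$\$[ \t]*(\n|$)', 'split');
    records = records(~cellfun(@(r) isempty(strtrim(r)), records));

    % The record title is the first line of each record.
    titles = strtrim(regexp(records, '^[^\n]*', 'match', 'once'));

    % Look up every ID at once; loc gives the first matching record.
    [found, loc] = ismember(ids, titles);
    if ~all(found)
        warning('%d ID(s) were not found in the SDF file.', nnz(~found));
    end

    fid = fopen(out_file, 'w');
    assert(fid ~= -1, 'Could not open output file %s', out_file);
    closer = onCleanup(@() fclose(fid));
    for k = find(found)
        % Re-attach the delimiter line that the split removed.
        fprintf(fid, '%s\n%s\n', records{loc(k)}, '$$$$');
    end
    n_written = nnz(found);
end
```

A call such as `n = write_sdf_fast('ids.txt', 'molecules.sdf', 'molecules_by_ids.sdf')` returns the number of records written, so it can be compared directly with the number of IDs. If the same title occurs more than once in the SDF file, only the first matching record is written.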