How to use RDKit to calculate molecular fingerprints and similarity for a list of SMILES structures



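I am using RDKit to calculate molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES structures. I can already extract the SMILES structures from two separate csv files. How do I feed these structures into RDKit's fingerprint module, and how do I calculate the similarity between the two lists of molecules pair by pair?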

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'),
      # ... (further molecules elided) ...
      Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0],fps[1])

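I want to put all the SMILES structures I have (more than 10,000) into the "ms" list and get their fingerprints. Then I will compare the similarity between each pair of molecules from the two lists; maybe a for loop is needed here?

Thanks in advance!

I used a pandas DataFrame to select and print the lists with my structures, and saved them to list_1 and list_2. When it reaches the ms1 line, it raises the following error: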

TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t, 
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float

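Then I checked the file, and the SMILES column contains only SMILES. But when I manually put some molecular structures into the list for testing, there is still an error: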

fpArgs['minSize']. 

For example, the SMILES of gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC([O-]4)=O)CC[N]5(CC(=[O]6)NC)CC(=O)[O-]7)C1)NC", and the error is as follows (when running the fps line):

ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType, int, int, int, int, int, float, int)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, 
unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, 
bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, 
bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, 
boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None).

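If the original csv file looks like this, how can I include the molecule names in the output file along with the similarity values:

name,smiles,value,value2
molecule1,CCOCN(C)(C),0.25,
molecule2,CCO,1.12,B
molecule3,COC,2.25,C

I added the following code to include the molecule names in the output file, but it gives an array value error related to the names (especially for d2):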

name_1 = df_1['id1']
name_2 = df_2['id2']
name_3 = pd.concat([name_1, name_2])
# create a list for the dataframe
d1, qu, d2, ta, sim = [], [], [], [], []
for n in range(len(fps)-1):
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:])
    #print(c_smiles[n], c_smiles[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
        d1.append(name_3[n])
        d2.append(name_3[n+1:][m])
    #print()
d = {'ID_1':d1, 'query':qu, 'ID_2':d2, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
for index, row in df.iterrows():
    print (row["ID_1"], row["query"], row["ID_2"], row["target"], row["Similarity"])
print(df_final)
# save as csv
df_final.to_csv('RESULT_3.csv', index=False, sep=',')

Answer edited to capture all comments.

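RDKit has a bulk function for similarity, DataStructs.BulkTanimotoSimilarity, so you can compare one fingerprint against a whole list of fingerprints. Just loop over the list of fingerprints.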
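To make concrete what a one-against-many Tanimoto comparison computes, here is a minimal pure-Python sketch. It is an illustration only, not RDKit's implementation: fingerprints are modelled as sets of on-bit indices, `tanimoto` and `bulk_tanimoto` are hypothetical helper names, and returning 0.0 for two empty fingerprints is a convention chosen here (RDKit may handle that case differently).

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices:
    (bits set in both) / (bits set in either)."""
    union = len(a | b)
    # Assumption: two empty fingerprints score 0.0 in this sketch.
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query, targets):
    """Compare one fingerprint against a list of fingerprints."""
    return [tanimoto(query, t) for t in targets]


# Made-up bit sets, not real molecular fingerprints.
query = {1, 4, 7, 9}
targets = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}]

scores = bulk_tanimoto(query, targets)
print(scores)  # [1.0, 0.4, 0.0]
```

The result has one score per target, in the same order as the target list, which is why the loop in the answer can index the scores and the SMILES with the same counter.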

If the CSVs look like this

First csv, with an invalid SMILES

smiles,value,value2
CCOCN(C)(C)C,0.25,A
CCO,1.12,B
COC,2.25,C

Second csv, with correct SMILES

smiles,value,value2
CCOCC,0.55,D
CCCO,2.58,E
CCCCO,5.01,F
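A note on the first TypeError in the question: it complains about a Python object of type float, which is what pandas yields (NaN) for an empty cell in the SMILES column. A minimal sketch, assuming that is the cause, that separates usable strings from everything else before anything is handed to RDKit. `keep_string_smiles` is a hypothetical helper, not part of RDKit or pandas, and the sample column is made up.

```python
def keep_string_smiles(values):
    """Split a column of raw values into usable SMILES strings and rejects.

    Anything that is not a non-empty string (for example the float NaN that
    stands for an empty csv cell) is reported with its position instead of
    being passed on.
    """
    kept, dropped = [], []
    for position, value in enumerate(values):
        if isinstance(value, str) and value.strip():
            kept.append(value.strip())
        else:
            dropped.append((position, value))
    return kept, dropped


# Made-up column: one NaN (empty cell), one padded entry, one empty string.
column = ['CCO', float('nan'), ' COC ', '', 'CCCO']

kept, dropped = keep_string_smiles(column)
print(kept)                                   # ['CCO', 'COC', 'CCCO']
print([position for position, _ in dropped])  # [1, 3]
```

With a DataFrame, `df.dropna(subset=['smiles'])` removes the NaN rows in a similar way. This only guards against non-string input; strings that are not valid SMILES still need the parsing check.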

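This is how to read the SMILES, drop the invalid ones, compute the fingerprint similarity for every pair without duplicates, and save the sorted values.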

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd
# read and concatenate the csv files
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
df_3 = pd.concat([df_1, df_2])
# check the SMILES and make a list of the valid ones
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:  # raised when the SMILES cannot be parsed
        print('Invalid SMILES:', ds)
print()
# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]
# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
# the lists for the dataframe
qu, ta, sim = [], [], []
# compare all fp pairwise without duplicates
for n in range(len(fps)-1): # -1 so the last fp will not be used as a query
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # n+1: compare with every fp after position n
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
print()
# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)
# save as csv
df_final.to_csv('third.csv', index=False, sep=',')

Printed output:

Invalid SMILES: CCOCN(C)(C)C
CCO ['COC', 'CCOCC', 'CCCO', 'CCCCO']
COC ['CCOCC', 'CCCO', 'CCCCO']
CCOCC ['CCCO', 'CCCCO']
CCCO ['CCCCO']
   query target  Similarity
9   CCCO  CCCCO    0.769231
2    CCO   CCCO    0.600000
1    CCO  CCOCC    0.500000
7  CCOCC   CCCO    0.466667
3    CCO  CCCCO    0.461538
8  CCOCC  CCCCO    0.388889
4    COC  CCOCC    0.333333
5    COC   CCCO    0.272727
0    CCO    COC    0.250000
6    COC  CCCCO    0.214286
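
To also carry the molecule names into the output (the follow-up in the question), one option is to keep name, SMILES and fingerprint together in a single record, so that dropping an invalid molecule can never shift the names against the SMILES. In the question's code, `name_3` comes from `pd.concat` without `ignore_index=True`, so index labels repeat and `name_3[n]` can return more than one value, which is a likely source of the array error. The sketch below is pure Python under stated assumptions: `pairwise_table` and `toy_similarity` are hypothetical names, the name 'molecule4' is invented for the example, and the bit sets are made-up stand-ins for RDKit fingerprints.

```python
from itertools import combinations


def pairwise_table(records, similarity):
    """Build one output row per unordered pair of records.

    records: list of (name, smiles, fingerprint) tuples.
    similarity: callable taking two fingerprints and returning a number.
    Rows are returned sorted by similarity, highest first.
    """
    rows = []
    for (name_1, smiles_1, fp_1), (name_2, smiles_2, fp_2) in combinations(records, 2):
        rows.append({
            'ID_1': name_1, 'query': smiles_1,
            'ID_2': name_2, 'target': smiles_2,
            'Similarity': similarity(fp_1, fp_2),
        })
    return sorted(rows, key=lambda row: row['Similarity'], reverse=True)


def toy_similarity(a, b):
    """Stand-in for a real fingerprint similarity: Tanimoto on sets of bits."""
    return len(a & b) / len(a | b)


# Made-up bit sets in place of real fingerprints.
records = [
    ('molecule2', 'CCO', {1, 2, 3}),
    ('molecule3', 'COC', {3, 4}),
    ('molecule4', 'CCCO', {1, 2, 3, 5}),
]

table = pairwise_table(records, toy_similarity)
for row in table:
    print(row['ID_1'], row['query'], row['ID_2'], row['target'], row['Similarity'])
```

With real data the records would hold RDKit fingerprints and the callable could be `DataStructs.TanimotoSimilarity`; the list of rows can be passed straight to `pd.DataFrame` and saved with `to_csv` as in the code above.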
