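I am trying to get an output based on the following procedure, which is best explained with an example.
For example, for the SMILES string
C(N)(N)CC(N)C, [0, 1, 2, 0, 0, 1, 0]
is the output I want.
It counts branches (denoted by parentheses). So for the example above, the first (N) is counted as 1 and the second (N) as 2. The count resets as soon as it reaches an atom that is not in a branch (that is, not inside parentheses). It then stays at 0 until the next branch, where counting starts again and later resets. The problem is that I am not getting the expected output. Below are my output, the expected output, and my code. Thanks.
In addition, I need to make sure that cases like CC(CC(C)) are not indexed incorrectly. The count should not run past the branch, fail to reset, or keep increasing continuously. That SMILES string should give [0 0 1 1 1].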
Another example: CC… [0 0 1 1 1 0 0 0]
For nested parentheses, I will re-run this procedure and start counting from 1.
This is what I get:
          SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]
when it should be this:
          SMILES           branch_count
0  C(N)(N)CC(N)C  [0, 1, 2, 0, 0, 1, 0]
1            CCC              [0, 0, 0]
2          C1CC1              [0, 0, 0]
3      C1CC1(C)C        [0, 0, 0, 1, 0]
4         CC(C)C           [0, 0, 1, 0]
import pandas as pd
import numpy as np
from rdkit import Chem

def get_branch_count(smile):
    # Initialize variables
    branch_count = [0] * len(smile)
    bracket_count = 0
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        # If the character is an open bracket, increment bracket count
        if c == "(":
            bracket_count += 1
        # If the character is a close bracket, decrement bracket count
        elif c == ")":
            bracket_count -= 1
            # If there are no more open brackets after this one, reset current count
            if bracket_count == 0:
                current_count = 0
        # If the character is not a bracket, update the current count
        else:
            if bracket_count > 0:
                # If the previous character was also a bracket, don't increment the count
                if smile[i-1] != ")":
                    current_count += 1
            else:
                current_count = 0
        branch_count[i] = current_count
    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)
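Your loop includes the brackets as characters, so your code treats every opening and closing bracket as an atom. You should use `.isalpha()` to check whether a character is a letter. You should also keep a separate counter (mine is `n`) so that a number is written only for characters that should get one. For example, in your faulty code the brackets and digits are also replaced by 0/1, which means you end up with extra entries that you don't want. Read my comments for more explanation, and run this code on your own machine to make sure it is correct (although I have checked it multiple times).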
import pandas as pd
import numpy as np
from rdkit import Chem

# All changes in function
def get_branch_count(smile):
    # Initialize variables
    n = 0  # Next index to fill, so that brackets and other non-alphabetical characters are not included
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    bracket_count = 0
    bracket_together = 0  # Use this variable for when the brackets are next to each other for less confusing code
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        if c == '(':
            bracket_count += 1
        # Continue after the IF statement because the letters are now inside of the brackets
        elif bracket_count >= 1 and c.isalpha():
            current_count = bracket_count
            branch_count[n] = current_count
            n += 1
        # This is to check if there are consecutive branches
        elif c == ')':
            # The length check avoids an IndexError when ')' is the last character
            if i + 1 >= len(smile) or smile[i + 1] != '(':
                bracket_count = 0
        # If the character is not surrounded by brackets and if it is alphabetical
        elif c.isalpha() and bracket_count == 0:
            current_count = 0
            branch_count[n] = current_count  # Do this inside of each IF statement for the alphabetical chars so that it doesn't include the brackets
            n += 1
    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)
As you can see, I changed a few things:
Instead of `branch_count = [0] * len(smile)`, I changed it to:
```python
# This is to make sure that there are no extra numbers
# (for example the brackets and the non-alphabetical characters).
length_smile = 0
for char in smile:
    if char.isalpha():
        length_smile += 1
branch_count = [0] * length_smile
```
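The same sizing step can be written more compactly. The following is only a sketch: the helper name `count_atom_letters` is made up for illustration, and like the loop above it counts letters rather than atoms, so a two-letter element such as Cl would be counted twice.

```python
def count_atom_letters(smile):
    """Count alphabetic characters; brackets, digits and bond symbols are ignored."""
    # bool is a subclass of int, so summing the isalpha() results counts the letters
    return sum(ch.isalpha() for ch in smile)


# Size the result list from the letter count instead of len(smile)
branch_count = [0] * count_atom_letters("C1CC1(C)C")
print(branch_count)  # [0, 0, 0, 0, 0]
```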
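Here is my solution.
First, I replace every `C1` with `C`, so that each remaining letter is evaluated as a possible group member. Then I count the opening brackets. If exactly one bracket is open, I have a new group. When I hit a closing bracket, I check whether the next character is an opening bracket, to see whether a consecutive group follows. If not, I reset the counter to 0.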
import pandas as pd

def smile_grouping(s):
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0
    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)
        if open_brackets == 0:
            if i+1<len(s) and s[i+1] != '(':
                group_counter = 0
    return res
This is the result:
df = pd.DataFrame(
    {'smile':[
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
              smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
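Replacing `C1` only covers ring closures written exactly that way. A possible variation, sketched below, is to tokenize the string with a regular expression so that ring-closure digits and bond symbols are skipped and two-letter halogens (Cl, Br) or bracket atoms such as [N+] count as one atom each. The names `TOKEN_RE` and `branch_counts` and the token pattern are assumptions made for this illustration, and the pattern is not a full SMILES parser.

```python
import re

# Assumed token pattern: bracket atoms, Br/Cl, single-letter atoms, parentheses.
# Anything else (ring-closure digits, bond symbols, dots) is simply not matched.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[()]")


def branch_counts(smiles):
    depth = 0    # number of currently open parentheses
    group = 0    # position of the current branch in a run of consecutive branches
    counts = []
    for token in TOKEN_RE.findall(smiles):
        if token == "(":
            if depth == 0:
                group += 1  # a new top-level branch starts
            depth += 1
        elif token == ")":
            depth -= 1
        else:
            if depth == 0:
                group = 0   # a main-chain atom ends the run of branches
            counts.append(group)
    return counts


print(branch_counts("C(N)(N)CC(N)C"))  # [0, 1, 2, 0, 0, 1, 0]
print(branch_counts("CC(CC(C))"))      # [0, 0, 1, 1, 1]
```

Because the reset happens when a main-chain atom is seen rather than by looking ahead after a closing bracket, no index check on the next character is needed, and a string that ends in `)` is handled without special cases.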