String indexing



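I'm trying to get output based on the following procedure, which is best explained with an example.

For example, for the SMILES

C(N)(N)CC(N)C, [0,1,2,0,0,1,0]
is the output I want.

It counts branches (denoted by parentheses). So for the example above, the first (N) is counted as 1 and the second (N) as 2. The count resets as soon as it reaches an atom that is not in a branch (not inside parentheses). It then stays at 0 until the next branch, where the count starts and later resets again. The problem is that I'm not getting the expected output. Below are my output, the expected output, and my code. Thanks.

Also, I need to make sure that cases like CC(CC(C)) are not indexed incorrectly. The count should not run on without resetting, and it should not keep incrementing inside a branch. That SMILES should give the output [0 0 1 1 1].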

Another example: CC… should give [0 0 1 1 1 0 0 0]

For nested parentheses, I will rerun this procedure and start counting from 1.

I get this

SMILES                             branch_count
0  C(N)(N)CC(N)C  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
1            CCC                                [0, 0, 0]
2          C1CC1                          [0, 0, 0, 0, 0]
3      C1CC1(C)C              [0, 0, 0, 0, 0, 0, 1, 0, 0]
4         CC(C)C                       [0, 0, 0, 1, 0, 0]

when it should be this

SMILES        branch_count
0  C(N)(N)CC(N)C  [0, 1, 2, 0, 0, 1, 0]
1            CCC           [0, 0, 0]
2          C1CC1           [0, 0, 0]
3      C1CC1(C)C        [0, 0, 0, 1, 0]
4         CC(C)C           [0, 0, 1, 0]

import pandas as pd
import numpy as np
from rdkit import Chem

def get_branch_count(smile):
    # Initialize variables
    branch_count = [0] * len(smile)
    bracket_count = 0
    current_count = 0

    # Loop through each character in the smile
    for i, c in enumerate(smile):
        # If the character is an open bracket, increment bracket count
        if c == "(":
            bracket_count += 1
        # If the character is a close bracket, decrement bracket count
        elif c == ")":
            bracket_count -= 1
            # If there are no more open brackets after this one, reset current count
            if bracket_count == 0:
                current_count = 0
        # If the character is not a bracket, update the current count
        else:
            if bracket_count > 0:
                # If the previous character was also a bracket, don't increment the count
                if smile[i-1] != ")":
                    current_count += 1
            else:
                current_count = 0
            branch_count[i] = current_count

    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)

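Your loop treats the brackets as ordinary characters, so every opening and closing bracket gets an entry in the list as if it were an atom. You should use .isalpha() to check whether a character is a letter. You should also keep a separate index (mine is n) that tracks which entry of the result the current letter belongs to. In your code, for example, the brackets and the digits also get a 0/1 entry, which means you end up with extra entries that you don't want. Read my comments for more explanation, and run this code yourself to make sure it is correct (although I have checked it several times).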

import pandas as pd
import numpy as np
from rdkit import Chem

# All changes in function
def get_branch_count(smile):
    # Initialize variables
    n = 0 # Index of the next slot in branch_count, so that only letters get an entry
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    bracket_count = 0
    current_count = 0
    # Loop through each character in the smile
    for i, c in enumerate(smile):
        if c == '(':
            bracket_count += 1

        # Continue after the IF statement because the letters are now inside of the brackets
        elif bracket_count >= 1 and c.isalpha():
            current_count = bracket_count
            branch_count[n] = current_count
            n += 1
        # This is to check if there are consecutive branches
        elif c == ')':
            # The bounds check avoids an IndexError when ')' is the last character
            if i + 1 >= len(smile) or smile[i+1] != '(':
                bracket_count = 0

        # If the character is not surrounded by brackets and if it is alphabetical
        elif c.isalpha() and bracket_count == 0:
            current_count = 0
            branch_count[n] = current_count # Do this inside of each IF statement for the alphabetical chars so that it doesn't include the brackets
            n += 1

    return branch_count

def collect_branch_count(smile_list):
    rows = []
    for smile in smile_list:
        branch_count = get_branch_count(smile)
        data = {"branch_count": branch_count}
        row = {"SMILES": smile}
        for key, value in data.items():
            row[key] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df

smile_list = ["C(N)(N)CC(N)C", "CCC", "C1CC1", "C1CC1(C)C", "CC(C)C"]
df = collect_branch_count(smile_list)
print(df)

As you can see, I changed a few things:

  • Instead of doing branch_count = [0] * len(smile), I changed it to:

    ```python
    # This makes sure there are no extra entries (for example for the brackets and other non-alphabetical characters).
    length_smile = 0
    for char in smile:
        if char.isalpha():
            length_smile += 1
    branch_count = [0] * length_smile
    ```
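
To see why the list length matters, here is a small sketch comparing the two ways of sizing it. It assumes, as the code above does, that every atom is written as a single letter. The helper name `count_atoms` is hypothetical and not part of the original code.

```python
def count_atoms(smile):
    """Count alphabetical characters, assuming one letter per atom."""
    return sum(1 for char in smile if char.isalpha())

smile = "C1CC1(C)C"
# len() counts every character, including ring digits and brackets
print(len(smile))
# count_atoms() counts only the letters, which is the length the result needs
print(count_atoms(smile))
```

For this SMILES the two numbers differ by four: two ring digits and two brackets.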
    

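Here is my solution.

First, I replace every C1 with C, so that each remaining character outside the brackets is a single letter to evaluate. Then I count the open brackets. If exactly one bracket is open, I have a new group. After a closing bracket, I check whether the next character is an opening bracket, to see whether a consecutive group follows. If not, I reset the counter to 0.

import pandas as pd

def smile_grouping(s):
    # Drop the ring-closure label so that only atoms and brackets remain
    s = s.replace('C1', 'C')
    open_brackets = 0
    group_counter = 0
    res = []
    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            # Only a top-level bracket starts a new group
            if open_brackets == 1:
                group_counter += 1
        elif letter == ')':
            open_brackets -= 1
        else:
            res.append(group_counter)
        # Outside all brackets: reset unless another group follows directly
        if open_brackets == 0:
            if i+1<len(s) and s[i+1] != '(':
                group_counter = 0
    return res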

This is the result:

df = pd.DataFrame(
    {'smile':[
        "C(N)(N)CC(N)C",
        "CCC",
        "C1CC1",
        "C1CC1(C)C",
        "CC(C)C",
        "C(N)(N)(N)CC(N)C",
        "C((N)(N)N)CC(N)C",
        "CC(CCC)CCCC",
        "CC(CC(C))"
    ]})
df['branch_count'] = df['smile'].apply(smile_grouping)
>>> df
smile                 branch_count
0     C(N)(N)CC(N)C        [0, 1, 2, 0, 0, 1, 0]
1               CCC                    [0, 0, 0]
2             C1CC1                    [0, 0, 0]
3         C1CC1(C)C              [0, 0, 0, 1, 0]
4            CC(C)C                 [0, 0, 1, 0]
5  C(N)(N)(N)CC(N)C     [0, 1, 2, 3, 0, 0, 1, 0]
6  C((N)(N)N)CC(N)C     [0, 1, 1, 1, 0, 0, 1, 0]
7       CC(CCC)CCCC  [0, 0, 1, 1, 1, 0, 0, 0, 0]
8         CC(CC(C))              [0, 0, 1, 1, 1]
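
The replace('C1', 'C') step only removes the ring label 1 when it directly follows a C. As a possible variation, the sketch below strips every digit before counting and records an entry only for letters. It assumes single-digit ring-closure labels and single-letter atoms, and the function name `smile_grouping_any_ring` is hypothetical, not part of the answer above.

```python
def smile_grouping_any_ring(s):
    # Assumption: ring-closure labels are single digits, so removing
    # all digits leaves only atoms, brackets and bond symbols
    s = ''.join(ch for ch in s if not ch.isdigit())
    open_brackets = 0
    group_counter = 0
    res = []
    for i, ch in enumerate(s):
        if ch == '(':
            open_brackets += 1
            # Only a top-level bracket starts a new group
            if open_brackets == 1:
                group_counter += 1
        elif ch == ')':
            open_brackets -= 1
        elif ch.isalpha():
            # Assumption: one letter per atom; other symbols get no entry
            res.append(group_counter)
        # Outside all brackets: reset unless another group follows directly
        if open_brackets == 0 and i + 1 < len(s) and s[i + 1] != '(':
            group_counter = 0
    return res

print(smile_grouping_any_ring("C2CC2(C)C"))
```

With this variation, a ring labelled 2, or a ring opened on an atom other than C, is handled the same way as C1CC1(C)C.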
