有效地预处理字符串列表

我有一个字符串列表。示例字符串，

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"

对于字符串（其中）中的每个术语，术语用逗号分隔，我想删除"/"之后的所有文本。如果没有反斜杠，则不执行任何操作。

例如，我希望生成的字符串是这样的，

 mesh = "Adrenergic beta-Antagonists, Adult, Aged, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies"

然后我想删除字符串中的任何重复值（例如。年纪大）。所需字符串，

mesh = "Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies"

我编写了适用于一个字符串的代码，但是正在寻找一种更有效的方法来为字符串列表执行此操作：

import string
mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"
newMesh = []
for each in mesh.split(","):
    newMesh.append(each.split('/', 1)[0].lstrip(' '))
newMesh = list(set(newMesh))
meshString = ",".join(newMesh)
print(meshString)

注意：字符串中术语的顺序无关紧要。

您可以使用

re.sub：

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"
import re
s = re.sub("/*[ws]+", '', mesh)
final_string = []
for i in re.split(",", s):
    if i not in final_string:
        final_string.append(i)
new_final_string = ', '.join(final_string)
print(new_final_string)

输出：

'Adrenergic beta-Antagonists,  Adult,  Aged,  Antihypertensive Agents,  Blood Glucose,  Celiprolol,  Female,  Glucose Tolerance Test,  Humans,  Hypertension,  Male,  Middle Aged,  Prospective Studies'

使用re.sub()函数和set对象（用于更快的项目搜索）：

import re
mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"
word_set = set()
result = []
for w in re.sub(r'/[^,]+', '', mesh).split(','):
    w = w.strip()
    if w not in word_set:
        result.append(w)
        word_set.add(w)
result = ', '.join(result)
print(result)

输出：

Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies

相关内容

最新更新

热门标签：