Python regex to omit a complex citation style from text



I have read the contents of a file into Python and I want to remove all citations, which follow the same general format:

(Author et al., .............. \nGoogle Scholar) # there could be many '\nGoogle Scholar's within the brackets

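Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on β-cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.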

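Desired output

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.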

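I couldn't find a single regex that omits all the citations in one go, so I have had to do it in two parts:

  1. Find the position of every occurrence of "Google Scholar)"
  2. Extend backwards from each position until the corresponding opening bracket appears, then omit the characters between those indices

My attempt is as follows: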

import re

def remove(test_str):
    regex = re.compile(r'\nGoogle Scholar\)')
    starts = []
    ends = []
    ret = ''
    for m in regex.finditer(test_str):   # find all 'Google Scholar)'
        ends.append(m.end())
    for e in ends:                       # find all starting brackets
        i = e
        while True:
            if bool(re.match(r'\(\D+', test_str[i-2:i])):
                starts.append(i-2)
                break
            else:
                i -= 1
    start = test_str[:starts[0]]         # omit all characters in between
    starts = starts[1:]
    end = test_str[ends[-1]:]
    ends = ends[:-1]
    for i, j in zip(starts, ends):
        ret = ret + test_str[j:i]
    return start + ret + end

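However, this strategy fails because the regex I use to find each opening bracket (\(\D+) is not precise enough: the citations often contain brackets of their own, e.g.

(Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar)

So in this case the search for the correct opening bracket stops too early.

Can anyone recommend a good way to remove all the citations consistently?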

Based on the pattern you describe, you can use this regex,

(?s)\(.*?Google Scholar\) ?

and replace it with an empty string. Here (?s) is used so that . also matches newlines.


Here is a Python code demo,

import re

s = 'Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on β-cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.'
# Lazily match from an opening "(" to the nearest "Google Scholar)", plus one optional trailing space
replacedStr = re.sub(r'(?s)\(.*?Google Scholar\) ?', '', s)
print(replacedStr)

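This prints the following, which is the output you mentioned in your post.

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.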
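One caveat with a lazy "( ... Google Scholar)" match is that it starts at the first opening bracket it meets, whether or not that bracket belongs to a citation. The short sketch below illustrates this on a made-up sentence (the sample text and the name SIMPLE are assumptions for illustration, not part of the original post):

```python
import re

# The same pattern as above, compiled once.
SIMPLE = re.compile(r'(?s)\(.*?Google Scholar\) ?')

# Hypothetical input: an ordinary parenthetical comes before the citation.
sample = ('Islets (clusters of cells) secrete insulin '
          '(Doe etal., 2020Doe J. A made-up title.\nGoogle Scholar). Next sentence.')

# The match begins at "(clusters", so ordinary prose is deleted too.
result = SIMPLE.sub('', sample)
print(result)
```

On this input everything from "(clusters" up to the citation's closing bracket is removed, so the pattern is probably safest when the only brackets in the running text are citations.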

import re

if __name__ == '__main__':
    source = """Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on β-cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr."""
    # Require " etal., " after the opening bracket so that only citations match;
    # the leading space in the pattern removes the space before "(" as well.
    output = re.sub(r' \(.*? etal\., .*?\nGoogle Scholar\)', '', source, flags=re.DOTALL)
    print(output)

Output

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses. Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes. We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.

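I would solve it in the following way, which matches your desired output to the letter and leaves ordinary parentheses in the text (those that are not citations) alone:

  1. Look for the opening (

  2. Look for repetitions of [^()]+(?:\([^()]+\))?, i.e. one or more characters that are not parentheses, followed by an optional ( ) pair containing one or more characters that are not parentheses.

  3. Look for the closing \nGoogle Scholar)

  4. Split and join with a space to remove multiple spaces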
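The steps above can be sketched with re.VERBOSE so that each part of the pattern carries its own comment. This is only an illustrative layout of the same regex; the names CITATION and strip_citations and the sample sentence are assumptions made for the example.

```python
import re

CITATION = re.compile(r'''
    \(                      # 1. the opening bracket of the citation
    (?:
        [^()]+              # 2. a run of non-bracket characters ...
        (?:\([^()]+\))?     #    ... optionally followed by one inner (...) pair
    )+
    \nGoogle[ ]Scholar\)    # 3. the closing newline + Google Scholar + bracket
''', re.VERBOSE)

def strip_citations(text):
    """Remove citations, then collapse runs of whitespace (step 4)."""
    without = CITATION.sub('', text)
    return ' '.join(without.split())

# Hypothetical sample: one citation with inner brackets, one ordinary parenthetical.
sample = ('responses (Doe etal., 2020Doe J. A title (ABC) here.Scopus (2)'
          '\nGoogle Scholar). Next (not a citation) sentence.')
print(strip_citations(sample))
```

Because the space in "Google Scholar" would be ignored in verbose mode, it is written as [ ]. The ordinary parenthetical survives because it never reaches the closing "Google Scholar)" marker.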

Code:

import re

text = 'Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on β-cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.'
# Remove each citation, then split and re-join to collapse multiple spaces
fixed_text = ' '.join(re.sub(r'\((?:[^()]+(?:\([^()]+\))?)+\nGoogle Scholar\)', '', text).split())
print(fixed_text)

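Output:

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.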

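This can be improved by changing to the following code, which also removes the space before the leading (, but then the result no longer matches your desired output (which is flawed, since it leaves a space before the period):

fixed_text = re.sub(r' ?\((?:[^()]+(?:\([^()]+\))?)+\nGoogle Scholar\)', '', text)

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses. Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes. We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.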
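The pattern above allows only one level of brackets inside a citation. If deeper nesting can occur, one possible alternative is to return to the two-step idea from the question and count bracket depth while walking backwards from each "Google Scholar)". The following is a minimal sketch under that assumption; the function name remove_citations, its marker parameter, and the sample text are hypothetical.

```python
def remove_citations(text, marker='Google Scholar)'):
    """Drop every '( ... Google Scholar)' span, pairing brackets by depth."""
    pieces = []                 # kept fragments, collected right to left
    cut = len(text)             # text[cut:] has already been handled
    end = text.rfind(marker)
    while end != -1:
        close = end + len(marker) - 1      # index of the citation's final ')'
        depth = 0
        start = None
        for i in range(close, -1, -1):     # walk backwards, tracking depth
            if text[i] == ')':
                depth += 1
            elif text[i] == '(':
                depth -= 1
                if depth == 0:             # the bracket that opens the citation
                    start = i
                    break
        if start is None:                  # unbalanced brackets: stop, keep the rest
            break
        pieces.append(text[close + 1:cut])
        cut = start
        end = text.rfind(marker, 0, start)
    pieces.append(text[:cut])
    return ''.join(reversed(pieces))

# Hypothetical sample with nested brackets and an ordinary parenthetical.
sample = ('A (note) b (Doe etal., 2020 Title (X) more.Scopus (2)\nGoogle Scholar). '
          'C (Roe etal., 2019 T.\nGoogle Scholar) d')
cleaned = ' '.join(remove_citations(sample).split())
print(cleaned)
```

Working from the right means earlier indices stay valid while later spans are cut. The sketch assumes brackets inside each citation are balanced.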
