How to split data into sentences using the [nn] pattern



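Problem: split txt into multiple sentences using the [nn] pattern. The expected result is 10 sentences. A lookbehind gives me 10 sentences, but the [nn] marker is missing from each one. A lookahead gives me 9 sentences and misses the last one. I need the [nn] pattern to be included in each sentence.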

txt="[01] Final Step - Protonica [02] Liquid Frequencies (Liquid Soul Mix) - Liquid Soul [03] Global Illumination - Liquid Soul [04] Devotion - Liquid Soul [05] Black Rock City - Quantize [06] Plazza Del Trripy - Andromeda [07] Private Guide - Suntree [08] Stereo Gun - Vibrasphere [09] The Cycle - Ritree [10] Atmonizer - Andromed"

I used a lookahead to find the matches, but the last sentence is missing.

import re

print(".+? is the ungreedy character match")
#print(r"(?<=\[\d{2}\]) is the lookbehind character match")
print(r"(?=\[\d{2}\]) is the lookahead character match")
#pattern=r"(?<=\[\d{2}\]).+?(?=\[\d{2}\])"
# Lazy match up to (but not including) the next [nn] marker
pattern=r".+?(?=\[\d{2}\])"
matches=re.findall(pattern,txt)
for match in matches:
    print("output",match)
output:
output [01] Final Step - Protonica 
output [02] Liquid Frequencies (Liquid Soul Mix) - Liquid Soul 
output [03] Global Illumination - Liquid Soul 
output [04] Devotion - Liquid Soul 
output [05] Black Rock City - Quantize 
output [06] Plazza Del Trripy - Andromeda 
output [07] Private Guide - Suntree 
output [08] Stereo Gun - Vibrasphere 
output [09] The Cycle - Ritree 

Missing:

output [10] Atmonizer - Andromed
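
The reason the last sentence goes missing can be reproduced on a shorter input. The sketch below assumes a made-up three-item string (`short_txt` is not from the question): the lookahead demands that another [nn] marker follows every match, and nothing follows the final item.

```python
import re

# Hypothetical shortened input, used only to illustrate the behaviour.
short_txt = "[01] A [02] B [03] C"

# Every match must be followed by another [nn] marker.
lookahead_only = re.findall(r".+?(?=\[\d{2}\])", short_txt)

# No marker follows "[03] C", so it never satisfies the lookahead.
print(len(lookahead_only), lookahead_only)
```

Only the first two items are returned here, which mirrors the 9-out-of-10 result above.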

I then used a lookahead with |$ as an alternative, so that the last sentence is matched as well.

print(".+? is the ungreedy character match")
print(r"(?=\[\d{2}\]) is the lookahead character match")
# Start at an [nn] marker, then stop before the next marker or at the end of the string
pattern=r"\[\d{2}\].+?(?=\[\d{2}\]|$)"
matches=re.findall(pattern,txt)
for match in matches:
    print("output",match)

Output:

output [01] Final Step - Protonica 
output [02] Liquid Frequencies (Liquid Soul Mix) - Liquid Soul 
output [03] Global Illumination - Liquid Soul 
output [04] Devotion - Liquid Soul 
output [05] Black Rock City - Quantize 
output [06] Plazza Del Trripy - Andromeda 
output [07] Private Guide - Suntree 
output [08] Stereo Gun - Vibrasphere 
output [09] The Cycle - Ritree 
output [10] Atmonizer - Andromed
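
If the trailing space on each sentence is unwanted, the same pattern can be wrapped in a small function. This is only a sketch: the name `split_tracks` and the shortened `sample` string are assumptions for illustration, not part of the question.

```python
import re

# Same pattern as above: an [nn] marker, then a lazy match that stops
# before the next marker or at the end of the string.
TRACK = re.compile(r"\[\d{2}\].+?(?=\[\d{2}\]|$)")

def split_tracks(text):
    """Return each [nn] sentence with surrounding whitespace removed."""
    return [m.strip() for m in TRACK.findall(text)]

# Hypothetical shortened sample taken from the first, fourth and last items.
sample = "[01] Final Step - Protonica [04] Devotion - Liquid Soul [10] Atmonizer - Andromed"
tracks = split_tracks(sample)
for track in tracks:
    print("output", track)
```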

In general, you can avoid the lookahead and simply match characters up to, but not including, the next [

>>> re.findall(r"\[[^\[]+", txt)
['[01] Final Step - Protonica ', '[02] Liquid Frequencies (Liquid Soul Mix) - Liquid Soul ', '[03] Global Illumination - Liquid Soul ', '[04] Devotion - Liquid Soul ', '[05] Black Rock City - Quantize ', '[06] Plazza Del Trripy - Andromeda ', '[07] Private Guide - Suntree ', '[08] Stereo Gun - Vibrasphere ', '[09] The Cycle - Ritree ', '[10] Atmonizer - Andromed']

This works by finding a [ and then greedily matching any characters that are not [ (which would start the next block).

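This approach works for your current input, but it will not work if you expect further [ characters inside a block; in that case, you should search for the whole [nn] marker instead, or clean up the input so that those characters are stripped out beforehand.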
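
To make that limitation concrete, here is a small comparison on an invented input in which a title itself contains a bracket (`tricky` is a hypothetical string, not taken from the question). The negated character class splits at every [, while the pattern that looks for a full two-digit marker keeps the title intact.

```python
import re

# Hypothetical input: the first title contains its own "[Live]" bracket.
tricky = "[01] Intro [Live] - A [02] Outro - B"

# Negated character class: breaks at every "[", including "[Live]".
by_bracket = re.findall(r"\[[^\[]+", tricky)

# Whole-marker search: only a two-digit [nn] marker ends a block.
by_marker = re.findall(r"\[\d{2}\].+?(?=\[\d{2}\]|$)", tricky)

print(by_bracket)
print(by_marker)
```

The whole-marker version would still be fooled by a title containing something like [99], so cleaning the input may be the safer option when the data is not under your control.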
