Splitting a string in Python on two conditions (a delimiter and a "contains" check)



Consider the following string:

my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

I want to extract the book titles and authors, so the expected output is:

output = [
    ['Harry Potter', 'JK Rowling'],
    ['Dune (first book)', 'Frank Herbert'],
    ['and Le Petit Prince', 'Antoine de Saint Exupery']
]

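The basic two-step approach would be:

  • Use re.split on a list of punctuation characters ((),;\n etc.) to break the text into sentences, or at least sentence fragments.
  • Keep only the strings that contain 'by', then split again on 'by' to separate the title from the author.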

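That approach covers about 90% of cases, but the main problem is the parentheses (): I want to keep them when they are part of a title (as in Dune), but treat them as a delimiter when they come after the author (as in Saint Exupery).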
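To make the parenthesis problem concrete, here is a minimal sketch of that two-step approach. The helper name naive_split and the exact delimiter set are assumptions chosen for illustration, not code from the question.

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""


def naive_split(text):
    """Hypothetical helper: split on punctuation, then keep fragments with ' by '."""
    # Step 1: split on parentheses, commas, semicolons, and newlines.
    pieces = re.split(r"[(),;\n]", text)
    pairs = []
    for piece in pieces:
        # Step 2: keep only fragments containing " by " and split them once.
        if " by " in piece:
            title, author = piece.split(" by ", 1)
            pairs.append([title.strip(), author.strip()])
    return pairs


naive_result = naive_split(my_text)
print(naive_result)
# [['Harry potter', 'JK Rowling'], ['', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```

The Dune title is lost because the split on "(" and ")" happens before " by " is examined, which is exactly the case the parentheses rule needs to handle.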

I suspect a single powerful regular expression could cover both cases, but I'm not sure exactly how to write it.

I'm not sure this counts as "a powerful regular expression", but it does the job:

import re
text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
# Group 1: the title (everything before " by ").
# Group 2: the author (one or more words separated by single spaces).
pattern = r" *(.+) by ((?: ?\w+)+)"
matches = re.findall(pattern, text)
res = []
for match in matches:
    res.append((match[0], match[1]))
print(res) # [('Harry potter', 'JK Rowling'), ('Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery')]

First split the text into lines:

lines = my_text.splitlines()

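Then you can apply a regular expression to each line, for example ([A-Z0-9].*?) by ([a-zA-Z' -]+)

This matches a capital letter (or a digit) followed by any characters, up to the first " by ". Requiring a capital letter or digit skips the "and" at the start of the last line, since I assume most book titles start with a digit or a capital letter.

After " by ", the regex matches everything made up of letters, apostrophes, spaces, and hyphens, which I'd guess covers most English names. Feel free to add more characters, such as accented letters or other alphabets. Because spaces are allowed, the author group can end with a trailing space (as on the last line), so strip the result.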
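Putting the two pieces together, a minimal sketch of the per-line approach might look like this. The names line_pattern and extract_books are made up for illustration; only the regex itself comes from the answer.

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

# Title: starts at a capital letter or digit, lazily up to the first " by ".
# Author: letters, apostrophes, spaces, and hyphens only.
line_pattern = re.compile(r"([A-Z0-9].*?) by ([a-zA-Z' -]+)")


def extract_books(text):
    """Hypothetical helper: return [title, author] pairs, one per matching line."""
    books = []
    for line in text.splitlines():
        match = line_pattern.search(line)
        if match:
            # The author group may end with a space before "(", so strip it.
            books.append([match.group(1), match.group(2).strip()])
    return books


books = extract_books(my_text)
print(books)
# [['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['Le Petit Prince', 'Antoine de Saint Exupery']]
```

Unlike the expected output in the question, this drops the leading "and" from the last title, because the title group has to start at a capital letter or digit.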

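You can use the 're' module to build a robust regular expression that produces the answer you want. I wrote the code to be easy to follow; if you want to go further, you can condense it into your own shorter version.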

import re
my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
output = []
# Split into fragments on semicolons, commas, and newlines.
for sentence in re.split(r'[;,\n]', my_text):
    # Keep only fragments of the form "<title> by <author>".
    match = re.search(r'(.*)\sby\s(.*)', sentence)
    if match:
        # Drop a parenthesized remark that follows the author.
        match_group_2 = re.sub(r'\s*\(.*\)', '', match.group(2))
        output.append([match.group(1).strip(), match_group_2.strip()])

print(output)
[
['Harry potter', 'JK Rowling'], 
['Dune (first book)', 'Frank Herbert'], 
['and Le Petit Prince', 'Antoine de Saint Exupery.']
]

Thanks.

I would do it like this:

import re
my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
lines = my_text.splitlines()
for line in lines:
    if " by " in line:
        # Strip stray punctuation, then split title and author once.
        line = re.sub("[!?:;,#*@]", "", line)
        title, author = line.split(" by ", 1)
        if "(" in author:
            # Drop a parenthesized remark after the author, plus the space before it.
            author = author.split("(", 1)[0].strip()
        print(f"{title}, {author}")

Matching with a short regular expression:

books = [[t, a.strip()] for t, a in re.findall(r'\s*(.+) by ([^(),;]+)', my_text, re.M)]

[['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]

This is how I did it, and it works well for me. I hope it is simple and easy to follow.

import re
my_text = """
My favorites books of all time are:
Harry potter by JK Rowling, 
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""  
# remove newlines
text = my_text.replace('\n', '')
# split on ":" and get the second element
text = re.split('[:]', text)[1]
# split on "," and ";"
text_list = re.split('[,;]', text)
# pattern for matching: 2+ whitespace characters, "and" followed by a space,
# or a trailing " (...)." remark in parentheses
pattern = r"(\s\s+)|and\s|\s\((.*?)\)[.]$"
# remove the matched parts, then strip any single leading/trailing space
text_list = [re.sub(pattern, "", text_element).strip() for text_element in text_list]
# split on "space+by+space"
result = [re.split(r'\sby\s', element) for element in text_list]
print(result) # [['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['Le Petit Prince', 'Antoine de Saint Exupery']]
