Regular expression for MCQ-type strings



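How can I extract multiple-choice questions and their options from a text document? Each question starts with a number followed by a dot. A question can span multiple lines and may or may not end with a full stop or a question mark. I want to build a dictionary containing each question number with its corresponding question and options. I am using Python for this.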

17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these
some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m

Solution

Assume your questions are stored in a file named questions.txt.

17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these
some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m

Python code that parses questions.txt as required.

import re

filename = 'questions.txt'
questions = []

with open(file=filename, mode='r', encoding='utf8') as f:
    lines = f.readlines()

is_label = False  # True when the line is a label: 17. | (a) | (b) | (c) | (d)
# Flags recording which part of a question the current line belongs to
is_statement = is_option_a = is_option_b = is_option_c = is_option_d = False
statement = option_a = option_b = option_c = option_d = ''

for line in lines:
    if re.match(r'^\d+\.$', line):
        is_statement = is_label = True
        is_option_a = is_option_b = is_option_c = is_option_d = False
    elif re.match(r'^\(a\)$', line):
        is_option_a = is_label = True
        is_statement = is_option_b = is_option_c = is_option_d = False
    elif re.match(r'^\(b\)$', line):
        is_option_b = is_label = True
        is_statement = is_option_a = is_option_c = is_option_d = False
    elif re.match(r'^\(c\)$', line):
        is_option_c = is_label = True
        is_statement = is_option_a = is_option_b = is_option_d = False
    elif re.match(r'^\(d\)$', line):
        is_option_d = is_label = True
        is_statement = is_option_a = is_option_b = is_option_c = False
    else:
        is_label = False

    # Label lines carry no content of their own
    if is_label:
        continue

    if is_statement:
        statement += line  # statements may span several lines
    elif is_option_a:
        option_a = line.rstrip()  # a multi-line option keeps only its last line
    elif is_option_b:
        option_b = line.rstrip()
    elif is_option_c:
        option_c = line.rstrip()
    elif is_option_d:
        option_d = line.rstrip()
        # The first line after (d) completes the question; later stray lines
        # are skipped because the statement has already been reset
        if statement:
            questions.append({
                'statement': statement.rstrip(),
                'options': [option_a, option_b, option_c, option_d]
            })
            statement = option_a = option_b = option_c = option_d = ''

print(questions)

Output

[
  {
    "statement": "If you go on increasing the stretching force on a wire in a\nguitar, its frequency.",
    "options": [
      "increases",
      "decreases",
      "remains unchanged",
      "None of these"
    ]
  },
  {
    "statement": "A vibrating body",
    "options": [
      "will always produce sound",
      "vibration is low",
      "will produce sound which depends upon frequency",
      "None of these"
    ]
  },
  {
    "statement": "The wavelength of infrasonics in air is of the order of",
    "options": [
      "100 m",
      "101 m",
      "10–1 m",
      "10–2 m"
    ]
  }
]

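Side notes

  • Text such as "some random text between questions" is ignored.
  • Questions whose statement spans multiple lines are kept as they are (the newline is intentionally not removed). You may choose to replace \n with a space character.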
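
As a small illustration of the second note, the newlines inside a multi-line statement can be collapsed into single spaces. This is only a sketch; the helper name `flatten` is made up for this example and is not part of the answer's code.

```python
def flatten(text):
    """Collapse every run of whitespace, including newlines, into one space."""
    return ' '.join(text.split())


# A statement as the parser above would store it, with an embedded newline
statement = ("If you go on increasing the stretching force on a wire in a\n"
             "guitar, its frequency.")
flattened = flatten(statement)
print(flattened)
```

Using `split()` followed by `join()` also removes leading and trailing whitespace, so a separate `rstrip()` is not needed.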

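Hamza's answer is good, but it overlooks the fact that an answer may span multiple lines.

A better solution (assuming the question text is in a file named data.txt):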

import re

with open('data.txt', 'r', encoding='utf8') as file:
    data = file.read()

# Split the text into questions, assuming there are no empty lines inside a question
questions = re.split(r'\n\s*\n', data)
final_questions = []

for question in questions:
    # Extra check to make sure that this chunk really is a question
    if question and '(a)' in question:
        # Everything before the first '(' is the statement
        statement = re.findall(r'[^(]+', question)[0].replace('\n', ' ').rstrip()
        # Each option runs from its label up to the next '('
        option_a = re.findall(r'\(a\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_b = re.findall(r'\(b\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_c = re.findall(r'\(c\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_d = re.findall(r'\(d\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        final_questions.append({
            'statement': statement.rstrip(),
            'options': [option_a, option_b, option_c, option_d]
        })

print(final_questions)

Output:

[
  {
    "statement": "17. If you go on increasing the stretching force on a wire in a guitar, its frequency.",
    "options": [
      "(a) increases",
      "(b) decreases",
      "(c) remains unchanged",
      "(d) None of these"
    ]
  },
  {
    "statement": "18. A vibrating body",
    "options": [
      "(a) will always produce sound",
      "(b) may or may not produce sound if the amplitude of vibration is low",
      "(c) will produce sound which depends upon frequency",
      "(d) None of these"
    ]
  },
  {
    "statement": "19. The wavelength of infrasonics in air is of the order of",
    "options": [
      "(a) 100 m",
      "(b) 101 m",
      "(c) 10–1 m",
      "(d) 10–2 m"
    ]
  }
]

Note: there must be at least one blank line between questions.
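
If the file has no blank lines between questions, as in the sample shown in the question, one possible workaround is to split at the question-number lines instead. The following is a minimal sketch under that assumption; the function name `split_on_question_numbers` is hypothetical, and splitting on a zero-width match requires Python 3.7 or newer.

```python
import re

sample = (
    "17.\nIf you go on increasing the stretching force on a wire in a\n"
    "guitar, its frequency.\n(a)\nincreases\n(b)\ndecreases\n"
    "some random text between questions\n"
    "18.\nA vibrating body\n(a)\nwill always produce sound\n"
)


def split_on_question_numbers(text):
    """Split text into chunks, each starting at a line such as '17.'."""
    # The lookahead matches the empty position just before a line that
    # consists only of digits and a dot, so no text is consumed by the split.
    chunks = re.split(r'(?m)^(?=\d+\.[ \t]*$)', text)
    # Drop empty chunks, such as the one before the first question
    return [chunk for chunk in chunks if chunk.strip()]


chunks = split_on_question_numbers(sample)
print(len(chunks))
```

Stray text between two questions ends up at the tail of the preceding chunk, so it would still need to be filtered out when the options are extracted.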

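Regex: \d+\.([^(]+)

It matches a number followed by a dot. It then captures everything that is not a ( (the start of the answers).

If you doubt that it is that simple, try the regular expression in a regex tester.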

Python code:

import re # Imports the standard regex module
text_doc = """
17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these
some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m
"""
question_getter = re.compile(r'\d+\.([^(]+)')  # raw string keeps the backslashes intact
print(question_getter.findall(text_doc))

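Edit: since a lot of people here are parsing things, I figured I would parse things too.

Regex for getting the possible answers: \([a-zA-Z]+\)\n(.+)

Proof

Updated Python: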

import re # Imports the standard regex module

text_doc = """
17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these
some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m
"""
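question_getter = re.compile(r'\d+\.([^(]+)')
answer_getter = re.compile(r'\([a-zA-Z]+\)\n(.+)')

# This is where the magical parsing happens
# It could've been organized differently
# Each question is paired only with the answers found between it and the next
# question; searching the whole text would give every question all the answers
question_matches = list(question_getter.finditer(text_doc))
parsed = {}
for i, match in enumerate(question_matches):
    # The answers for this question end where the next question starts
    end = question_matches[i + 1].start() if i + 1 < len(question_matches) else len(text_doc)
    parsed[match.group(1).strip()] = answer_getter.findall(text_doc, match.end(), end)
print(parsed)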

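You can use the following regular expression with Python's standard re module to match each question.

r'(?P<number>\d+)\. *\r?\n(?P<question>(?:(?!\([a-z]\)).*\r?\n)+)(?P<options>(?:(?!(?<=\n)\d+\. *\r?\n).*\r?\n)+)'

The question number is held in the named capture group number, the question itself in the capture group question, and the options in the capture group options.

It is then straightforward to retrieve the contents of the capture groups with Python code and process them as needed. For example, you could build a list of questions, each one a dictionary with keys for the number, question and options, or perhaps a dictionary whose keys are the question numbers and whose values are dictionaries with question and options keys.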
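
The second layout described above (a dictionary keyed by question number) could be built roughly as follows. This is a sketch, not part of the original answer: `parse_questions` and `OPTION_RE` are names introduced here for illustration, and the option-splitting pattern assumes that option text never contains a `(` character.

```python
import re

# The pattern from the answer, written as adjacent raw-string pieces
QUESTION_RE = re.compile(
    r'(?P<number>\d+)\. *\r?\n'
    r'(?P<question>(?:(?!\([a-z]\)).*\r?\n)+)'
    r'(?P<options>(?:(?!(?<=\n)\d+\. *\r?\n).*\r?\n)+)'
)
# Hypothetical helper pattern: an option label followed by its text
OPTION_RE = re.compile(r'\(([a-z])\)\s*([^(]+)')


def parse_questions(text):
    """Return {number: {'question': str, 'options': {label: str}}}."""
    result = {}
    for match in QUESTION_RE.finditer(text):
        options = {
            label: ' '.join(body.split())  # join multi-line option text
            for label, body in OPTION_RE.findall(match.group('options'))
        }
        result[int(match.group('number'))] = {
            'question': ' '.join(match.group('question').split()),
            'options': options,
        }
    return result


sample = (
    "18.\nA vibrating body\n(a)\nwill always produce sound\n"
    "(b)\nmay or may not produce sound if the amplitude of\n"
    "vibration is low\n(c)\nwill produce sound which depends upon frequency\n"
    "(d)\nNone of these\n"
    "19.\nThe wavelength of infrasonics in air is of the order of\n"
    "(a)\n100 m\n(b)\n101 m\n(c)\n10–1 m\n(d)\n10–2 m\n"
)
parsed = parse_questions(sample)
print(parsed[18]['options']['b'])
```

Note that the options group captures every line up to the next question number, so stray text between two questions would be appended to the last option of the preceding question unless it is filtered out separately. The text must also end with a newline for the final option line to match.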

Start your engine!

Python's regex engine performs the following operations.

(?P<number>\d+)  : match 1+ digits in capture group 'number'
\. *\r?\n        : match '.', 0+ spaces, line terminator
(?P<question>    : begin capture group 'question'
  (?:            : begin non-capture group
    (?!          : begin negative lookahead
      \([a-z]\)  : match '(', one lowercase letter, ')'
    )            : end negative lookahead
    .*\r?\n      : match 0+ characters, '\r' optionally, '\n'
  )              : end non-capture group
  +              : execute non-capture group 1+ times
)                : end capture group 'question'
(?P<options>     : begin capture group 'options'
  (?:            : begin non-capture group
    (?!          : begin negative lookahead
      (?<=\n)    : positive lookbehind asserts next character is
                   preceded by a '\n'
      \d+        : match 1+ digits
      \. *\r?\n  : match '.', 0+ spaces, line terminator
    )            : end negative lookahead
    .*\r?\n      : match 0+ characters, '\r' optionally, '\n'
  )              : end non-capture group
  +              : execute non-capture group 1+ times
)                : end capture group 'options'

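In two places I match any character (.). That could of course be replaced with a character class that restricts the possibilities, such as [a-zA-Z\d() -–].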
