My list:
[
"Mathematics-2 (21SMT-125)",
"Mid-Semester Test-1",
"40",
"23.5",
"Mid-Semester Test-2",
"40",
"34",
"Disruptive Technologies - 2 (21ECH-103)",
"Experiment-1",
"20",
"19",
"Experiment-2",
"20",
"17",
"Experiment-3",
"20",
"18.5",
]
I parsed this list of strings out of HTML using bs4 and want to convert it to this format:
{
    "Subject": {
        "Mathematics-2 (21SMT-125)": {
            "Mid-Semester Test-1": [40, 23.5],
            "Mid-Semester Test-2": [40, 34]
        },
        "Disruptive Technologies - 2 (21ECH-103)": {
            "Experiment-1": [20, 19],
            "Experiment-2": [20, 17],
            "Experiment-3": [20, 18.5]
        }
    }
}
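The problem is that the list you provided is flat, with nothing to indicate where each entry sits in the hierarchy you want.
One approach to consider: if the entries that represent parent objects (Mathematics etc.) are the only ones containing parentheses, you can iterate over the list and use string matching or a regular expression to identify each parent, create a top-level object for it, and then add the next entry as a key whose value is a list of the two entries that follow.
This assumes there are always exactly two values after each child name. If the number of values is not fixed but they are always numeric, you can use a regex to decide whether an item is numeric or not, and keep appending items to the value list until you hit another non-numeric item, which is then treated as the next sibling in the hierarchy.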
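The iteration described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `build_hierarchy` and both regular expressions are hypothetical, and it assumes parents are the only entries with a parenthesised code and that scores are plain non-negative numbers.

```python
import re

# Assumption: parent entries are the only ones ending in a parenthesised code.
PARENT_RE = re.compile(r"\(.+\)\s*$")
# Assumption: scores are plain non-negative integers or decimals.
NUMBER_RE = re.compile(r"^\d+(\.\d+)?$")


def build_hierarchy(items):
    """Turn a flat list of strings into {parent: {child: [numbers...]}}."""
    result = {}
    children = None  # dict of the parent currently being filled
    values = None    # list of the child currently being filled
    for item in items:
        if PARENT_RE.search(item):
            # New parent: start a fresh child dict.
            children = result.setdefault(item, {})
            values = None
        elif NUMBER_RE.match(item):
            # Numeric item: belongs to the most recent child.
            if values is None:
                raise ValueError(f"number {item!r} has no preceding child name")
            values.append(float(item))
        else:
            # Non-numeric, non-parent item: the next sibling.
            if children is None:
                raise ValueError(f"child {item!r} has no preceding parent")
            values = children.setdefault(item, [])
    return result


sample = [
    "Mathematics-2 (21SMT-125)",
    "Mid-Semester Test-1", "40", "23.5",
    "Disruptive Technologies - 2 (21ECH-103)",
    "Experiment-1", "20", "19",
]
print(build_hierarchy(sample))
```

Because values are appended until the next non-numeric item appears, the same loop handles children with one, two, or more scores.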
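I would step back from this approach and check whether the information can be parsed more cleverly in bs4: try doing the scraping in more steps, reaching the subject first, then the semester test/experiment, and third the grades.
If that is not possible and the data returned from bs4 cannot be changed, the only thing you can do is try to determine whether each string is a subject name, a test name, or a grade/score, and work through the list with some while loops. Subject names seem to have a special code at the end, which a regexp can use to distinguish them from test/experiment names, and grades/scores can always be parsed as numbers.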
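For data exactly like yours (where a string containing ( denotes a top-level entry, and every entry always has two numbers), you can come up with a state machine like the one below. But as I commented, you should really improve your parsing code instead, since the HTML you are scraping the data from is most likely already structured.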
def is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


def parse_inp(inp):
    flat_map = {}
    stack = []  # current path of names, e.g. [subject, test]
    x = 0
    while x < len(inp):
        if "(" in inp[x]:
            # A string containing "(" starts a new top-level entry.
            stack.clear()
        if is_float(inp[x]) and is_float(inp[x + 1]):
            # Two consecutive numbers: store them under the current path.
            flat_map[tuple(stack)] = (float(inp[x]), float(inp[x + 1]))
            x += 2
            stack.pop(-1)
            continue
        stack.append(inp[x])
        x += 1
    return flat_map


def nest_flat_map(flat_map):
    root = {}
    for key_path, values_list in flat_map.items():
        dst = root
        # Walk (and create) the intermediate dicts along the path.
        for key in key_path[:-1]:
            dst = dst.setdefault(key, {})
        dst[key_path[-1]] = values_list
    return root


inp = [
    # ... data from original post
]
nested_map = nest_flat_map(parse_inp(inp))
print(nested_map)
This outputs the expected result:
{
    "Mathematics-2 (21SMT-125)": {
        "Mid-Semester Test-1": (40.0, 23.5),
        "Mid-Semester Test-2": (40.0, 34.0),
    },
    "Disruptive Technologies - 2 (21ECH-103)": {
        "Experiment-1": (20.0, 19.0),
        "Experiment-2": (20.0, 17.0),
        "Experiment-3": (20.0, 18.5),
    },
}
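You can use a fuzzy form of itertools.groupby to find the groups in this list of strings. This assumes that each class name ends with the pattern "(classref-section)" and is followed by test or assignment names, each of which is followed by one or more numeric scores.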
source_data = [
"Mathematics-2 (21SMT-125)",
"Mid-Semester Test-1",
"40",
"23.5",
"Mid-Semester Test-2",
"40",
"34",
"Disruptive Technologies - 2 (21ECH-103)",
"Experiment-1",
"20",
"19",
"Experiment-2",
"20",
"17",
"Experiment-3",
"20",
"18.5",
]
from collections import defaultdict
import itertools
import json
import re

# Matches a trailing class code such as "(21SMT-125)".
class_id_pattern = re.compile(r"\([A-Z0-9]+-\d+\)")


def is_class_reference(s):
    return bool(class_id_pattern.match(s.rsplit(" ", 1)[-1]))


def group_by_class(s):
    # Stateful key: remember the most recent class name seen.
    if is_class_reference(s):
        group_by_class.current_class = s
    return group_by_class.current_class


group_by_class.current_class = ""


def convert_numeric(s):
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return None


def is_score(s):
    return convert_numeric(s) is not None


def is_test(s):
    return not is_score(s)


def group_by_test(s):
    # Stateful key: remember the most recent test name seen.
    if is_test(s):
        group_by_test.current_test = s
    return group_by_test.current_test


group_by_test.current_test = ""

accum = defaultdict(lambda: defaultdict(list))
for class_name, class_name_and_tests in itertools.groupby(source_data, key=group_by_class):
    # The first item in each group is the class name itself.
    class_name, *tests = class_name_and_tests
    for test_name, test_name_and_scores in itertools.groupby(tests, key=group_by_test):
        test_name, *scores = test_name_and_scores
        accum[class_name][test_name].extend(convert_numeric(s) for s in scores)

print(json.dumps(accum, indent=4))
Prints:
{
    "Mathematics-2 (21SMT-125)": {
        "Mid-Semester Test-1": [
            40,
            23.5
        ],
        "Mid-Semester Test-2": [
            40,
            34
        ]
    },
    "Disruptive Technologies - 2 (21ECH-103)": {
        "Experiment-1": [
            20,
            19
        ],
        "Experiment-2": [
            20,
            17
        ],
        "Experiment-3": [
            20,
            18.5
        ]
    }
}
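To isolate the idea behind the fuzzy groupby, here is a smaller sketch of the same technique: a key function that carries state, so that every item is keyed by the most recent "header" seen. The helper `make_sticky_key` and the sample data are hypothetical, and it uses a closure in place of the function attributes used in the answer.

```python
import itertools


def make_sticky_key(is_header):
    """Return a groupby key function that remembers the last header seen."""
    current = None

    def key(item):
        nonlocal current
        if is_header(item):
            current = item
        return current

    return key


lines = ["[fruit]", "apple", "pear", "[veg]", "leek"]
key = make_sticky_key(lambda s: s.startswith("["))

groups = {}
for header, members in itertools.groupby(lines, key=key):
    # The first member of each group is the header line itself.
    _header_line, *rest = members
    groups[header] = rest

print(groups)
```

Since itertools.groupby calls the key function once per item, in order, the remembered header stays in effect until the next header arrives. That is what lets a flat list split into variable-length groups.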
Read more about fuzzy groupby in my blog post: https://thingspython.wordpress.com/2020/11/11/fuzzy-groupby-unusual-restaurant-part-ii/