Parsing a list containing HTML-like elements into nested JSON with Python



I'm not good at converting parts of a list into nested JSON and would appreciate some guidance. I have a list containing data like this:

[
"<h5>1",
"<h6>Type of Care|",
"<h6>SA|",
"<h6>Type of Care|",
"<h6>Substance use treatment|",
"<h6>DT Detoxification |",
"<h6>HH Transitional housing, halfway house, or sober home|",
"<h6>SUMH |",
"<h6>Treatment for co-occurring serious mental health  illness/serious emotional disturbance and substance  use disorders|",
"",
"<h5>2",
"<h6>Telemedicine|",
"<h6>TELE|",
"<h6>Telemedicine|",
"<h6>Telemedicine/telehealth|",
"",
"<h5>3 |",
"",
"<h6>Service Settings (e.g., Outpatient, |",
"<h6>Residential, Inpatient, etc.)|",
"<h6>HI|",
"<h6>Service Settings (e.g., Outpatient, |",
"<h6>Residential, Inpatient, etc.)|",
"<h6>Hospital inpatient |",
"<h6>OP Outpatient |",
"<h6>RES Residential|",
"<h6>HID Hospital inpatient detoxification|",
"<h6>HIT Hospital inpatient treatment|",
"<h6>OD Outpatient detoxification|",
"<h6>ODT Outpatient day treatment or partial hospitalization|",
"<h6>OIT Intensive outpatient treatment|",
"<h6>OMB |",
"<h6>Outpatient methadone/buprenorphine or  naltrexone treatment|",
"<h6>ORT Regular outpatient treatment|",
"<h6>RD Residential detoxification|",
"<h6>RL Long-term residential|",
"<h6>RS Short-term residential|"]

I want to first remove all records in the list that have no content. Then I want to turn the records carrying a tag like "<h5>" into keys and group the records carrying "<h6>" under them. The JSON output should look like this:

"codekey": [
    {
        "category": [
            {
                "key": 1,
                "value": "Type of Care"
            }
        ],
        "codes": [
            {
                "key": "SA",
                "value": "Substance use treatment"
            },
            {
                "key": "DT",
                "value": "Detoxification"
            },
            {
                "key": "HH",
                "value": "Transitional housing, halfway house, or sober home"
            },
            {
                "key": "SUMH",
                "value": "Treatment for co-occurring serious mental health | illness/serious emotional disturbance and substance | use disorders|"
            }
        ]
    },
    {
        "category": [
            {
                "key": 2,
                "value": "Telemedicine"
            }
        ],
        "codes": [
            {
                "key": "TELE",
                "value": "TelemedicineTelemedicine/telehealth"
            }
        ]
    }
], etc....

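I think I need a loop, but I'm stuck on how to create the key/value relationship. I suspect I also need a regular expression, but I'm just not good, conceptually, at transforming data into the desired output in Python. Are there any training resources I could look up, or any initial suggestions on how to get started? Thank you!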

Assuming your format stays the same, here is a flexible, configurable solution:

class Separator:
    def __init__(self, data, title, sep, splitter):
        self.data = data          # the full list of records
        self.title = title        # marker that starts a group, "<h5>" in your case
        self.sep = sep            # record at which the current group is appended to res
        self.splitter = splitter  # separator between key and value, "|"
        self.res = []             # final result
        self.tempDict = {}        # group currently being built

    def clearString(self, string, *args):
        for arg in args:
            string = string.replace(arg, '')  # remove every given substring
        return string.strip()

    def updateDict(self, val):
        if val == self.sep:
            # Groups are only appended here, so a group with no separator after it is not added.
            self.res.append(self.tempDict)
            self.tempDict = {}  # start a fresh group
        else:
            try:
                if self.title in val:  # a "<h5>" record starts a category
                    # the record right after the title holds the category name
                    next_val = self.data[self.data.index(val) + 1]
                    self.tempDict["category"] = [{
                        "key": self.clearString(val, self.title, self.splitter),
                        "value": self.clearString(next_val, '<h6>', '|'),
                    }]
                elif self.tempDict["category"][0]["value"] != self.clearString(val, '<h6>', '|'):
                    # any "<h6>" record that is not the category name is a code
                    val = self.clearString(val, "<h6>").split("|")
                    if "codes" not in self.tempDict:
                        self.tempDict["codes"] = []  # create the list on first use
                    self.tempDict["codes"].append({"key": val[0], "value": val[1]})
            except (KeyError, IndexError):  # skip records that do not fit the expected layout
                pass
        return self.res


separator = Separator(data, '<h5>', '', '|')
for val in data:
    res = separator.updateDict(val)
print(res)

Output for the sample input:

[
    {
        'category': [{'key': '1', 'value': 'Type of Care'}],
        'codes': [
            {'key': 'SA', 'value': 'Substance use treatment'},
            {'key': 'DT', 'value': 'Detoxification '},
            {
                'key': 'HH',
                'value': 'Transitional housing, halfway house, or sober home',
            },
            {
                'key': 'SUMH',
                'value': 'Treatment for co-occurring serious mental health ',
            },
        ],
    },
    {
        'category': [{'key': '2', 'value': 'Telemedicine'}],
        'codes': [
            {'key': 'TELE', 'value': 'TelemedicineTelemedicine/telehealth'},
        ],
    },
]
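
As a starting point for the loop and regular expression mentioned in the question, here is a minimal sketch of a different approach; it is not the class above. It drops empty records, opens a new group at every "<h5>" record, and uses a regular expression to split "<h6>" records such as "DT Detoxification" into key and value. The names group_by_header, split_code, to_codekey and sample are made up for illustration. It assumes the first "<h6>" record after a header is the category name and that a code is two or more capital letters followed by its description on the same record. Codes whose description sits on a later record (for example "SA" or "SUMH") are not handled here and would need extra pairing logic.

```python
import json
import re

# Assumed code layout: 2+ capital letters, whitespace, then the description.
CODE_RE = re.compile(r"^([A-Z]{2,})\s+(.+)$")


def group_by_header(records, header="<h5>", item="<h6>"):
    """Drop empty records and collect item lines under the preceding header."""
    groups = []
    for record in records:
        text = record.strip()
        if not text:
            continue  # skip records with no content
        if text.startswith(header):
            # a header record opens a new group
            groups.append({"key": text[len(header):].strip(" |"), "lines": []})
        elif text.startswith(item) and groups:
            # remove the tag and the trailing "|" separator
            groups[-1]["lines"].append(text[len(item):].strip(" |"))
    return groups


def split_code(line):
    """Return {"key", "value"} for a 'CODE description' line, else None."""
    match = CODE_RE.match(line)
    if match is None:
        return None
    return {"key": match.group(1), "value": match.group(2).strip()}


def to_codekey(records):
    result = []
    for group in group_by_header(records):
        lines = group["lines"]
        title = lines[0] if lines else ""  # assumption: first item is the category name
        codes = [code for code in map(split_code, lines[1:]) if code]
        result.append({
            "category": [{"key": group["key"], "value": title}],
            "codes": codes,
        })
    return {"codekey": result}


# Shortened, hypothetical sample in the same shape as the list in the question.
sample = [
    "<h5>1",
    "<h6>Type of Care|",
    "<h6>DT Detoxification |",
    "",
    "<h5>2",
    "<h6>Telemedicine|",
    "<h6>OP Outpatient |",
]

result = to_codekey(sample)
print(json.dumps(result, indent=2))
```

Keeping the grouping step separate from the key/value split makes it easier to adjust one without touching the other, for instance when a category name spans two records.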
