Parsing XML files with Python and outputting JSON



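I'm new to Python. I'm currently trying to parse XML files, extract some information from them, and print it as JSON.

I've managed to parse the XML files, but I can't get them printed as JSON. Also, my printjson function doesn't go through all the results; it only prints once. The parse function runs and is given every input file, but printjson only shows the first one. My code is below: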

from xml.dom import minidom
import os
import json
#input multiple files
def get_files(d):
    return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d,f))]
#parse xml
def parse(files):
    for xml_file in files:

        #identify all xml files
        tree = minidom.parse(xml_file)
        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)
        return NCT_ID,brief_title,official_title
#print result in json
def printjson(results):
    for result in results:
        output_json = json.dumps(result)
        print(output_json)
printjson(parse(get_files('my files path')))

Output when I run the file

"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"

Expected output

{
    "NCT ID" : "NCT00571389",
    "brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
    "official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}

The sample XML files I'm using come from the COVID-19 Clinical Trials dataset, which can be found on Kaggle.

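The problem is that your parse function returns too early: it returns right after getting the details from the first XML file. Instead, you should return a list of dictionaries that stores this information, so that each item in the list represents a different file and each dictionary holds the fields you need from the corresponding XML file.

Here is the updated code: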

def parse(files):
    xml_information = []
    for xml_file in files:

        #identify all xml files
        tree = minidom.parse(xml_file)
        #Get some details; store the raw values, the labels become the dict keys
        NCT_ID = tree.getElementsByTagName("nct_id")[0].firstChild.data
        brief_title = tree.getElementsByTagName("brief_title")[0].firstChild.data
        official_title = tree.getElementsByTagName("official_title")[0].firstChild.data
        xml_information.append({"NCT ID": NCT_ID, "brief title": brief_title, "official title": official_title})
    #return only after every file has been processed
    return xml_information
def printresults(results):
    for result in results:
        print(result)
#get_files is the function from the question
printresults(parse(get_files('my files path')))

If you absolutely want the output in JSON format, you can likewise call json.dumps on each dictionary.
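
As a minimal sketch of that idea, the snippet below serializes each dictionary separately. The `records` list stands in for whatever parse returns (its values are copied from the question's output), and `to_json_lines` is a hypothetical helper name, not something from the original code.

```python
import json

# Stand-in for the list returned by parse(): one dict per XML file.
# The values are taken from the output shown in the question.
records = [
    {
        "NCT ID": "NCT00571389",
        "brief title": "Isolation and Culture of Immune Cells and Circulating "
                       "Tumor Cells From Peripheral Blood and Leukapheresis Products",
    },
]

def to_json_lines(results):
    """Serialize each record to its own pretty-printed JSON string."""
    return [json.dumps(result, indent=4) for result in results]

for text in to_json_lines(records):
    print(text)
```

Passing `indent=4` is only there to get the multi-line layout shown in the expected output; without it, json.dumps produces a single line per record.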

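Note: if you have a lot of XML files, I'd suggest using yield in the function instead of returning the whole list of dictionaries. That keeps memory use down, and printing can start before every file has been parsed.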
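
A generator version might look roughly like the sketch below. The names `text_of`, `parse_lazy` and `print_json` are made up for illustration, and the tag names are the ones used in the question.

```python
import json
from xml.dom import minidom

def text_of(tree, tag):
    """Return the text of the first <tag> element in the document."""
    return tree.getElementsByTagName(tag)[0].firstChild.data

def parse_lazy(files):
    """Yield one dict per XML file instead of building the whole list."""
    for xml_file in files:
        tree = minidom.parse(xml_file)
        yield {
            "NCT ID": text_of(tree, "nct_id"),
            "brief title": text_of(tree, "brief_title"),
            "official title": text_of(tree, "official_title"),
        }

def print_json(results):
    """Print each record as JSON as soon as it is produced."""
    for result in results:
        print(json.dumps(result, indent=4))
```

It would be called the same way as before, for example `print_json(parse_lazy(get_files('my files path')))`, with get_files taken from the question. Note that a generator can only be iterated once, so wrap it in `list(...)` if the results are needed again later.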

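I don't know much about XML, but you can build the JSON from a dictionary, because the dumps function only serializes a Python object (such as a dict) to a JSON string. Something like this: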


def parse(files):
    results = []
    for xml_file in files:

        #identify all xml files
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson["NCT ID"] = tree.getElementsByTagName("nct_id")[0].firstChild.data
        dicJson["brief title"] = tree.getElementsByTagName("brief_title")[0].firstChild.data
        dicJson["official title"] = tree.getElementsByTagName("official_title")[0].firstChild.data
        results.append(dicJson)
    #return after the loop so every file is included
    return results

And the printJson function:

def printJson(results):
    # json.dumps returns the data as a JSON string; to write it to a file, use json.dump instead.
    print(json.dumps(results))
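
For writing the result to a file rather than printing it, a minimal sketch using json.dump could look like this. The function name `write_json` and the file name in the usage note are assumptions for illustration only.

```python
import json

def write_json(results, path):
    """Write the parsed records to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps any non-ASCII characters in titles readable
        json.dump(results, fh, indent=4, ensure_ascii=False)
```

For example, `write_json(parse(get_files('my files path')), 'trials.json')` would save everything parse returns into one file.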
