Parsing XML files with Python and outputting JSON

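I am very new to Python. I am currently trying to parse XML files, extract some information from them, and print it as JSON.

I have managed to parse the XML files, but I cannot print the results as JSON. In addition, my printjson function does not iterate over all the results; it only prints once. The parse function runs and loops over all the input files, but printjson does not. My code is below: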

from xml.dom import minidom
import os
import json

#input multiple files
def get_files(d):
    return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d,f))]

#parse xml
def parse(files):
    for xml_file in files:

        #identify all xml files
        tree = minidom.parse(xml_file)
        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)
        return NCT_ID,brief_title,official_title

#print result in json
def printjson(results):
    for result in results:
        output_json = json.dumps(result)
        print(output_json)

printjson(parse(get_files('my files path')))

Output when I run the file:

"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"

Expected output:

{
    "NCT ID" : "NCT00571389",
    "brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
    "official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}

The sample XML files I am using come from the COVID-19 clinical trials dataset, which can be found on Kaggle.

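The problem is that your parse function returns too early: it returns right after extracting the details from the first XML file. Instead, you should return a list of dictionaries holding this information, so that each item in the list represents a different file and each dictionary contains the required details for the corresponding XML file.

Here is the updated code: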

def parse(files):
    xml_information = []
    for xml_file in files:

        #identify all xml files
        tree = minidom.parse(xml_file)
        #Get some details (store the raw values; the labels become the dictionary keys)
        NCT_ID = tree.getElementsByTagName("nct_id")[0].firstChild.data
        brief_title = tree.getElementsByTagName("brief_title")[0].firstChild.data
        official_title = tree.getElementsByTagName("official_title")[0].firstChild.data
        xml_information.append({"NCT ID": NCT_ID, "brief title": brief_title, "official title": official_title})
    #return only after every file has been processed
    return xml_information

def printresults(results):
    for result in results:
        print(result)

printresults(parse(get_files('my files path')))

If you absolutely want the output in JSON format, you can similarly call json.dumps on each dictionary.
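As a minimal sketch of that step, assuming each result is a plain dictionary of strings: the record below is a shortened, illustrative stand-in for one parsed file, and to_json_strings is a hypothetical helper name, not something from the question. Passing indent=4 to json.dumps produces the multi-line layout shown in the expected output.

```python
import json

# Illustrative stand-in for one parsed XML file (titles shortened).
results = [
    {
        "NCT ID": "NCT00571389",
        "brief title": "Isolation and Culture of Immune Cells",
        "official title": "A Study to Facilitate Development of an Ex-Vivo Device Platform",
    }
]

def to_json_strings(results):
    """Serialize each dictionary to an indented JSON string."""
    return [json.dumps(result, indent=4) for result in results]

for text in to_json_strings(results):
    print(text)
```

Because json.dumps receives a dictionary rather than a preformatted string, the keys and values come out as proper JSON members.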

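Note: if you have a lot of XML files, I suggest using yield in the function instead of returning the whole list of dictionaries. The results are then produced one at a time, which keeps memory use low and lets printing start before every file has been parsed.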
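A generator version might look roughly like the sketch below. The names iter_parse, first_text and print_json are hypothetical, and the None fallback for a missing or empty tag is an added assumption rather than part of the original code.

```python
import json
from xml.dom import minidom

def first_text(tree, tag):
    """Return the text of the first <tag> element, or None if missing or empty."""
    nodes = tree.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return None

def iter_parse(files):
    """Yield one dictionary per XML file instead of building a full list."""
    for xml_file in files:
        tree = minidom.parse(xml_file)
        yield {
            "NCT ID": first_text(tree, "nct_id"),
            "brief title": first_text(tree, "brief_title"),
            "official title": first_text(tree, "official_title"),
        }

def print_json(results):
    """Print each result as an indented JSON object."""
    for result in results:
        print(json.dumps(result, indent=4))
```

It would be called the same way as before, for example print_json(iter_parse(get_files('my files path'))), with get_files taken from the question.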

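I don't know much about XML, but you can build the JSON from a dictionary, because json.dumps simply serializes a Python object into a JSON string. Something like this: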


def parse(files):
    records = []
    for xml_file in files:

        #identify all xml files
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson.setdefault("NCT ID",tree.getElementsByTagName("nct_id")[0].firstChild.data)
        dicJson.setdefault("brief title",tree.getElementsByTagName("brief_title")[0].firstChild.data)
        dicJson.setdefault("official title", tree.getElementsByTagName("official_title")[0].firstChild.data)
        records.append(dicJson)
    #return after the loop so that every file is included
    return records

And the printJson function:

def printJson(results):
    # json.dumps returns the data as a JSON string; to write it to a file, use json.dump instead.
    print(json.dumps(results))
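The comment in printJson raises the question of writing to a JSON file. One possible sketch uses json.dump, which takes an open file object; write_json and its path argument are hypothetical names chosen for illustration.

```python
import json

def write_json(results, path):
    """Write the results to a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII characters in titles readable.
        json.dump(results, handle, indent=4, ensure_ascii=False)
```

For example, write_json(parse(get_files('my files path')), 'output.json') would save whatever parse returns, provided it is a dictionary or a list of dictionaries.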
