How to parse an XML document without closing tags (Python)



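I am trying to read an XML document that appears to have no closing tags. I did not create this XML document; I download it from the following location: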

import ftplib
import xml.etree.ElementTree as et
filename = 'FBOFeed20170509'
ftp = ftplib.FTP('ftp.fbo.gov')
ftp.login(user = '', passwd = '')
localfile = open(filename, 'wb')
ftp.retrbinary('RETR ' + filename, localfile.write, 1024)
ftp.quit()
localfile.close()
tree = et.parse(filename)
for node in tree.iter():
    print (node.tag, node.attrib)

Here is the error I get:

Traceback (most recent call last):
  File "", line 18, in <module>
tree = et.parse(filename)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1184, in parse
tree.parse(source, parser)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 596, in parse
self._root = parser._parse_whole(source)
  File "<string>", line None
xml.etree.ElementTree.ParseError: mismatched tag: line 24, column 2

So I opened the file in a text editor and saw that there are no closing tags. Here are the first 24 lines:

<PRESOL>
<DATE>0509
<YEAR>17
<AGENCY>Department of the Air Force
<OFFICE>Air Education and Training Command
<LOCATION>Luke AFB Contracting Squadron
<ZIP>85309
<CLASSCOD>Z
<NAICS>238320
<OFFADD>14100 W. Eagle Street Luke AFB AZ 85309
<SUBJECT>Painting IDIQ Luke AFB
<SOLNBR>FA488717R0005
<CONTACT>Justin A Cheeks, Phone 8566232747, Email justin.cheeks@us.af.mil
<DESC>The 56th ...
<LINK>
<URL>https://www.fbo.gov/spg/USAF/AETC/LukeAFBCS/FA488717R0005/listing.html
<DESC>Link To Document
<SETASIDE>Service-Disabled Veteran-Owned Small Business
<POPCOUNTRY>US
<POPZIP>85309
<POPADDRESS>14100 W Eagle Street (B-26)
Luke AFB, AZ
</PRESOL>

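I'm guessing the error is related to the fact that PRESOL is closed with /PRESOL while none of the other tags are closed. This is a simple entry; some other entries contain various HTML tags in the DESC or CONTACT sections, so I'm not sure how to write something that closes the tags before parsing. For example, here is another part of the file: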

<CONTACT>Tammy Davis
Tammy.Davis6@va.gov
<a href="mailto:tammy.davis6@va.gov">Tammy.Davis6@va.gov</a>
<DESC>The purpose...

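I'm not sure whether every entry has the same tags, or whether the tags always appear in the same order. Is this even XML? Should I be using a different Python library here?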

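https://github.com/presidential-innovation-fellows/fbo-parser parses the daily FBO files into JSON, adding closing tags to the fields within each notice type. I used it and then converted the result to an XML file to import the data into my database.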
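The JSON-to-XML step mentioned above could look roughly like the sketch below. The JSON shape used here (a list of objects, each keyed by notice type) is only an assumption for illustration and may not match what fbo-parser actually emits; the function name `notices_to_xml` and the `NOTICES` root element are likewise made up.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input: the real fbo-parser output may be shaped differently.
notices_json = '[{"PRESOL": {"DATE": "0509", "YEAR": "17", "SOLNBR": "FA488717R0005"}}]'

def notices_to_xml(notices):
    """Build one XML tree with a child element per notice and per field."""
    root = ET.Element("NOTICES")  # made-up root element name
    for notice in notices:
        for notice_type, fields in notice.items():
            node = ET.SubElement(root, notice_type)
            for name, value in fields.items():
                # ElementTree escapes &, < and > in text when serializing
                ET.SubElement(node, name).text = str(value)
    return root

root = notices_to_xml(json.loads(notices_json))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Building the tree with ElementTree rather than concatenating strings means every element is closed and special characters are escaped automatically.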

I recently ran into the same problem. Below is the snippet I used to save all "PRESOL" (i.e. pre-solicitation notice) records to .csv files:

import csv
import glob

tags = ["DATE","YEAR","AGENCY","OFFICE","LOCATION","ZIP",
    "CLASSCOD","OFFADD","SUBJECT","SOLNBR","RESPDATE","ARCHDATE",
    "CONTACT","CONTACTDESC","LINK","URL","URLDESC"]
for y in range(2005, 2019):
    outfile = 'my_dir/FBO' + str(y) + '.csv'  # output file
    yearterm = 'FBOFeed' + str(y) + '*'  # matches every daily feed file of that year
    with open(outfile, 'w', newline='') as g:
        writer = csv.writer(g)
        writer.writerow(tags)
        for feedfile in glob.glob(yearterm):
            inpresol = 0
            oldtag = ''  # tag seen on the previous line
            with open(feedfile, 'r', encoding="latin_1") as f:
                for line in f:
                    tag = line[line.find("<")+1:line.find(">")]  # find the line tag
                    if tag == "DESC":  # there are multiple "DESC" tags: prefix with the previous tag
                        dicttag = oldtag + tag
                    else:
                        dicttag = tag
                    if "<PRESOL>" in line:  # start of the record: initialize the dictionary
                        d = {x: '' for x in tags}
                        inpresol = 1
                        continue
                    elif "</PRESOL>" in line:  # end of the record: write one CSV row
                        writer.writerow([d[x] for x in tags])
                        inpresol = 0
                        continue
                    if inpresol == 1:  # store the results
                        # Note: lines without a tag (continuations of a multi-line
                        # field) end up under a key that is not in `tags`, so they
                        # are not written to the CSV.
                        tagged_tag = "<" + tag + ">"
                        newline = line.replace(tagged_tag, "").strip()
                        d[dicttag] = newline
                        oldtag = tag

I know it's not very "pythonic", but it works well and stores the records in one .csv file per year.
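For fields that span several lines or contain inline HTML (like the CONTACT example in the question), a line-based reader can append untagged lines to the field that is currently open. The following is only a sketch of that idea, not the code above: `parse_presol` and `TAG_RE` are made-up names, and it assumes that a field tag is an all-uppercase name at the very start of a line, so uppercase HTML tags at the start of a line would be misread as fields.

```python
import re

# Assumption: field tags are all-caps names at the start of a line.
TAG_RE = re.compile(r"^<([A-Z]+)>(.*)$")

def parse_presol(lines):
    """Yield one dict per <PRESOL>...</PRESOL> record."""
    record = None   # None while we are outside a record
    key = None      # field currently being filled
    prev_tag = ""   # last non-DESC tag, used to name DESC fields
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "<PRESOL>":
            record, key, prev_tag = {}, None, ""
            continue
        if line.strip() == "</PRESOL>":
            if record is not None:
                yield record
            record = None
            continue
        if record is None:
            continue  # ignore anything outside a PRESOL record
        m = TAG_RE.match(line)
        if m:
            tag, value = m.groups()
            # DESC appears more than once, so prefix it: CONTACTDESC, URLDESC
            key = prev_tag + tag if tag == "DESC" else tag
            if tag != "DESC":
                prev_tag = tag
            record[key] = value
        elif key is not None:
            # Untagged text or inline HTML: continuation of the open field
            record[key] += "\n" + line

sample = """<PRESOL>
<DATE>0509
<CONTACT>Tammy Davis
Tammy.Davis6@va.gov
<a href="mailto:tammy.davis6@va.gov">Tammy.Davis6@va.gov</a>
<DESC>The purpose...
<LINK>
<URL>https://example.invalid/listing.html
<DESC>Link To Document
</PRESOL>
"""

records = list(parse_presol(sample.splitlines()))
print(records[0]["CONTACTDESC"])
```

Each yielded dict can then be passed to `csv.DictWriter` or any other writer, with missing fields filled in as needed.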

You can use regular expressions to create the closing tags, like this:

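import re

# `text` holds the contents of the feed file
text = re.sub(r"<(\w+)>\s*([^<]+|)", r"<\1>\2</\1>", text)  # <TAG>value -> <TAG>value</TAG>
text = re.sub(r"<PRESOL>\s*</PRESOL>", "<PRESOL>", text)  # undo the premature close of <PRESOL>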
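To show the regex idea end to end, here is a sketch run on a shortened copy of the record from the question. It is a variation rather than the exact snippet above: the helper name `close_tags` and the `root` wrapper element are made up, the replacement is a function so that characters such as `&` can be escaped, and it assumes the field values contain no inline HTML (an unclosed `<br>`, for instance, would still break parsing).

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

sample = """<PRESOL>
<DATE>0509
<YEAR>17
<AGENCY>Department of the Air Force
<LINK>
<URL>https://www.fbo.gov/spg/USAF/AETC/LukeAFBCS/FA488717R0005/listing.html
<DESC>Link To Document
</PRESOL>
"""

def close_tags(text):
    """Add a closing tag after every <TAG>value pair."""
    def repl(m):
        tag, value = m.group(1), m.group(2).strip()
        return "<%s>%s</%s>" % (tag, escape(value), tag)
    # Every opening tag gets closed right after its text, PRESOL included ...
    text = re.sub(r"<(\w+)>\s*([^<]*)", repl, text)
    # ... so remove the premature </PRESOL>; the real one is already in the file.
    return re.sub(r"<PRESOL>\s*</PRESOL>", "<PRESOL>", text)

# A feed file holds many records, so wrap them in a single root element.
root = ET.fromstring("<root>" + close_tags(sample) + "</root>")
presol = root.find("PRESOL")
for child in presol:
    print(child.tag, child.text)
```

After this step the records can be walked with the usual ElementTree calls such as `iter()` and `findtext()`.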
