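I want to get all the elements that have no pair. This is a list of XML tags, read from top to bottom, with the angle brackets removed. I want to find the pairs (for example, the opening tag note and the closing tag /note), remove them from the list, and be left with the tags that have no pair.
How do you go through the list, compare each tag with all the other tags, and say: aha, I found another "note" tag that starts with a forward slash?
Thanks.
Is there some other, better way to find the unmatched tags?
PS: I do want to preserve the order of the list and, if possible, use equality when comparing one tag with another tag in the list. Using the "in" operator will not work, because if the tag name is a single letter such as "a", the search returns every element that contains "a" rather than only the elements that match "a" exactly.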
tags = ['note', 'to', 'bbb', 'bbb', 'firstname', '/firstname', 'lastname', '/lastname', 'from', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', '/from', '/to', 'elephant', 'll', 'from', '/from', 'a1', 'img', 'a2', 'from', 'from', '/from', '/from', '/a2', '/img', '/a1', 'heading', '/heading', 'body', '/body', '/note']
You can build a set containing all of the closing tags, and then use that set to filter the list.
>>> closing = set([t for t in tags if t.startswith("/")])
>>> [t for t in tags if "/" + t not in closing and t not in closing]
['bbb', 'bbb', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', 'elephant', 'll']
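Note, however, that this does not really respect tag "pairs"; it only checks whether a "closing" variant of the same tag exists anywhere in the list. For example, given tags = ["a", "a", "/a"] or tags = ["a", "/a", "a"], it removes both instances of a from the list.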
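If each closing tag should cancel exactly one opening tag, a variant along the following lines might work. This is only a minimal sketch, not part of the answer above: the function name unmatched_tags is made up for illustration, and it assumes a closing tag pairs with the most recent unmatched opening tag of the same name, without checking that pairs are properly nested. Tag names are used as dictionary keys, so they are compared by equality and a tag named "a" never matches "a1".

```python
from collections import defaultdict


def unmatched_tags(tags):
    """Return the tags that have no partner, in their original order.

    Hypothetical helper: pairs each closing tag with the most recent
    still-open tag of the same name and ignores nesting.
    """
    open_positions = defaultdict(list)  # tag name -> indices of unmatched openers
    matched = set()                     # indices of tags that found a partner

    for i, tag in enumerate(tags):
        if tag.startswith("/"):
            name = tag[1:]
            if open_positions[name]:
                # Pair this closing tag with the latest unmatched opener.
                matched.add(open_positions[name].pop())
                matched.add(i)
        else:
            open_positions[tag].append(i)

    return [tag for i, tag in enumerate(tags) if i not in matched]


print(unmatched_tags(["a", "a", "/a"]))  # ['a']
print(unmatched_tags(["a", "/a", "a"]))  # ['a']
```

With the two problem inputs mentioned above, one a survives in each case instead of both being dropped.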
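The first part of the program extracts all the tags from the document. Notice that this is essentially the problem of finding unbalanced brackets. It can be solved by treating the list as a stack and iterating through it, checking along the way which tags are out of place.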
import re

def clean_attr(attr):
    # Strip attributes, so that '<book id="bk101">' becomes '<book>'.
    attr_list = re.split(r'\s+', attr)
    if len(attr_list) == 1:
        return attr
    else:
        return attr_list[0] + '>'

line="""
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.
</book>
</catalog>
"""

# Opening tags (possibly with attributes), closing tags, and both in document order.
attr_open = re.findall(r'<[\w+\s="]+>', line)
attr_closed = re.findall(r'</\w+>', line)
all_attrs = re.findall(r'<[\w+\s="]+>|</\w+>', line)
all_attrs_cleaned = [clean_attr(attr) for attr in all_attrs]
# print(all_attrs_cleaned)

list_as_stack = []  # opening tags still waiting for a closing tag
not_closed = []     # closing tags that did not match the top of the stack

for an_attr in all_attrs_cleaned:
    if not an_attr.startswith('</'):
        list_as_stack.append(an_attr)
    elif (list_as_stack and
          re.search(r'\w+', list_as_stack[-1]).group(0) == re.search(r'\w+', an_attr).group(0)):
        # The closing tag matches the most recent opening tag.
        list_as_stack.pop()
    else:
        # Mismatched closing tag, or nothing left on the stack to close.
        not_closed.append(an_attr)

print(list_as_stack)
print(not_closed)
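In the program above, the first list tells you which tags were never closed, and the second list tells you which closing tags did not match an opening tag.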
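As a possible alternative to extracting tags with regular expressions, the same stack idea can be sketched on top of the standard library's html.parser.HTMLParser. This is an assumption-laden sketch rather than the method used above: the class name TagBalanceChecker and the helper check are invented for illustration, the recovery rule (pop everything above the matching opener and report it as unclosed) differs from the program above, and HTMLParser lowercases tag names, which is not strictly correct for case-sensitive XML.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper that tracks open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.open_stack = []  # tags opened and not yet closed
        self.unclosed = []    # tags skipped over by a later closing tag
        self.unopened = []    # closing tags with no opener on the stack

    def handle_starttag(self, tag, attrs):
        self.open_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_stack:  # list membership compares by equality
            # Pop until the matching opener; anything above it was never closed.
            while True:
                top = self.open_stack.pop()
                if top == tag:
                    break
                self.unclosed.append(top)
        else:
            self.unopened.append(tag)


def check(markup):
    """Return (tags never closed, closing tags never opened)."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.unclosed + checker.open_stack, checker.unopened


print(check("<a><b><c></b></a>"))  # (['c'], [])
print(check("<a></b></a>"))        # ([], ['b'])
```

The parser handles attributes and the XML declaration itself, so no attribute-stripping step is needed in this variant.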