Get the text if the href attribute is in a list, but do not get it again when the href is repeated

I currently have two lists. The first contains anchor elements, where pairs of elements share the same href but have different text:

list1 = [<a href="link1">'text1'</a>, <a href="link1">'text2'</a>, 
         <a href="link2"><a href="link2"><span class="flagicon">
         <img Img stuff/></span>'text3'</a>, <a href="link2">'text4'</a>]

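From this list I managed to extract the href links, and then I removed all the duplicates. Each href appeared twice, so one copy of each was dropped. My list of unique href links is now: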

list2 = ['link1','link2']
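
For reference, the deduplication step described above could look roughly like this. It is only a sketch: it assumes the hrefs have already been pulled out into a plain list of strings, and the name `hrefs` is hypothetical.

```python
# Hypothetical input: the href of every anchor in list1, in document order.
hrefs = ['link1', 'link1', 'link2', 'link2']

# dict keys keep insertion order (Python 3.7+), so fromkeys() drops
# later duplicates while preserving the first-seen order.
list2 = list(dict.fromkeys(hrefs))
print(list2)  # ['link1', 'link2']
```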

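Now for the tricky part. I want to use the unique hrefs in the second list to look up the corresponding text in the first list, but only once per href. I used this example to extract only the unique href elements while preserving order, and I would like to use the same approach to get the text from list1 that belongs to each unique href.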

seen_text = set()
seen_text_add = seen_text.add
unique_text = [x.text for x in list1 if list2 in x and not (x in seen or seen_add(x))]

But this just returns an empty list. Can this be done?

Edit: my expected result is unique_text = ['text1', 'text3']

Here is how to do it with a generator (edited for the latest example):

import re
list1 = ["<a href='link1'>'text1'</a>",
         "<a href='link1'>'text2'</a>",
         "<a href='link2'><a href='link2'><span class='flagicon'><img Img stuff/></span>'text3'</a>",
         "<a href='link2'>'text4'</a>"]
list2 = ['link1', 'link2', 'link3']

def gen(txt):
    for elem in list1:
        if txt in elem:
            # Grab only the text between a pair of tags (meaning end of tag >text< start of next tag)
            yield re.match('.*>(?P<text>.+)<.*', elem).group('text')
# For each text in list2 create a generator that will yield matching text from list1.
# Call next on that generator to grab the first result only, with default value of "not found"
x = [next(gen(text), "not found") for text in list2]
print(x)
# Output: ["'text1'", "'text3'", 'not found']
# Further process the list (get rid of the quotes etc.)
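
As a side note, the attempt in the question returns an empty list because `list2 in x` asks whether the whole list is contained in a single element, which is never true. A single-pass variant that is closer to the "seen" idiom from the question might look like the sketch below. It assumes the elements are plain strings as in the sample above; the helper name `first_text_per_href` and the href regex are illustrative, not part of any library.

```python
import re

# Same string-based sample data as in the answer above.
list1 = ["<a href='link1'>'text1'</a>",
         "<a href='link1'>'text2'</a>",
         "<a href='link2'><a href='link2'><span class='flagicon'><img Img stuff/></span>'text3'</a>",
         "<a href='link2'>'text4'</a>"]
list2 = ['link1', 'link2', 'link3']


def first_text_per_href(elements, wanted_hrefs):
    """Return {href: text} for the first element seen with each wanted href."""
    wanted = set(wanted_hrefs)
    first_text = {}
    for elem in elements:
        href_match = re.search(r"href='([^']*)'", elem)
        text_match = re.match(r".*>(?P<text>.+)<.*", elem)
        if href_match is None or text_match is None:
            continue  # skip anything that does not look like an anchor
        href = href_match.group(1)
        # Keep only the first text for each href that is in the wanted list.
        if href in wanted and href not in first_text:
            first_text[href] = text_match.group("text").strip("'")
    return first_text


found = first_text_per_href(list1, list2)
# Hrefs with no matching element (here 'link3') are simply left out.
unique_text = [found[href] for href in list2 if href in found]
print(unique_text)  # ['text1', 'text3']
```

If list1 holds BeautifulSoup Tag objects rather than strings, as `x.text` in the question suggests, the two regexes would presumably be replaced by `x['href']` and `x.text`, with the rest of the loop unchanged.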

If this still doesn't work, could you print the contents of list1 and list2 and paste them here?
