Python lxml XPath 1.0: unique values of an element attribute

There is a known way to get unique values, but it does not work when I want unique attribute values. For example:

<a href = '11111'>sometext</a>
<a href = '11121'>sometext2</a>
<a href = '11111'>sometext3</a>

I want to get the unique href values. I am restricted to XPath 1.0.

page_src.xpath( '(//a[not(.=preceding::a)] )')
page_src.xpath( '//a/@href[not(.=preceding::a/@href)]' )

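Both return duplicates. Is there any way to solve this nightmare in the absence of unique-values()?

UPD: This is not the function-like solution I was hoping for, but I wrote a Python function that iterates over the parent elements and checks whether adding a parent-tag filter narrows the links down to the required count.

Here is my example: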

_x_item = (
    '//a[starts-with(@href, "%s") '
    'and (not(@href="%s")) '
    'and (not(starts-with(@href, "%s")))]'
    % (param1, param1, param2))
# rm double links
neededLinks = list(map(lambda vasa: vasa.get('href'), page_src.xpath(_x_item)))
if len(neededLinks) != len(set(neededLinks)):
    uniqLength = len(set(neededLinks))
    breakFlag = False
    for linkk in neededLinks:
        if neededLinks.count(linkk) > 1:
            dupLinks = page_src.xpath('//a[@href="%s"]' % linkk)
            dupLinkParents = list(map(lambda vasa: vasa.getparent(), dupLinks))
            for dupParent in dupLinkParents:
                # try the same expression restricted to this parent's tag
                tempLinks = page_src.xpath(_x_item.replace('//', '//%s/' % dupParent.tag))
                tempLinks = list(map(lambda vasa: vasa.get('href'), tempLinks))
                if len(tempLinks) == uniqLength:
                    breakFlag = True
                    _x_item = _x_item.replace('//', '//%s/' % dupParent.tag)
                    break
            if breakFlag:
                break

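This works if the duplicate links have different parents but the same @href value.

As a result, I add a parent.tag prefix, e.g. //div/my_prev_x_item

Also, using Python, I can refine the result to //div[@key1="val1" and @key2="val2"]/my_prev_x_item by iterating over dupParent.items(). But this only works if the items are not located in the same parent element.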
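The refinement described above could look roughly like the following. This is only a sketch: the helper name parent_step and the sample markup are made up for illustration, and it assumes attribute names without namespaces and values that contain no double quotes.

```python
from lxml import etree


def parent_step(parent):
    """Build an XPath step like div[@key1="val1" and @key2="val2"].

    Hypothetical helper: assumes plain (non-namespaced) attribute names
    and attribute values without double quotes.
    """
    conditions = ['@%s="%s"' % (name, value) for name, value in parent.items()]
    if not conditions:
        # No attributes to filter on, so fall back to the bare tag name.
        return parent.tag
    return '%s[%s]' % (parent.tag, ' and '.join(conditions))


# Sample document (an assumption): the same href under two different parents.
root = etree.fromstring(
    '<root>'
    '<div key1="val1" key2="val2"><a href="11111">sometext</a></div>'
    '<p><a href="11111">sometext3</a></p>'
    '</root>'
)

first_link = root.xpath('//a[@href="11111"]')[0]
step = parent_step(first_link.getparent())
narrowed = '//%s/a' % step
print(narrowed)
print(len(root.xpath(narrowed)))
```

As noted, this only helps when the duplicates sit under parents that can be told apart by tag or attributes.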

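In the end I need only the XPath expression itself, so I cannot just use list(set(myItems)).

I would like a simpler solution (such as unique-values()), if one exists. Also, my solution does not work if the links share the same parent.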

You can extract all the hrefs and then find the unique ones:

all_hrefs = page_src.xpath('//a/@href')
unique_hrefs = list(set(all_hrefs))
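If a single XPath 1.0 expression is required, one commonly used idiom is to put the test on the element and compare its @href against the @href of every preceding a. The expression //a[not(.=preceding::a)] from the question compares the text content of the links, and those texts (sometext, sometext2, sometext3) all differ, so nothing is filtered out. The sketch below wraps the three sample links in a made-up root element so that it parses as XML; a real page_src may behave differently.

```python
from lxml import etree

# The sample links from the question, wrapped in a hypothetical <root>.
root = etree.fromstring(
    "<root>"
    "<a href='11111'>sometext</a>"
    "<a href='11121'>sometext2</a>"
    "<a href='11111'>sometext3</a>"
    "</root>"
)

# Keep an <a> only if no earlier <a> in document order has the same href.
unique_links = root.xpath('//a[not(@href = preceding::a/@href)]')
unique_hrefs = [a.get('href') for a in unique_links]
print(unique_hrefs)
```

This keeps the first occurrence of each href in document order. The preceding axis is re-evaluated for every link, so it can become slow on very large pages.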
