Python lxml: XPath selects more content than expected



I am saving search results in a string. A typical result looks like this: https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2

In the HTML file, the information I need is in the following section:

<table class="table table-striped table-condensed" id="searchResults">
<thead>
<tr>
<th></th>
<th></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Symbol&amp;sortDir=Ascending"
target="_self">Symbol</a>
</th>
<th>Description</th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Category&amp;sortDir=Ascending"
target="_self">Category</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GeneCard#tocEl-2" target="_blank" title="Read more about gene categories"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Gifts&amp;sortDir=Ascending"
target="_self">GIFtS</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GeneCard#GIFtS" target="_blank"
title="Read more about GeneCards Inferred Functionality Scores (GIFtS)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Gcid&amp;sortDir=Ascending"
target="_self">GC id</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GCids" target="_blank" title="Read more about GeneCards identifiers (GC ids)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Score&amp;sortDir=Ascending"
target="_self">Score</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/Search#relevance" target="_blank" title="Read more about search scores"></a></th>
</tr>
</thead>
<tbody>
<tr>
<td class="index-col">1</td>
<td class="gc-expand-collapse expand-collapse-col"><a href="#"></a></td>
<td class="gc-gene-symbol gc-highlight symbol-col">
<a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&amp;keywords=NONHSAT072848.2" target="_blank"
data-track-event="Result Clicked" data-ga-label="IL1R1-AS1">IL1R1-AS1</a>
</td>
<td class="gc-highlight description-col">IL1R1 Antisense RNA 1</td>
<td class="category-col">RNA Gene</td>
<td class="gifts-col">9</td>
<td class="gc-highlight gcid-col">GC02M102174</td>
<td class="score-col">1.29</td>
</tr>
</tbody>
</table>

Here is my code:

import lxml.html
import requests
NONCODE_IDs = [
"NONHSAT072848.2",
"NONHSAT182278.1",
"NONHSAG077582.1",
"NONHSAG028748.2",
"NONHSAT151221.1",
"NONHSAT151222.1",
"NONHSAG000557.2"
]
# query link example: https://www.genecards.org/Search/Keyword?queryString=MAPK
my_header = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
}
link_base = "https://www.genecards.org/Search/Keyword?queryString="
query_link = link_base + NONCODE_IDs[0]
response = requests.get(query_link, headers=my_header)
html = lxml.html.fromstring(response.content)
table = html.xpath('//table[@id="searchResults"]')[0]

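However,

table = html.xpath('//table[@id="searchResults"]')[0] selects more content than expected.

etree.tostring(table) returns everything from the desired line <table class="table table-striped table-condensed" id="searchResults"> to the end of the HTML file.

I am not sure what I am doing wrong.

For this particular page, BeautifulSoup works for me. However, I am still looking for a general fix using lxml, because I am a fan of XPath, which BeautifulSoup does not support.

Here is the BeautifulSoup code that extracts the table correctly: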

from bs4 import BeautifulSoup
import requests
query_link = "https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"}
response = requests.get(query_link, headers=headers)
html = BeautifulSoup(response.content, "html.parser")
table = html.find_all("table", {"class": "table table-striped table-condensed", "id": "searchResults"})
print(table)

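I am still not entirely sure why this happens, but lxml (unlike BeautifulSoup) seems to treat this table as two separate parts: one containing <thead> and the other containing <tbody>. Indexing an lxml element returns its child elements, so to extract both, try: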

table = html.xpath('//table[@id="searchResults"]')[0]
print(lxml.html.tostring(table[0]).decode())
print(lxml.html.tostring(table[1]).decode())

The output should be the markup shown in your question.
