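I am saving search results in a string. A typical result page looks like this: https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2
In the HTML file, the information I need is in the following section: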
<table class="table table-striped table-condensed" id="searchResults">
<thead>
<tr>
<th></th>
<th></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Symbol&sortDir=Ascending"
target="_self">Symbol</a>
</th>
<th>Description</th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Category&sortDir=Ascending"
target="_self">Category</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GeneCard#tocEl-2" target="_blank" title="Read more about gene categories"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Gifts&sortDir=Ascending"
target="_self">GIFtS</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GeneCard#GIFtS" target="_blank"
title="Read more about GeneCards Inferred Functionality Scores (GIFtS)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Gcid&sortDir=Ascending"
target="_self">GC id</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/GCids" target="_blank" title="Read more about GeneCards identifiers (GC ids)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Score&sortDir=Ascending"
target="_self">Score</a>
<a class="gc-help-icon glyphicon glyphicon-question-sign" data-ga-action="Help Icon Click"
href="/Guide/Search#relevance" target="_blank" title="Read more about search scores"></a></th>
</tr>
</thead>
<tbody>
<tr>
<td class="index-col">1</td>
<td class="gc-expand-collapse expand-collapse-col"><a href="#"></a></td>
<td class="gc-gene-symbol gc-highlight symbol-col">
<a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&keywords=NONHSAT072848.2" target="_blank"
data-track-event="Result Clicked" data-ga-label="IL1R1-AS1">IL1R1-AS1</a>
</td>
<td class="gc-highlight description-col">IL1R1 Antisense RNA 1</td>
<td class="category-col">RNA Gene</td>
<td class="gifts-col">9</td>
<td class="gc-highlight gcid-col">GC02M102174</td>
<td class="score-col">1.29</td>
</tr>
</tbody>
</table>
Here is my code:
from lxml import etree  # needed for the etree.tostring() call used further down
import lxml.html
import requests
NONCODE_IDs = [
"NONHSAT072848.2",
"NONHSAT182278.1",
"NONHSAG077582.1",
"NONHSAG028748.2",
"NONHSAT151221.1",
"NONHSAT151222.1",
"NONHSAG000557.2"
]
# query link example: https://www.genecards.org/Search/Keyword?queryString=MAPK
my_header = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
}
link_base = "https://www.genecards.org/Search/Keyword?queryString="
query_link = link_base + NONCODE_IDs[0]
response = requests.get(query_link, headers=my_header)
html = lxml.html.fromstring(response.content)
table = html.xpath('//table[@id="searchResults"]')[0]
However,
table = html.xpath('//table[@id="searchResults"]')[0]
selects more than I expected.
etree.tostring(table)
returns everything from the desired line <table class="table table-striped table-condensed" id="searchResults">
through the end of the HTML file.
I am not sure what I am doing wrong.
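For this particular web page, BeautifulSoup works for me. However, I am still looking for a general fix that uses lxml, because I am a fan of XPath, which BeautifulSoup does not support.
Here is the BeautifulSoup code that extracts the table correctly: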
from bs4 import BeautifulSoup
import requests
query_link = "https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"}
response = requests.get(query_link, headers=headers)
html = BeautifulSoup(response.content, "html.parser")
table = html.find_all("table", {"class": "table table-striped table-condensed", "id": "searchResults"})
print(table)
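I am still not entirely sure why this happens, but lxml (unlike BeautifulSoup) seems to treat that table as two separate pieces: one containing <thead> and the other containing <tbody>. So, to extract both, try: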
table = html.xpath('//table[@id="searchResults"]')[0]
print(lxml.html.tostring(table[0]).decode())
print(lxml.html.tostring(table[1]).decode())
The output should match the table shown in your question.