Trying to web scrape NCBI with Selenium: the data won't load and isn't contained in an element with an ID I can wait for



I'm trying to scrape data from pages like https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta.

I'm using Beautiful Soup and Selenium.

The data sits inside an element with the id viewercontent1. When I print it out with this code:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import re
secondDriver = webdriver.Chrome(executable_path='/Users/me/Documents/chloroPlastGenScrape/chromedriver')
newLink = "https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta"
secondDriver.implicitly_wait(10)
WebDriverWait(secondDriver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')
secondDriver.get(newLink)
html2 = secondDriver.page_source
subSoup = BeautifulSoup(html2, 'html.parser')
viewercontent1 = subSoup.findAll("div", {"id" : "viewercontent1"})[0]
print(viewercontent1)

it prints out:

<div class="seq gbff" id="viewercontent1" sequencesize="450826" style="display: block;" val="426261815" virtualsequence=""><div class="loading">Loading ... <img alt="record loading animation" src="/core/extjs/ext-2.1/resources/images/default/grid/loading.gif"/></div></div>

The content doesn't seem to have loaded yet. I tried waiting implicitly and checking that the content had finished loading (both before and after the call to the .get() function), but that didn't seem to do anything. I also can't wait for the content to load by ID (presence_of_element_located), because the data is contained directly inside a <pre></pre> element that has no ID to wait on.

Any help would be greatly appreciated.

To get the content of the <div>, you can use the following script, which reads the sequence ID from the page's ncbi_uidlist <meta> tag and then requests the FASTA text directly from the sequence viewer endpoint, so no browser automation is needed:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta'
fasta_url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={id}&report=fasta'

# The numeric sequence ID is present in the static HTML as a <meta> tag,
# even though the FASTA text itself is filled in later by JavaScript.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
id_ = soup.select_one('meta[name="ncbi_uidlist"]')['content']

# Request the FASTA data directly from the viewer endpoint.
fasta_txt = requests.get(fasta_url.format(id=id_)).text
print(fasta_txt)

Prints:

>KC208619.1 Butomus umbellatus mitochondrion, complete genome
CCGCCTCTCCCCCCCCCCCCCCGCTCCGTTGTTGAAGCGGGCCCCCCCCATACTCATGAATCTGCATTCC
CAACCAAGGAGTTGTCTCATATAGACAGAGTTGGGCCCCCGTGTTCTGAGATCTTTTTCAACTTGATTAA
TAAAGAGGATTTCTCGGCCGTCTTTTTCGGCTAGGCTCCATTCGGGGTGGGTGTCCAGCTCGTCCCGCTT
CTCGTTAAAGAAATCGATAAAGGCTTCTTCGGGGGTGTAGGCGGCATTTTCCCCCAAGTGGGGATGTCGA
GAAAGCACTTCTTGAAAACGAGAATAAGCTGCGTGCTTACGTTCCCGGATTTGGAGATCCCGGTTTTCGA
...and so on.
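As a side note, if you would rather rely on a documented interface than on the sviewer URL above, NCBI's E-utilities efetch endpoint returns the same FASTA record given just the accession. A minimal sketch (KC208619.1 is the accession from the question; db, rettype and retmode are standard efetch parameters):

import requests

# Fetch the FASTA record via NCBI's documented E-utilities efetch endpoint.
efetch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
params = {
    'db': 'nuccore',       # nucleotide database
    'id': 'KC208619.1',    # accession from the question
    'rettype': 'fasta',    # FASTA report
    'retmode': 'text',     # plain text response
}
fasta_txt = requests.get(efetch_url, params=params).text
print(fasta_txt[:200])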

@Andrej's solution seems much simpler, but if you still want to go the waiting route...

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
newLink = "https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta"
driver.get(newLink)

# First wait for the page itself to finish loading...
WebDriverWait(driver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')

# ...then wait for the dynamically loaded <pre> inside #viewercontent1.
# A CSS selector lets you target it even though the <pre> has no ID of its own.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#viewercontent1 pre"))
)

html2 = driver.page_source
subSoup = BeautifulSoup(html2, 'html.parser')
viewercontent1 = subSoup.findAll("div", {"id": "viewercontent1"})[0]
print(viewercontent1)
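Incidentally, the explicit wait already returns the <pre> WebElement, so if all you need is the sequence text you can read it from that element directly instead of re-parsing driver.page_source with BeautifulSoup. A small sketch, continuing from the code above (fasta_txt is just an illustrative name):

# Continues from the code above: `element` is the <pre> returned by the wait.
fasta_txt = element.text    # header line plus the raw sequence
print(fasta_txt[:200])      # first 200 characters as a sanity check
driver.quit()               # close the browser when done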
