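I'm new to web scraping and have been working through some practice exercises on my own. I'm trying to extract the table shown at https://www.clinicaltrials.gov/ct2/results?cond=Activated+Protein+C+Resistance
I first looked through the page's scripts but didn't find the information there, so I simply searched for every table and tried to find the one that holds the data I'm looking for: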
import requests
from bs4 import BeautifulSoup

url = "https://www.clinicaltrials.gov/ct2/results?cond=Activated+Protein+C+Resistance"
re = requests.get(url)
soup = BeautifulSoup(re.text, "html.parser")
table = soup.find_all("table")
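I found two tables. The first one doesn't contain the data I'm looking for. The second has the same attributes as the table that holds the data, but it seems to have no tbody?
How can I extract the table I need and, more generally, what is the right way to find exactly where the data I'm looking for lives?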
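The table you're looking for is loaded by JavaScript, so you have to use a library like selenium, which drives a real browser. (As a side note, naming the response re may be confusing, because there is a useful standard-library module that is also called re.)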
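As for the general question of how to tell where the data lives: one quick check is to count the data rows in each table of the HTML that requests returned. If a table has headers but no rows there, while the browser's element inspector shows rows, the rows are being added by a script. The following is only a sketch of that check; the helper name `count_data_rows` and the sample HTML are made up for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup


def count_data_rows(html):
    """Return, for each <table> in the HTML, how many rows contain a <td> cell."""
    soup = BeautifulSoup(html, "html.parser")
    counts = []
    for table in soup.find_all("table"):
        data_rows = [tr for tr in table.find_all("tr") if tr.find("td")]
        counts.append(len(data_rows))
    return counts


# Hypothetical static HTML: the second table has a header row but no body rows,
# which is what a script-filled table looks like before any script has run.
sample_html = """
<table><tr><td>Individual Patients</td></tr></table>
<table><thead><tr><th>Row</th><th>Status</th></tr></thead></table>
"""
print(count_data_rows(sample_html))  # [1, 0]
```

Running the same function on `re.text` from the question and on the browser-rendered HTML should make the difference visible.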
To use selenium, you need to install it and download the chromedriver build for your operating system.
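import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Replace the placeholder with the full path to your chromedriver executable.
ser = Service("YOUR FULL PATH TO chromedriver.exe")
chrome_options = Options()
chrome_options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(service=ser, options=chrome_options)

url = "https://www.clinicaltrials.gov/ct2/results?cond=Activated+Protein+C+Resistance"
driver.get(url)
time.sleep(1)  # give the page's JavaScript time to fill in the table
html = driver.page_source
driver.quit()  # close the browser once the rendered HTML has been captured

soup = BeautifulSoup(html, "html.parser")
# Newer pandas versions expect a file-like object rather than a raw HTML string.
df_tables = pd.read_html(StringIO(str(soup)), flavor="bs4")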
Because df_tables is a list of DataFrames, you can look at each table by indexing it like any other list.
df_tables[0]
0
0 Individual Patients
1 Intermediate-size Population
2 Treatment IND/Protocol
df_tables[1]
   Row  Saved  Status      Study Title                                        Conditions                                       Interventions                                      Locations
0  1    NaN    Completed   Effect of Resistance Training in Water Combine...  Resistance, APC                                  Other: Resistance training that combines water...  Guo weiGuyuan, Ningxia, China
1  2    NaN    Terminated  Study With Atezolizumab Plus Bevacizumab in Pa...  MSIColoRectal CancerChemotherapyResistance, APC  Drug: AtezolizumabDrug: Bevacizumab                KU LeuvenLeuven, BelgiumOspedale Niguarda CA G...
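Indexing by position works, but the position can change if the page layout changes. One option is to pick the table by its column names instead. This is a sketch only: `pick_table` is a hypothetical helper, and the example DataFrames are small stand-ins for df_tables built from the output above.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for df in tables:
        if required.issubset(map(str, df.columns)):
            return df
    raise LookupError(f"no table has columns {sorted(required)}")


# Stand-ins for df_tables, trimmed to a couple of columns and rows.
example_tables = [
    pd.DataFrame({0: ["Individual Patients", "Intermediate-size Population"]}),
    pd.DataFrame({"Row": [1, 2], "Status": ["Completed", "Terminated"]}),
]
studies = pick_table(example_tables, ["Row", "Status"])
print(studies["Status"].tolist())  # ['Completed', 'Terminated']
```

With the real data this would be `pick_table(df_tables, ["Status", "Study Title"])`.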