Unable to extract a table from ClinicalTrials.gov with Beautiful Soup



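I'm new to web scraping and am doing some exercises on my own. I'm trying to extract the table that appears at https://www.clinicaltrials.gov/ct2/results?cond=Activated+Protein+C+Resistance

I first looked through the scripts but didn't find the information there, so I searched for all the tables and tried to find the one that holds the data I'm after.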

import requests
from bs4 import BeautifulSoup

url = "https://www.clinicaltrials.gov/ct2/results?cond=Activated+Protein+C+Resistance"
re = requests.get(url)
soup = BeautifulSoup(re.text, "html.parser")
table = soup.find_all("table")

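I found two tables. The first one doesn't contain the data I'm looking for. The second has the same attributes as the table that holds the data, but it seems to have no tbody?

How can I extract the table I need and, more generally, what is the right way to find exactly where the data I'm looking for is located?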

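The table you're looking for is loaded by JavaScript, so requests only receives the page before the rows are filled in, and you'll have to use a library like selenium that runs the page in a real browser. (As a side note, naming the response re can be confusing, because there is a useful standard library module called re.)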
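
One way to check whether a table's rows are really absent from the static HTML is to count the rows of every table in the raw response text. The following is a minimal sketch using only the standard library; `TableRowCounter`, `count_table_rows`, and the sample HTML are illustrative names and data, not taken from the site.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Count <tr> elements inside each <table> of an HTML document."""

    def __init__(self):
        super().__init__()
        self.row_counts = []    # one entry per <table>, in document order
        self._open_tables = []  # indices of tables currently open (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.row_counts.append(0)
            self._open_tables.append(len(self.row_counts) - 1)
        elif tag == "tr" and self._open_tables:
            # Attribute the row to the innermost open table.
            self.row_counts[self._open_tables[-1]] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._open_tables:
            self._open_tables.pop()


def count_table_rows(html):
    """Return a list with the number of rows found in each table."""
    parser = TableRowCounter()
    parser.feed(html)
    return parser.row_counts


# Hypothetical static HTML: the second table is an empty shell whose rows
# would only be filled in later by JavaScript.
sample_html = """
<table id="legend"><tr><td>Individual Patients</td></tr></table>
<table id="results"><thead></thead></table>
"""
print(count_table_rows(sample_html))  # [1, 0]
```

With the real page you would pass the text of the requests response instead of `sample_html`. A table that shows rows in the browser but few or none here is probably being populated by JavaScript.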

To run selenium with Chrome, you need to install the selenium package and download the chromedriver build for your operating system.

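import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

ser = Service("YOUR FULL PATH TO chromedriver.exe")  # replace with your own path
chrome_options = Options()
chrome_options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(service=ser, options=chrome_options)
url = "https://www.clinicaltrials.gov/ct2/results?cond=Activated+Protein+C+Resistance"
driver.get(url)
time.sleep(1)  # give the JavaScript time to fill in the table
html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, "html.parser")
# StringIO avoids passing literal HTML, which recent pandas versions deprecate
df_tables = pd.read_html(StringIO(str(soup)), flavor="bs4")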
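
A fixed one-second sleep may be too short on a slow connection. A common alternative is to poll until the rows have actually appeared. The sketch below shows the idea in plain Python; `wait_until` and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With the driver above, a call such as `wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, "table tbody tr"))` (with `By` imported from `selenium.webdriver.common.by`) could replace the sleep. The selector is a guess and should be checked against the actual page. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui`, which does the same kind of polling.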

Because df_tables is a list of DataFrames, you can look at each table by its index, just as with any other list.

df_tables[0]
                              0
0           Individual Patients
1  Intermediate-size Population
2        Treatment IND/Protocol
df_tables[1]
   Row  Saved      Status                                        Study Title                                       Conditions                                      Interventions                                          Locations
0    1    NaN   Completed  Effect of Resistance Training in Water Combine...                                  Resistance, APC  Other: Resistance training that combines water...                      Guo weiGuyuan, Ningxia, China
1    2    NaN  Terminated  Study With Atezolizumab Plus Bevacizumab in Pa...  MSIColoRectal CancerChemotherapyResistance, APC                Drug: AtezolizumabDrug: Bevacizumab  KU LeuvenLeuven, BelgiumOspedale Niguarda CA G...
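
Rather than relying on the position in the list, the table can also be picked by the column names it contains. This is a small sketch under that assumption; `find_table` is a hypothetical helper, not part of pandas.

```python
def find_table(tables, required_columns):
    """Return the first table whose columns include every required name.

    `tables` is a list such as the one returned by pd.read_html; each item
    only needs a `.columns` attribute. Returns None if nothing matches.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(str(name) for name in table.columns):
            return table
    return None
```

For the output shown above, `find_table(df_tables, ["Status", "Study Title"])` would return the second table. Compare the result with `is None` rather than testing its truth value, because a DataFrame cannot be evaluated as a plain boolean.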
