How can I get information from a website using BeautifulSoup in Python?



I need to get the publication date shown on the following web page, using BeautifulSoup in Python:

https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410

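The problem is that when I search the HTML in the browser's "Inspect" view, I find the publication date right away, but when I search the HTML retrieved in Python, I cannot find it, even with the find() and find_all() functions.

I tried this code: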

import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')

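But it gives me [], even though this tag is present in the "Inspect" view of the online page.

What am I doing wrong? Is the "Inspect" code different from the one I get with BeautifulSoup?

How can I solve this problem and get the number?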

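I think the problem is that the content you are looking for is loaded by JavaScript after the initial page load. requests only gives you the initial page content, before JavaScript has modified the DOM.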
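One way to check this diagnosis is to test whether the element exists in the HTML that was actually downloaded. The sketch below uses two made-up HTML strings (assumptions, not the real Espacenet markup): one stands in for the initial response that requests receives, the other for the DOM after the scripts have run. The helper name has_element_id is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins, not the real Espacenet markup.
# What a JavaScript-driven page might send before any script runs:
initial_html = """
<html><body>
  <div id="app"></div>
  <script src="bundle.js"></script>
</body></html>
"""

# What the browser's inspector might show after the scripts have filled the page:
rendered_html = """
<html><body>
  <div id="app">
    <span id="biblio-publication-number-content">CN105030410A</span>
  </div>
</body></html>
"""


def has_element_id(html, element_id):
    """Return True if the parsed HTML contains an element with this id."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id=element_id) is not None


target = "biblio-publication-number-content"
print(has_element_id(initial_html, target))   # False
print(has_element_id(rendered_html, target))  # True
```

With the real page, the same check would be has_element_id(r.text, target). If it returns False while the inspector shows the element, the content is most likely added by JavaScript.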

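To handle this, you can install selenium and then download the Selenium web driver for your specific browser. Put the driver in a directory that is on your PATH, and then (here I am using Chrome):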

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()

Prints:

<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
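Since the question asks for the publication date, here is a minimal sketch of pulling it out of the element printed above. It parses a verbatim copy of that output instead of a live page, and it assumes the date is always written as YYYY-MM-DD and that the publication number consists of uppercase letters and digits; neither assumption is confirmed for other Espacenet records.

```python
import re
from datetime import date

from bs4 import BeautifulSoup

# The element printed above, copied verbatim.
html = ('<span id="biblio-publication-number-content">'
        '<span class="search">CN105030410</span>A·2015-11-11</span>')

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="biblio-publication-number-content")

# get_text() joins the text of the span and of its nested span.
text = span.get_text()

# Assumption: the date is always written as YYYY-MM-DD.
date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)
publication_date = date.fromisoformat(date_match.group()) if date_match else None

# Assumption: the number is the leading run of uppercase letters and digits.
number_match = re.match(r"[A-Z0-9]+", text)
publication_number = number_match.group() if number_match else None

print(publication_number)  # CN105030410A
print(publication_date)    # 2015-11-11
```

In the Selenium script, the same lines could run on the soup built from driver.page_source.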

Umberto, if you are looking for the HTML span elements, use the following code:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]

If you are looking for the HTML element whose id is 'biblio-publication-number-content', use the following code:


import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')

In the first case you get all span elements; in the second case you get all elements whose id is 'biblio-publication-number-content'.
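A related detail is that the code in the question passes id_= rather than id=. In BeautifulSoup only class_ takes a trailing underscore (because class is a reserved word in Python); other keyword arguments are used as attribute names exactly as written. The sketch below shows the difference on a made-up static HTML string, so the result does not depend on JavaScript. On the real page, the element still has to be present in the downloaded HTML for any of these calls to find it.

```python
from bs4 import BeautifulSoup

# Made-up static HTML, not the real Espacenet markup.
html = '<p><span id="biblio-publication-number-content">CN105030410A</span></p>'
soup = BeautifulSoup(html, "html.parser")

# id_= is treated as a filter on an attribute literally named "id_",
# which no element here has, so nothing matches.
wrong = soup.find_all("span", id_="biblio-publication-number-content")

# id= filters on the real id attribute.
right = soup.find_all("span", id="biblio-publication-number-content")

# A CSS selector is another way to ask for the same element.
by_css = soup.select("span#biblio-publication-number-content")

print(len(wrong), len(right), len(by_css))  # 0 1 1
```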

I suggest you study HTML tags and elements to get a deeper understanding of how they work and what their semantics are.
