Scraping a DOI from a web page with BeautifulSoup



I'm currently working on a web scraping project and need information from Google Scholar records. I need to extract an article's DOI; the relevant HTML looks like this:

<span data-v-d3a5356a="" class="metadata--doi">DOI:
<a data-v-d3a5356a="" id="article--doi--link-metadataSec" href="//doi.org/10.1007/s00508-019-1485-6">10.1007/s00508-019-1485-6</a>&nbsp;</span>

I can't extract it with the following call:

page = BeautifulSoup(response.text, 'html.parser')
page.find_all("span", "data-v-d3a5356a")

How can I extract the string "10.1007/s00508-019-1485-6"?

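The page is dynamic, meaning the data is loaded by JavaScript. BeautifulSoup only parses the HTML that requests returns and does not execute JavaScript, so it won't see that data. To scrape the rendered page you would have to use Selenium.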
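If the rendered HTML is already available (for example from Selenium's page_source), the DOI can be read with an ordinary class lookup. The sketch below runs against the snippet from the question as a stand-in for the rendered page; the variable names are illustrative only. It also shows why the original call returned nothing: the second positional argument of find_all is matched against the class, while data-v-d3a5356a is an attribute name.

```python
from bs4 import BeautifulSoup

# The snippet from the question, standing in for fully rendered page source.
html = '''<span data-v-d3a5356a="" class="metadata--doi">DOI:
<a data-v-d3a5356a="" id="article--doi--link-metadataSec" href="//doi.org/10.1007/s00508-019-1485-6">10.1007/s00508-019-1485-6</a>&nbsp;</span>'''

soup = BeautifulSoup(html, "html.parser")

# find_all("span", "data-v-d3a5356a") filters by CSS class, so it matches nothing.
by_class = soup.find_all("span", "data-v-d3a5356a")

# Filtering on the attribute itself (True = "attribute is present") does match.
by_attr = soup.find_all("span", attrs={"data-v-d3a5356a": True})

# The most direct route: look up the span by its real class, then its link.
span = soup.find("span", class_="metadata--doi")
link = span.find("a") if span else None
doi = link.get_text(strip=True) if link else None

print(len(by_class), len(by_attr))
print(doi)
```

The class lookup is preferable to the data-v-* attribute because such attributes look auto-generated and may change between builds of the site.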

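However, if you open the Network tab in Chrome DevTools, you can see that the data is loaded from an API, so you can fetch it directly from that endpoint.

Here is how to extract the data from the API endpoint: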

import requests

url = 'https://europepmc.org/api/get/articleApi?query=(EXT_ID:30980146%20AND%20SRC:med)&format=json&resultType=core'
r = requests.get(url)
x = r.json()
print(f"DOI: {x['resultList']['result'][0]['doi']}")

Output:

DOI: 10.1007/s00508-019-1485-6
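Indexing straight into x['resultList']['result'][0] raises an exception when a query returns no results. A minimal defensive sketch follows, assuming only the response shape shown above (resultList, then result, then doi); extract_doi is a hypothetical helper name, and the sample dictionaries are hand-written stand-ins for r.json(), not real API responses.

```python
def extract_doi(payload):
    """Return the first result's DOI, or None if there is no usable result."""
    results = payload.get("resultList", {}).get("result", [])
    if not results:
        return None
    # .get() returns None when the record has no "doi" key.
    return results[0].get("doi")


# Hand-written stand-ins for r.json(); only the keys used above are included.
with_result = {"resultList": {"result": [{"doi": "10.1007/s00508-019-1485-6"}]}}
without_result = {"resultList": {"result": []}}

print(extract_doi(with_result))
print(extract_doi(without_result))
```

In the snippet above, the call would be extract_doi(r.json()).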

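Ram has already shown how to get the DOI from europepmc.org. I've added code that also extracts the DOI link and the abstract, and combined everything into one script that additionally parses the same fields (DOI, DOI URL, abstract) from ieeexplore.ieee.org.

See the JSON string parsed from ieeexplore.ieee.org: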

from bs4 import BeautifulSoup
import requests, re, json

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

links = [
    "https://europepmc.org/api/get/articleApi?query=(EXT_ID:30980146%20AND%20SRC:med)&format=json&resultType=core",
    "https://ieeexplore.ieee.org/abstract/document/9599583"
]

data = []

for link in links:
    if "ieeexplore" in link:
        html = requests.get(link, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        # The article metadata is embedded as JSON in an inline <script>.
        # Parse it once and reuse the resulting dict.
        # https://regex101.com/r/8vfYNp/1
        metadata = json.loads(re.findall(r"xplGlobal.document.metadata=(.*?);", str(soup.select("script")))[0])

        data.append({
            "parsed_url": link,
            "doi": metadata["doi"],
            "doi_link": metadata["doiLink"],
            "abstract": metadata["abstract"],
        })
    else:
        # The Europe PMC endpoint already returns JSON.
        response = requests.get(link, headers=headers, timeout=30).json()
        result = response["resultList"]["result"][0]

        data.append({
            "parsed_url": link,
            "doi": result["doi"],
            "doi_link": result["fullTextUrlList"]["fullTextUrl"][0]["url"],
            "abstract": result["abstractText"],
        })

print(json.dumps(data, indent=2))

Full output:

[
  {
    "parsed_url": "https://europepmc.org/api/get/articleApi?query=(EXT_ID:30980146%20AND%20SRC:med)&format=json&resultType=core",
    "doi": "10.1007/s00508-019-1485-6",
    "doi_link": "https://doi.org/10.1007/s00508-019-1485-6",
    "abstract": "This position statement is based on current evidence available on the safety and benefits of continuous subcutaneous insulin infusion therapy (CSII, pump therapy) in diabetes with an emphasis on the effects of CSII on glycemic control, hypoglycaemia rates, occurrence of ketoacidosis, quality of life and the use of insulin pump therapy in pregnancy. The current article represents the recommendations of the Austrian Diabetes Association for the clinical praxis of insulin pump treatment in children, adolescents and adults."
  },
  {
    "parsed_url": "https://ieeexplore.ieee.org/abstract/document/9599583",
    "doi": "10.1109/JPHOT.2021.3124611",
    "doi_link": "https://doi.org/10.1109/JPHOT.2021.3124611",
    "abstract": "This paper comprehensively investigated noise characteristics of superluminal propagation based on low-noise single-frequency Brillouin lasing oscillation with the aid of a population inversion dynamic grating. Thanks to high-degree polarization alignment between the Brillouin pump and the lased Stokes lightwaves in polarization maintaining fibers, efficient Brillouin lasing resonance with over 10-dB relative intensity noise suppression has been demonstrated to activate Brillouin loss-induced anomalous dispersion in the vicinity of pump signals, benefiting a noise-insensitive superluminal propagation along kilometer-long optical fibers with robust resistance to ambient disturbance. Consequently, sinusoidally modulated pump signals experienced the time advancement of 4634.0 ns at the group velocity of 10.63\n<italic xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">c</italic>\n. Results show that the variance of the fractional advancement with polarization maintaining fibers is 2.54 \u00d7 10\n<sup xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\u22124</sup>\n which is two orders of magnitude lower than that of conventional single mode fibers. Furthermore, the dependence of the group velocity on the modulation frequency was experimentally investigated, showing good agreement with the theoretical analysis."
  }
]
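The regex in the script stops at the first semicolon after the marker, so it would truncate the JSON if a field such as the abstract contained a ";". One possible alternative, sketched here under the assumption that the page still embeds the metadata as xplGlobal.document.metadata={...}, is to locate the marker with a regex and let json.JSONDecoder.raw_decode read exactly one JSON value from that position. extract_metadata is a hypothetical helper, and the sample page is a synthetic string, not real IEEE Xplore markup.

```python
import json
import re

# Marker that precedes the embedded JSON object (taken from the script above).
MARKER = re.compile(r"xplGlobal\.document\.metadata\s*=\s*")


def extract_metadata(html_text):
    """Return the embedded metadata dict, or None if the marker is absent."""
    match = MARKER.search(html_text)
    if match is None:
        return None
    # raw_decode parses one complete JSON value starting at the given index
    # and ignores whatever follows it (here: ";</script>").
    metadata, _end = json.JSONDecoder().raw_decode(html_text, match.end())
    return metadata


# Synthetic page fragment; note the semicolon inside the abstract.
sample_page = (
    '<script>xplGlobal.document.metadata={"doi": "10.1109/JPHOT.2021.3124611", '
    '"abstract": "first part; second part"};</script>'
)

meta = extract_metadata(sample_page)
print(meta["doi"])
print(meta["abstract"])
```

With the script above, the call would be extract_metadata(html.text), which also removes the need to stringify the list returned by soup.select("script").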
