
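I am currently working on a web scraping project and need information from Google Scholar records. I need to extract the DOI of an article, and the relevant part of the HTML page looks like this: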

<span data-v-d3a5356a="" class="metadata--doi">DOI:
<a data-v-d3a5356a="" id="article--doi--link-metadataSec" href="//">10.1007/s00508-019-1485-6</a>&nbsp;</span>


This is what I have tried so far:

page = BeautifulSoup(response.text, 'html.parser')
page.find_all("span", "data-v-d3a5356a")
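
One likely reason the call above finds nothing: when `find_all` receives a plain string as its second argument, Beautiful Soup treats it as a CSS class name, and `data-v-d3a5356a` is an attribute on the span, not its class. The sketch below runs against the HTML fragment from the question as a static string (the `html` variable is introduced here only for illustration) and selects the span by its actual class. It only helps if the span is present in the HTML that `requests` receives.

```python
from bs4 import BeautifulSoup

# The fragment from the question, used as a static string for illustration.
html = '''<span data-v-d3a5356a="" class="metadata--doi">DOI:
<a data-v-d3a5356a="" id="article--doi--link-metadataSec" href="//">10.1007/s00508-019-1485-6</a>&nbsp;</span>'''

page = BeautifulSoup(html, "html.parser")

# A bare string as the second argument is matched against the class attribute,
# so this searches for class="data-v-d3a5356a" and comes back empty.
by_wrong_class = page.find_all("span", "data-v-d3a5356a")

# Match on the real class, then read the text of the nested <a> tag.
span = page.find("span", class_="metadata--doi")
doi = span.a.get_text(strip=True) if span and span.a else None
print(doi)
```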



However, if you look at the Network tab in Chrome DevTools, you can see that the data is loaded from an API. You can get the data directly from that API endpoint:


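import requests

url = ''  # the API URL seen in the Network tab goes here
r = requests.get(url)
x = r.json()  # the endpoint returns JSON, so no HTML parsing is needed
print(f"DOI: {x['resultList']['result'][0]['doi']}")

Output:

DOI: 10.1007/s00508-019-1485-6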
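
The chained lookup `x['resultList']['result'][0]['doi']` raises an exception whenever the API returns no results. A minimal sketch of a more defensive lookup follows. The helper name `extract_doi` and the `sample` payload are hypothetical; the payload mirrors only the keys used in the snippet above, not the full API response.

```python
from typing import Any, Optional


def extract_doi(payload: dict) -> Optional[str]:
    """Return the DOI of the first result, or None if there is no result."""
    results: Any = payload.get("resultList", {}).get("result", [])
    if not results:
        return None
    return results[0].get("doi")


# Hypothetical payload containing only the keys the answer relies on.
sample = {"resultList": {"result": [{"doi": "10.1007/s00508-019-1485-6"}]}}
print(f"DOI: {extract_doi(sample)}")
```

With a live response this would be called as `extract_doi(r.json())`.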

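Ram has already shown how to scrape the DOI from europepmc.org. I have added a code example that also extracts the DOI link and the abstract, and combined everything into one script, including parsing the same data (DOI, DOI URL, abstract) from ieeexplore pages.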


from bs4 import BeautifulSoup
import requests, re, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

# URLs to parse; the entries are omitted in the source.
links = [
]

data = []

for link in links:
    if "ieeexplore" in link:
        # IEEE Xplore embeds its metadata as JSON inside a <script> tag.
        html = requests.get(link, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        # Assumption: the regex is run over the page's <script> tags;
        # the source has str("script"), which can never match.
        scripts = str(soup.find_all("script"))
        metadata = json.loads(re.findall(r"xplGlobal\.document\.metadata=(.*?);", scripts)[0])

        data.append({
            "parsed_url": link,
            "doi": metadata["doi"],
            "doi_link": metadata["doiLink"],
            "abstract": metadata["abstract"],
        })
    else:
        # Any other link is treated as the europepmc.org JSON API.
        html = requests.get(link, headers=headers, timeout=30).json()

        result = html["resultList"]["result"][0]
        data.append({
            "parsed_url": link,
            "doi": result["doi"],
            "doi_link": result["fullTextUrlList"]["fullTextUrl"][0]["url"],
            "abstract": result["abstractText"],
        })

print(json.dumps(data, indent=2))

Output:

[
  {
    "parsed_url": "",
    "doi": "10.1007/s00508-019-1485-6",
    "doi_link": "",
    "abstract": "This position statement is based on current evidence available on the safety and benefits of continuous subcutaneous insulin infusion therapy (CSII, pump therapy) in diabetes with an emphasis on the effects of CSII on glycemic control, hypoglycaemia rates, occurrence of ketoacidosis, quality of life and the use of insulin pump therapy in pregnancy. The current article represents the recommendations of the Austrian Diabetes Association for the clinical praxis of insulin pump treatment in children, adolescents and adults."
  },
  {
    "parsed_url": "",
    "doi": "10.1109/JPHOT.2021.3124611",
    "doi_link": "",
    "abstract": "This paper comprehensively investigated noise characteristics of superluminal propagation based on low-noise single-frequency Brillouin lasing oscillation with the aid of a population inversion dynamic grating. Thanks to high-degree polarization alignment between the Brillouin pump and the lased Stokes lightwaves in polarization maintaining fibers, efficient Brillouin lasing resonance with over 10-dB relative intensity noise suppression has been demonstrated to activate Brillouin loss-induced anomalous dispersion in the vicinity of pump signals, benefiting a noise-insensitive superluminal propagation along kilometer-long optical fibers with robust resistance to ambient disturbance. Consequently, sinusoidally modulated pump signals experienced the time advancement of 4634.0 ns at the group velocity of 10.63\n<italic xmlns:mml=\"\" xmlns:xlink=\"\">c</i>\n. Results show that the variance of the fractional advancement with polarization maintaining fibers is 2.54 \u00d7 10\n<sup xmlns:mml=\"\" xmlns:xlink=\"\">\u22124</sup>\n which is two orders of magnitude lower than that of conventional single mode fibers. Furthermore, the dependence of the group velocity on the modulation frequency was experimentally investigated, showing good agreement with the theoretical analysis."
  }
]

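
One caveat about the pattern `xplGlobal.document.metadata=(.*?);`: the non-greedy group stops at the first semicolon, so the captured JSON would be cut short if any metadata value itself contains a `;` (an HTML entity in an abstract, for instance). A possible alternative, sketched below, is to locate the assignment and let `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The helper name `extract_metadata` and the `sample_script` string are hypothetical; the sample is not real IEEE Xplore output beyond the DOI shown above.

```python
import json
from typing import Any, Optional

MARKER = "xplGlobal.document.metadata="


def extract_metadata(script_text: str) -> Optional[Any]:
    """Decode the JSON value assigned to xplGlobal.document.metadata, if any."""
    start = script_text.find(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    # raw_decode does not skip leading whitespace, so skip it here.
    while start < len(script_text) and script_text[start].isspace():
        start += 1
    try:
        # Parses one complete JSON value and ignores whatever follows it.
        obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    except json.JSONDecodeError:
        return None
    return obj


# Hypothetical script content; the abstract contains semicolons on purpose.
sample_script = (
    'xplGlobal.document.metadata={"doi": "10.1109/JPHOT.2021.3124611", '
    '"abstract": "a &amp; b; c"};'
)
metadata = extract_metadata(sample_script)
print(metadata["doi"], "|", metadata["abstract"])
```

In the script above, this would replace the `re.findall(...)` call, with the page's script text passed in as `script_text`.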