使用BeautifulSoup和Python从PubMed搜索结果中抓取引文文本?



所以我试图从每篇文章的 PubMed 搜索中抓取所有 AMA 格式的引文。 以下代码仅用于获取第一篇文章中的引文数据。

import requests
import xlsxwriter
from bs4 import BeautifulSoup

URL = 'https://pubmed.ncbi.nlm.nih.gov/?term=infant+formula&size=200'
response = requests.get(URL)
html_soup = BeautifulSoup(response.text, 'html5lib')
article_containers = html_soup.find_all('article', class_ = 'labs-full-docsum')
first_article = article_containers[0]
citation_text = first_article.find('div', class_ = 'docsum-wrap').find('div', class_ = 'result-actions-bar').div.div.find('div', class_ = 'content').div.div.text
print(citation_text)

该脚本返回一个空行,即使当我通过谷歌浏览器检查源代码时,文本在该"div"中清晰可见。

这与 JavaScript 有关吗?如果是,我该如何解决它?

此脚本将从提供的 URL 获取"AMA"格式的所有引文:

import json
import requests
from bs4 import BeautifulSoup

URL = 'https://pubmed.ncbi.nlm.nih.gov/?term=infant+formula&size=200'
response = requests.get(URL)
html_soup = BeautifulSoup(response.text, 'html5lib')
for article in html_soup.select('article'):
print(article.select_one('.labs-docsum-title').get_text(strip=True, separator=' '))
citation_id = article.input['value']
data = requests.get('https://pubmed.ncbi.nlm.nih.gov/{citation_id}/citations/'.format(citation_id=citation_id)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print(data['ama']['orig'])
print('-' * 80)

指纹:

Review of Infant Feeding: Key Features of Breast Milk and Infant Formula .
Martin CR, Ling PR, Blackburn GL. Review of Infant Feeding: Key Features of Breast Milk and Infant Formula. Nutrients. 2016;8(5):279. Published 2016 May 11. doi:10.3390/nu8050279
--------------------------------------------------------------------------------
Prebiotics in infant formula .
Vandenplas Y, De Greef E, Veereman G. Prebiotics in infant formula. Gut Microbes. 2014;5(6):681-687. doi:10.4161/19490976.2014.972237
--------------------------------------------------------------------------------
Effects of infant formula composition on long-term metabolic health.
Lemaire M, Le Huërou-Luron I, Blat S. Effects of infant formula composition on long-term metabolic health. J Dev Orig Health Dis. 2018;9(6):573-589. doi:10.1017/S2040174417000964
--------------------------------------------------------------------------------
Selenium in infant formula milk.
He MJ, Zhang SQ, Mu W, Huang ZW. Selenium in infant formula milk. Asia Pac J Clin Nutr. 2018;27(2):284-292. doi:10.6133/apjcn.042017.12
--------------------------------------------------------------------------------
... and so on.

最新更新