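I'm trying to relearn Python because my skills have gotten rusty. I'm currently playing around with the PubMed API. I'm trying to parse the XML file given here, then loop over each child (`PubmedArticle`) and grab a few things, for now just the article title, and enter them into a dictionary under the key of the PubMed ID (PMID).
I.e., the output should look like: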
{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
 '29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}
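Later I'll add more fields, such as authors and journal, but for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that can do this for me, but that would defeat the purpose of learning. I've tried a bunch of different things, and this is what I currently have: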
import requests
from lxml import etree

article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)
dict_out = {}
for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])
    dict_out[pmid] = {'title': title}
print(dict_out)
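I may be misunderstanding how to go about this, but if anyone can offer some insight or point me toward the right resources, it would be greatly appreciated.
Edit: My apologies. I wrote this faster than I should have. I've fixed all the casing. Also, the result seems to join the PMIDs together while giving only the first title: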
{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}
ta
code.py :
#!/usr/bin/env python3

import sys
import requests
from lxml import etree
from pprint import pprint as pp

ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"


def main():
    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    # Collect all PMID nodes and all title nodes, each in document order
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    # zip() would silently truncate on a mismatch, so bail out instead
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
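Notes:
- I restructured the code a bit and renamed some variables for clarity
- ids holds the list of PMID nodes, while titles holds the (corresponding) ArticleTitle nodes (mind the path!)
- The way to join them in the desired format is a dict comprehension ([Python]: PEP 274 - Dict Comprehensions); to iterate over the 2 lists at the same time, [Python 3]: zip(*iterables) is used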
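The pairing step described in the notes can be tried offline. This is only a minimal sketch: `SAMPLE` is a simplified, hypothetical stand-in for the efetch response (the real records carry many more elements), and the `ImportError` fallback to the standard library is an assumption added so the snippet runs even without lxml installed. It uses `findall`, which lxml and `xml.etree.ElementTree` both provide.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the stdlib, which offers the same findall() API
    import xml.etree.ElementTree as etree

# Simplified, hypothetical stand-in for the efetch XML response
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29150897</PMID>
      <Article><ArticleTitle>First title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29149862</PMID>
      <Article><ArticleTitle>Second title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = etree.fromstring(SAMPLE)

# Two parallel lists, both in document order
ids = root.findall(".//MedlineCitation/PMID[@Version='1']")
titles = root.findall(".//Article/ArticleTitle")

# zip() walks both lists in lockstep; the dict comprehension builds the mapping
result = {pmid.text: {"title": title.text} for pmid, title in zip(ids, titles)}
print(result)
```

Pairing by position only works while every article contributes exactly one node to each list, which is why the answer's code checks the two lengths before zipping.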
Output:
(py35x64_test) c:\Work\Dev\StackOverflow\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}
First, XML is case sensitive, and you are using lowercase tags in your XPath.
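A quick way to see the effect of casing, as a sketch on a tiny made-up document rather than the real PubMed response (the `ImportError` fallback to the standard library is an assumption so the snippet runs without lxml):

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the stdlib, which offers the same findall() API
    import xml.etree.ElementTree as etree

# Made-up document with two empty PubmedArticle elements
root = etree.fromstring(
    b"<PubmedArticleSet><PubmedArticle/><PubmedArticle/></PubmedArticleSet>"
)

lower = root.findall(".//pubmedarticle")  # wrong case: matches nothing
exact = root.findall(".//PubmedArticle")  # exact case: matches both elements

print(len(lower), len(exact))
```

A lookup with the wrong case does not raise an error; it just returns an empty list, so the loop body never runs and the dictionary stays empty.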
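I also believe pmid should be a number (or a string representing a number), which does not seem to be the case here.
In my test,
`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])`
produces a single string of concatenated digits, which is not what you are looking for.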
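The concatenation happens because, in lxml, an XPath expression that starts with `//` is evaluated from the document root even when `.xpath()` is called on a child element, so every article in the loop sees every PMID. Prefixing the expression with a dot (`.//ArticleTitle`) anchors it to the current element. Below is a minimal sketch of per-article lookups using `findtext`, which is always relative to the element it is called on. The helper name `titles_by_pmid` and the sample XML are hypothetical, and the stdlib fallback is an assumption so the snippet runs without lxml.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the stdlib, which offers the same findtext() API
    import xml.etree.ElementTree as etree


def titles_by_pmid(xml_bytes):
    """Map each article's PMID to its title, one article at a time."""
    root = etree.fromstring(xml_bytes)
    out = {}
    for article in root.iter("PubmedArticle"):
        # Both paths are relative to this article, not to the whole document
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext(".//ArticleTitle")
        out[pmid] = {"title": title}
    return out


# Simplified, hypothetical stand-in for the efetch XML response
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29150897</PMID>
      <Article><ArticleTitle>First title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29149862</PMID>
      <Article><ArticleTitle>Second title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

print(titles_by_pmid(SAMPLE))
```

Because each PMID and title are read from the same `PubmedArticle` element, the pairing does not depend on two separate lists staying the same length.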