Parsing PubMed API XML with lxml, then collecting the children into a dictionary



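I'm trying to relearn Python, since my skills have lapsed. I'm currently playing around with the PubMed API. I'm trying to parse the XML file given here, then run a loop to go through each child ('/pubmedarticle') and grab a few things, for now just the article title, and enter them into a dictionary under the key of the PubMed ID (PMID).

I.e. the output should look like: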

{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
 '29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}

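Later I will add more fields, such as the authors and the journal; for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that can do this for me, but that defeats the purpose of learning. I've tried a bunch of different things, and this is where I'm currently at: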

import requests
from lxml import etree
article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)
dict_out = {}
for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])
    dict_out[pmid] = {'title': title}
print(dict_out)

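I may be misunderstanding how to go about this, but if anyone can offer some insight or point me towards the right resources, it would be greatly appreciated.

Edit: My apologies. I wrote this faster than I should have. I've fixed all the casing. Also, the result seems to combine the PMIDs, while giving only the first title: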

{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}


code.py

#!/usr/bin/env python3
import sys
import requests
from lxml import etree
from pprint import pprint as pp
ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"

def main():
    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)

if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

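Notes:

  • I restructured the code a bit and renamed some variables for clarity
  • ids holds the list of PMID nodes, while titles holds the (corresponding) ArticleTitle nodes (mind the path!)
  • The way to join them in the desired format is [Python]: PEP 274 - Dict Comprehensions, and to iterate over the 2 lists at the same time, [Python 3]: zip(*iterables) is used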
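The zip approach above relies on the two node lists having the same length and order. As an alternative, here is a minimal sketch that reads the ID and the title per article, so the two values stay paired. SAMPLE_XML is a hypothetical, trimmed-down stand-in for the efetch response (not the real payload), and titles_by_pmid is a name made up for this illustration.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for the efetch response (assumption:
# the real payload has the same nesting, plus many more elements).
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29150897</PMID>
      <Article>
        <ArticleTitle>Determining best outcomes from community-acquired pneumonia and how to achieve them.</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29149862</PMID>
      <Article>
        <ArticleTitle>Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def titles_by_pmid(xml_bytes):
    """Map each PMID to {'title': ...}, reading both values per article."""
    tree = etree.fromstring(xml_bytes)
    result = {}
    for article in tree.iter("PubmedArticle"):
        # findtext() paths are relative to `article`, so each title is
        # looked up inside the same article as its PMID.
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        result[pmid] = {"title": title}
    return result


if __name__ == "__main__":
    print(titles_by_pmid(SAMPLE_XML))
```

findtext() returns None when a path is missing, so an article without a title would show up with 'title': None instead of shifting every later title by one position.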

Output:

(py35x64_test) c:\Work\Dev\StackOverflow\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32
{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}

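First of all, XML is case-sensitive, and you are using lowercase tags in your XPath.

I also believe that pmid should be a single number (or a string representing one), and in your case it seems to be something different:

In my tests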

`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])` 

yields a string of concatenated numbers, which is not what you are looking for.
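The likely cause is that an XPath starting with // is evaluated from the document root, even when it is called on a single PubmedArticle element, whereas .// stays inside that element. A minimal sketch on a hypothetical, trimmed-down sample (not the live API response) shows the difference:

```python
from lxml import etree

# Hypothetical two-article sample (assumption: same nesting as the real
# efetch response, with everything but the PMIDs left out).
SAMPLE_XML = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID Version="1">29150897</PMID></MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID Version="1">29149862</PMID></MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

tree = etree.fromstring(SAMPLE_XML)
first_article = tree.xpath("//PubmedArticle")[0]

# "//..." ignores the context element and searches the whole document.
absolute = [node.text for node in first_article.xpath("//MedlineCitation/PMID")]

# ".//..." only searches below the context element.
relative = [node.text for node in first_article.xpath(".//MedlineCitation/PMID")]

print("absolute:", absolute)
print("relative:", relative)
```

With the relative form inside the loop, ''.join(...) would only ever see the PMID of the article being processed.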
