Parsing PubMed API XML with lxml, then grabbing children into a dictionary

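I'm trying to relearn Python because my skills have lapsed. I'm currently playing around with the PubMed API. I'm trying to parse the XML file given here, then run a loop over each child ('/PubmedArticle') and grab a few things, for now just the article title, and put them into a dictionary keyed by the PubMed ID (PMID).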

I.e. the output should look like:

{'29150897': {'title': 'Determining best outcomes from community-acquired pneumonia and how to achieve them.'},
'29149862': {'title': 'Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.'}}

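Later I will add more fields, such as authors and journal, but for now I just want to figure out how to use the lxml package to get the data I want into a dictionary. I know there are plenty of packages that would do this for me, but that defeats the purpose of learning. I've tried a lot of different things, and this is what I currently have: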

import requests
from lxml import etree
article_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
page = requests.get(article_url)
tree = etree.fromstring(page.content)
dict_out = {}
for x in tree.xpath('//PubmedArticle'):
    pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])
    title = ''.join([x.text for x in x.xpath('//ArticleTitle')])
    dict_out[pmid] = {'title': title}
print(dict_out)

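I may be misunderstanding how to go about this, but if anyone can offer some insight or point me toward the right resources, it would be greatly appreciated.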

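Edit: My apologies. I wrote this faster than I should have. I've fixed all the casing. Also, the result seems to join the PMIDs together while giving only the first title: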

{'2725403628806902': {'title': 'Handshake Stewardship: A Highly Effective Rounding-based Antimicrobial Optimization Service.Monitoring, documenting and reporting the quality of antibiotic use in the Netherlands: a pilot study to establish a national antimicrobial stewardship registry.'}}

code.py:

#!/usr/bin/env python3
import sys
import requests
from lxml import etree
from pprint import pprint as pp
ARTICLE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"

def main():
    response = requests.get(ARTICLE_URL)
    tree = etree.fromstring(response.content)
    ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
    titles = tree.xpath("//Article/ArticleTitle")
    if len(ids) != len(titles):
        print("ID count doesn't match Title count...")
        return
    result = {_id.text: {"title": title.text} for _id, title in zip(ids, titles)}
    pp(result)

if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

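Notes:

I restructured the code a bit and renamed some variables for clarity
ids holds the list of PMID nodes, while titles holds the (corresponding) ArticleTitle nodes (mind the path!)
The way to join them in the desired format is a dict comprehension ([Python]: PEP 274 - Dict Comprehensions), iterating over the two lists at the same time with [Python 3]: zip(*iterables)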

Output:

(py35x64_test) c:\Work\Dev\StackOverflow\q47433632>"c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32
{'29149862': {'title': 'Telemedicine as an effective intervention to improve '
                       'antibiotic appropriateness prescription and to reduce '
                       'costs in pediatrics.'},
 '29150897': {'title': 'Determining best outcomes from community-acquired '
                       'pneumonia and how to achieve them.'}}
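
The pairing step from the notes can be tried without any network access. The following is a minimal sketch that uses plain lists holding the two IDs and titles from the output above, in place of the parsed nodes. It also shows why the length check in code.py is worth keeping: zip() silently stops at the shorter input, so a missing title would otherwise just drop an article.

```python
# Plain lists standing in for the text of the PMID and ArticleTitle nodes.
ids = ["29150897", "29149862"]
titles = [
    "Determining best outcomes from community-acquired pneumonia "
    "and how to achieve them.",
    "Telemedicine as an effective intervention to improve antibiotic "
    "appropriateness prescription and to reduce costs in pediatrics.",
]

# Dict comprehension over the two lists walked in parallel.
result = {pmid: {"title": title} for pmid, title in zip(ids, titles)}
print(result)

# zip() stops at the shorter input: with one title missing,
# only one pair is produced and no error is raised.
short = dict(zip(ids, titles[:1]))
print(len(short))
```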

First, XML is case-sensitive, and you are using lowercase tags in your XPath.
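
A quick way to see the effect of casing is sketched below. SAMPLE is a hand-written, trimmed-down stand-in for the efetch response (an assumption, not the real payload), and the import falls back to the standard library's ElementTree in case lxml is not installed; the findall() path used here behaves the same in both.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree

# Hypothetical, minimal document shaped like the PubMed response.
SAMPLE = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    '<PMID Version="1">29150897</PMID>'
    "</MedlineCitation></PubmedArticle></PubmedArticleSet>"
)

root = etree.fromstring(SAMPLE)

# Tag names must match exactly, including upper/lower case.
lower = root.findall(".//pubmedarticle")
exact = root.findall(".//PubmedArticle")
print(len(lower), len(exact))
```

The lowercase path matches nothing and raises no error, which is why a casing mistake tends to show up as an empty result rather than an exception.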

I also believe the PMID should be a single number (or a string representing a number), which does not seem to be the case in your result:

In my test,

`pmid = ''.join([x.text for x in x.xpath('//MedlineCitation/PMID[@Version="1"]')])`

produces a string of concatenated digits, which is not what you are looking for.
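
The concatenation comes from the leading `//`: in XPath it always searches from the document root, even when the expression is evaluated on a single PubmedArticle element, so every iteration collects the PMIDs of all articles. Making the path relative to the current article (a leading `.` with xpath(), or a plain relative path with findtext()) keeps each lookup inside that article. Below is a minimal sketch of the per-article loop; SAMPLE is a hand-written stand-in for the real response, titles_by_pmid is a hypothetical helper name, and the import falls back to the standard library's ElementTree if lxml is not available.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down document shaped like the PubMed response.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29150897</PMID>
      <Article><ArticleTitle>Determining best outcomes from community-acquired pneumonia and how to achieve them.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29149862</PMID>
      <Article><ArticleTitle>Telemedicine as an effective intervention to improve antibiotic appropriateness prescription and to reduce costs in pediatrics.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def titles_by_pmid(xml_text):
    """Map each article's PMID to a dict holding its title."""
    root = etree.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        # Both paths are relative to the current article element.
        pmid = article.findtext("MedlineCitation/PMID[@Version='1']")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        result[pmid] = {"title": title}
    return result


parsed = titles_by_pmid(SAMPLE)
print(parsed)
```

Because each PMID and title is read from the same article element, they cannot get out of step, and further per-article fields can be added inside the loop in the same way.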
