Problem reading an XML file in Python



I want to read this XML file with Python on Google Colab, like this:

import xml.etree.ElementTree as ET
tree = ET.parse('drive/MyDrive/pubmed22n1192.xml')

pubmed22n1192.xml is the name of the file.

But I get this error message:

File "<string>", line unknown
ParseError: syntax error: line 1, column 0

Is something wrong with the file? Because the file is large, I am sharing only its first lines:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">14584002</PMID>
<DateCompleted>
<Year>2004</Year>
<Month>05</Month>
<Day>04</Day>
</DateCompleted>
<DateRevised>
<Year>2022</Year>
<Month>02</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Electronic">1469-493X</ISSN>
<JournalIssue CitedMedium="Internet">
<Issue>4</Issue>
<PubDate>
<Year>2003</Year>
</PubDate>
</JournalIssue>
<Title>The Cochrane database of systematic reviews</Title>
<ISOAbbreviation>Cochrane Database Syst Rev</ISOAbbreviation>
</Journal>
<ArticleTitle>Intravenous immunoglobulin for the treatment of Kawasaki disease in children.</ArticleTitle>
<Pagination>
<MedlinePgn>CD004000</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Kawasaki disease is the most common cause of acquired heart disease in children in developed countries. The coronary arteries supplying the heart can be damaged in Kawasaki disease. The principal advantage of timely diagnosis is the potential to prevent this complication with early treatment. Intravenous immunoglobulin (IVIG) is widely used for this purpose.</AbstractText>
<AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">The objective of this review was to evaluate the effectiveness of IVIG in treating, and preventing cardiac consequences, of Kawasaki disease in children.</AbstractText>
<AbstractText Label="SEARCH STRATEGY" NlmCategory="METHODS">Electronic searches of the Cochrane Peripheral Vascular Disease Group Specialised Register, CENTRAL, MEDLINE, EMBASE, and CINAHL were performed (last searched April 2003). We also searched references from relevant articles and contacted authors where necessary. In addition we contacted experts in the field for unpublished works.</AbstractText>
<AbstractText Label="SELECTION CRITERIA" NlmCategory="METHODS">Randomised controlled trials of intravenous immunoglobulin to treat Kawasaki disease were eligible for inclusion.</AbstractText>
<AbstractText Label="DATA COLLECTION AND ANALYSIS" NlmCategory="METHODS">Fifty-nine trials were identified in the initial search. On careful inspection only sixteen of these met all the inclusion criteria. Trials were data extracted and assessed for quality by at least two reviewers. Data were combined for meta-analysis using relative risk ratios for dichotomous data or weighted mean difference for continuous data. A random effects statistical model was used.</AbstractText>
<AbstractText Label="MAIN RESULTS" NlmCategory="RESULTS">The meta-analysis of IVIG versus placebo, including all children, showed a significant decrease in new coronary artery abnormalities (CAAs) in favour of IVIG, at thirty days RR (95% CI) = 0.74 (0.61 to 0.90). No statistically significant difference was found thereafter. A subgroup analysis excluding children with CAAs at enrollment also found a significant reduction of new CAAs in children receiving IVIG RR (95%) = 0.67 (0.46 to 1.00). There was a trend towards benefit from IVIG at sixty days (p=0.06). Results of dose comparisons showed a decrease in the number of new CAAs with increased dose. The meta-analysis of 400 mg/kg/day for five days versus 2 gm/kg in a single dose showed statistically significant reduction in CAAs at thirty days RR (95%) = 4.47 (1.55 to 12.86). This comparison also showed a significant reduction in duration of fever with the higher dose. There was no statistically significant difference noted between different preparations of IVIG. There was no statistically significant difference of adverse effects in any group.</AbstractText>
<AbstractText Label="REVIEWER'S CONCLUSIONS" NlmCategory="CONCLUSIONS">Children fulfilling the diagnostic criteria for Kawasaki disease should be treated with IVIG (2 gm/kg single dose) within 10 days of onset of symptoms.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Oates-Whitehead</LastName>
<ForeName>R M</ForeName>
<Initials>RM</Initials>
<AffiliationInfo>
<Affiliation>Research Division, Royal College of Paediatrics, 50 Hallam Street, London, UK, W1W 6DE.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Baumer</LastName>
<ForeName>J H</ForeName>
<Initials>JH</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Haines</LastName>
<ForeName>L</ForeName>
<Initials>L</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Love</LastName>
<ForeName>S</ForeName>
<Initials>S</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Maconochie</LastName>
<ForeName>I K</ForeName>
<Initials>IK</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Gupta</LastName>
<ForeName>A</ForeName>
<Initials>A</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Roman</LastName>
<ForeName>K</ForeName>
<Initials>K</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Dua</LastName>
<ForeName>J S</ForeName>
<Initials>JS</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Flynn</LastName>
<ForeName>I</ForeName>
<Initials>I</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D017418">Meta-Analysis</PublicationType>
<PublicationType UI="D016454">Review</PublicationType>
<PublicationType UI="D000078182">Systematic Review</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>Cochrane Database Syst Rev</MedlineTA>
<NlmUniqueID>100909747</NlmUniqueID>
<ISSNLinking>1361-6137</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D016756">Immunoglobulins, Intravenous</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D002648" MajorTopicYN="N">Child</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016756" MajorTopicYN="N">Immunoglobulins, Intravenous</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D009080" MajorTopicYN="N">Mucocutaneous Lymph Node Syndrome</DescriptorName>
<QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016032" MajorTopicYN="N">Randomized Controlled Trials as Topic</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<NumberOfReferences>90</NumberOfReferences>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2003</Year>
<Month>10</Month>
<Day>30</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2004</Year>
<Month>5</Month>
<Day>5</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2003</Year>
<Month>10</Month>
<Day>30</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">14584002</ArticleId>
<ArticleId IdType="doi">10.1002/14651858.CD004000</ArticleId>
<ArticleId IdType="pmc">PMC6544780</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>

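The file contains information about a number of articles. This is the first one, so the end of the file is not included. I used an XML extension in VS Code to look for formatting errors, but the file seems fine.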

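It's hard to say without the full file, but when I parse this snippet with xml.etree.ElementTree I get an xml.etree.ElementTree.ParseError: no element found error, which makes me think the XML might be malformed.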
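One caveat: "no element found" is what ElementTree reports when the input ends before the root element is closed, so on the pasted excerpt it may only reflect the truncation rather than a defect in the full file. Here is a minimal sketch of that behaviour. The `truncated` string is a shortened stand-in for the excerpt and `try_parse` is a hypothetical helper, neither comes from the question.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the excerpt in the question: the root element
# <PubmedArticleSet> is opened but never closed.
truncated = (
    "<PubmedArticleSet>\n"
    '<PubmedArticle><PMID Version="1">14584002</PMID></PubmedArticle>\n'
)


def try_parse(text):
    """Hypothetical helper: return (root, None) on success, (None, message) on failure."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as exc:
        return None, str(exc)


# The truncated text fails because the document ends inside the root element.
root, err = try_parse(truncated)
print("truncated:", err)

# Appending the missing closing tag is enough for the same text to parse.
root, err = try_parse(truncated + "</PubmedArticleSet>\n")
print("closed:", root.find("PubmedArticle/PMID").text)
```

If the complete file parses cleanly once it is read in full, the excerpt itself was never the problem.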

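In that case you can use Beautiful Soup, since it is more tolerant of bad XML. In fact, when I tried it, it seemed to return the expected result (note that the "xml" feature requires the lxml package to be installed).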

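import bs4

xml = ...  # the XML text as a string, e.g. the excerpt from the question
# features="xml" selects Beautiful Soup's lxml-based XML parser
soup = bs4.BeautifulSoup(xml, features="xml")
funny_chemical = soup.find("NameOfSubstance").text
print(funny_chemical)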

Returns:

'Immunoglobulins, Intravenous'
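
Separately, the error in the question (syntax error: line 1, column 0) points at the very first byte of the file, which the pasted excerpt cannot reproduce. One possibility worth ruling out is that the file on Drive does not actually start with XML, for example because it is empty or still gzip-compressed. That is an assumption, not something the question confirms. The sketch below uses two hypothetical helpers, `describe_file_start` and `parse_maybe_gzipped`, to check the first bytes before parsing.

```python
import gzip
import xml.etree.ElementTree as ET


def describe_file_start(path, n=64):
    """Hypothetical helper: classify what the first bytes of a file look like."""
    with open(path, "rb") as fh:
        head = fh.read(n)
    if not head:
        return "empty"
    if head.startswith(b"\x1f\x8b"):  # gzip magic number
        return "gzip"
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte-order mark
        head = head[3:]
    if head.lstrip().startswith(b"<"):
        return "xml-like"
    return "other"


def parse_maybe_gzipped(path):
    """Hypothetical helper: parse the file, decompressing first if it is gzip data."""
    if describe_file_start(path) == "gzip":
        with gzip.open(path, "rb") as fh:
            return ET.parse(fh)
    return ET.parse(path)


# Usage with the path from the question (left commented out, since the
# file only exists in the asker's Drive):
# print(describe_file_start('drive/MyDrive/pubmed22n1192.xml'))
# tree = parse_maybe_gzipped('drive/MyDrive/pubmed22n1192.xml')
```

If the helper reports "xml-like" and ET.parse still fails at line 1, column 0, the cause lies elsewhere and this check does not explain it.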
