I am trying to do NLP on a dataset made up of rows like the following:
00001 B 74457
00002 C 12804123 16026213 14627885
00004 A 15329425 9058342 11279767
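where the first element of each row is an identifier, the second is the recommended label (one of only three values: A, B, or C), and the numbers (e.g. 12804123) are the ids of XML files that hold the data, such as text, location, etc. Based on this, I need to extract data from the XML files and use it to build a model. As a first step, I want to pull some of the data out of the XML files and build a DataFrame of structured data. A sample XML file is shown below. When I run pd.read_xml(xml), it gives: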
medlinecitation pubmeddata
0 NaN NaN
Any example, from Kaggle or any other source, that I could follow for this analysis would be appreciated.
74457.xml = '''
<pubmedarticleset>
<pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<datecreated>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecreated>
<datecompleted>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecompleted>
<daterevised>
<year> 2007 </year>
<month> 11 </month>
<day> 15 </day>
</daterevised>
<article pubmodel="Print">
<journal>
<issn issntype="Print"> 0140-6736 </issn>
<journalissue citedmedium="Print">
<volume> 1 </volume>
<issue> 7984 </issue>
<pubdate>
<year> 1976 </year>
<month> Sep </month>
<day> 4 </day>
</pubdate>
</journalissue>
<title> Lancet </title>
<isoabbreviation> Lancet </isoabbreviation>
</journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<pagination>
<medlinepgn> 481-2 </medlinepgn>
</pagination>
<abstract>
<abstracttext>
Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics in a prospective double-blind placebo-controlled trial conducted over one year, but it had no significant effect on non-depressed patients. Patients in the trial treated by placebo had significantly greater alcoholic morbidity if they were depressive than if they were non-depressive.
</abstracttext>
</abstract>
<authorlist completeyn="Y">
<author validyn="Y">
<lastname> Merry </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Reynolds </lastname>
<forename> C M </forename>
<initials> CM </initials>
</author>
<author validyn="Y">
<lastname> Bailey </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Coppen </lastname>
<forename> A </forename>
<initials> A </initials>
</author>
</authorlist>
<language> eng </language>
<publicationtypelist>
<publicationtype> Clinical Trial </publicationtype>
<publicationtype> Comparative Study </publicationtype>
<publicationtype> Journal Article </publicationtype>
<publicationtype> Randomized Controlled Trial </publicationtype>
</publicationtypelist>
</article>
<medlinejournalinfo>
<country> ENGLAND </country>
<medlineta> Lancet </medlineta>
<nlmuniqueid> 2985213R </nlmuniqueid>
<issnlinking> 0140-6736 </issnlinking>
</medlinejournalinfo>
<chemicallist>
<chemical>
<registrynumber> 0 </registrynumber>
<nameofsubstance> Placebos </nameofsubstance>
</chemical>
<chemical>
<registrynumber> 7439-93-2 </registrynumber>
<nameofsubstance> Lithium </nameofsubstance>
</chemical>
</chemicallist>
<citationsubset> AIM </citationsubset>
<citationsubset> IM </citationsubset>
<meshheadinglist>
<meshheading>
<descriptorname majortopicyn="N"> Adult </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcohol Drinking </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcoholism </descriptorname>
<qualifiername majortopicyn="Y"> drug therapy </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Clinical Trials as Topic </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Depression </descriptorname>
<qualifiername majortopicyn="N"> chemically induced </qualifiername>
<qualifiername majortopicyn="Y"> prevention &amp; control </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Double-Blind Method </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Drug Evaluation </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Female </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Humans </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Lithium </descriptorname>
<qualifiername majortopicyn="Y"> therapeutic use </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Male </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Middle Aged </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Placebos </descriptorname>
</meshheading>
</meshheadinglist>
</medlinecitation>
<pubmeddata>
<history>
<pubmedpubdate pubstatus="pubmed">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
</pubmedpubdate>
<pubmedpubdate pubstatus="medline">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 1 </minute>
</pubmedpubdate>
<pubmedpubdate pubstatus="entrez">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 0 </minute>
</pubmedpubdate>
</history>
<publicationstatus> ppublish </publicationstatus>
<articleidlist>
<articleid idtype="pubmed"> 74457 </articleid>
</articleidlist>
</pubmeddata>
</pubmedarticle>
</pubmedarticleset>'''
Please help me understand what is happening. How can I turn this into a DataFrame?
Here is one way to do it:
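On what is happening: `pd.read_xml` parses shallowly. It reads only the attributes and immediate children of the nodes selected by `xpath`, which defaults to `./*` (the children of the root). Here the only match is `<pubmedarticle>`, and its two children contain nested elements rather than text, so both cells come back as NaN. A minimal reproduction, assuming pandas 1.3 or later and using a cut-down stand-in document (my own, not the full file) with the standard-library `etree` parser so that lxml is not required:

```python
from io import StringIO

import pandas as pd

# Cut-down stand-in for 74457.xml (hypothetical, for illustration only).
xml = (
    "<pubmedarticleset><pubmedarticle>"
    '<medlinecitation owner="NLM"><pmid version="1">74457</pmid></medlinecitation>'
    "<pubmeddata><publicationstatus>ppublish</publicationstatus></pubmeddata>"
    "</pubmedarticle></pubmedarticleset>"
)

# Default xpath="./*" selects <pubmedarticle>. Its children become columns,
# but they hold only nested elements, so there is no text to put in the cells.
shallow = pd.read_xml(StringIO(xml), parser="etree")
print(shallow)

# Pointing xpath at a deeper node exposes that node's attributes and children.
citation = pd.read_xml(StringIO(xml), xpath=".//medlinecitation", parser="etree")
print(citation)
```

The first frame has the two all-NaN columns from the question; the second has one row with the `owner` attribute and the `pmid` child as columns.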
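import pandas as pd

# Attributes and direct children of <medlinecitation>. Children that only wrap
# nested elements come back as NaN, and dropna(axis=1) removes those columns.
try:
    medlinecitation = pd.read_xml("74457.xml", xpath=".//medlinecitation").dropna(
        axis=1
    )
except ValueError:
    medlinecitation = pd.DataFrame()

# One row per <pubmedpubdate> node (pubstatus, year, month, day, hour, minute).
try:
    pubmedpubdate = pd.read_xml("74457.xml", xpath=".//pubmedpubdate")
except ValueError:
    pubmedpubdate = pd.DataFrame()

# Align on the row index, then forward-fill the single citation row downwards.
# .ffill() replaces fillna(method="ffill"), which is deprecated since pandas 2.1.
df = pd.merge(
    left=medlinecitation,
    right=pubmedpubdate,
    how="outer",
    left_index=True,
    right_index=True,
).ffill()
print(df)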
# Output
  owner   status     pmid citationsubset pubstatus  year  month  day  hour  minute
0   NLM  MEDLINE  74457.0             IM    pubmed  1976      9    4   NaN     NaN
1   NLM  MEDLINE  74457.0             IM   medline  1976      9    4   0.0     1.0
2   NLM  MEDLINE  74457.0             IM    entrez  1976      9    4   0.0     0.0
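For the NLP step you will probably want the text fields (title, abstract, MeSH terms), which sit deeper than a single `read_xml` call reaches. Below is a minimal sketch using the standard-library `xml.etree.ElementTree`. The helper names `parse_label_line`, `text_of`, and `extract_record`, the choice of fields, and the reading of a label row as "row id, label, one or more article ids" are my assumptions based on the sample in the question, and `sample_xml` is an abbreviated excerpt of the file shown there:

```python
import xml.etree.ElementTree as ET

import pandas as pd


def parse_label_line(line):
    """Split '00002 C 12804123 16026213' into (row id, label, [article ids])."""
    row_id, label, *article_ids = line.split()
    return row_id, label, article_ids


def text_of(node, path):
    """Return the stripped text at `path` under `node`, or None if absent."""
    found = node.find(path)
    return found.text.strip() if found is not None and found.text else None


def extract_record(xml_text):
    """Flatten one <pubmedarticle> into a dict of fields useful for NLP."""
    article = ET.fromstring(xml_text).find(".//pubmedarticle")
    return {
        "pmid": text_of(article, ".//pmid"),
        "title": text_of(article, ".//articletitle"),
        "abstract": text_of(article, ".//abstracttext"),
        "mesh_terms": [
            (d.text or "").strip() for d in article.iter("descriptorname")
        ],
    }


# Abbreviated excerpt of 74457.xml, inlined so the sketch runs on its own.
sample_xml = """<pubmedarticleset><pubmedarticle><medlinecitation>
<pmid version="1"> 74457 </pmid>
<article>
<articletitle> Prophylactic treatment of alcoholism by lithium carbonate. A controlled study. </articletitle>
<abstract><abstracttext> Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics. </abstracttext></abstract>
</article>
<meshheadinglist>
<meshheading><descriptorname majortopicyn="N"> Alcoholism </descriptorname></meshheading>
<meshheading><descriptorname majortopicyn="N"> Lithium </descriptorname></meshheading>
</meshheadinglist>
</medlinecitation></pubmedarticle></pubmedarticleset>"""

row_id, label, article_ids = parse_label_line("00001 B 74457")

# In practice, read each f"{article_id}.xml" from disk instead of sample_xml.
records = [{"row_id": row_id, "label": label, **extract_record(sample_xml)}]
df = pd.DataFrame(records)
print(df[["row_id", "label", "pmid", "title"]])
```

Each label row then becomes one or more records carrying both the label and the article text, which is a convenient shape for feeding a vectorizer or tokenizer later. Rows that list several article ids would need a decision on whether to keep one record per article or to concatenate their text.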