将XML标记解析为pandas数据框架



我的XML文件如下。

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
<schedulePeriod>
<startDate>2022-06-02</startDate>
<endDate>2022-07-01</endDate>
</schedulePeriod>
<crewMembers>
<crewMember>
<id>12345</id>
<firstName>John</firstName>
<middleName>S</middleName>
<lastName>Doe</lastName>
<seniorityNumber>2</seniorityNumber>
<base>DFW</base>
<division>I</division>
<seat>CA</seat>
<flightQualification>
<hitcities/>
<monthlyMax>9983</monthlyMax>
<volMax>0</volMax>
<restricted75HrCities>
<city>
<code>GUC</code>
</city>
<city>
<code>MSO</code>
</city>
</city>
</restricted75HrCities>
<equipmentTypes>
<equipment>
<type>777</type>
</equipment>
</equipmentTypes>
<green>false</green>
<isVaccinated>true</isVaccinated>
</flightQualification>
<bidQualification>
<canBid>true</canBid>
<canBeAwarded>true</canBeAwarded>
</bidQualification>
</crewMember>
<crewMember>
<id>22222</id>
<firstName>JANE</firstName>
<middleName>R</middleName>
<lastName>DOE</lastName>
<seniorityNumber>8</seniorityNumber>
<base>DFW</base>
<division>I</division>
<seat>CA</seat>
<flightQualification>
<hitcities>
<hitcity>
<code>OAX</code>
</hitcity>
<hitcity>
<code>MSO</code>
</hitcity>
<hitcity>
<code>US</code>
</hitcity>
</hitcities>
<monthlyMax>7642</monthlyMax>
<volMax>0</volMax>
<restricted75HrCities/>
<equipmentTypes>
<equipment>
<type>787</type>
</equipment>
</equipmentTypes>
<green>false</green>
<trainerLineCheck>false</trainerLineCheck>
<trainerLineIndoctrination>false</trainerLineIndoctrination>
<isVaccinated>true</isVaccinated>
</flightQualification>
<bidQualification>
<canBid>true</canBid>
<canBeAwarded>true</canBeAwarded>
</bidQualification>
</crewMember>
</crewMembers>
</pbsCrew>

我想解析船员标记并获取id, firstName, lastName从那里和接种flightQualification全部放入一个Pandas数据框架中,如下:

tbody> <<tr>
ID名字姓氏isvaccination
12345约翰
22222

虽然可以使用循环使用elementtree或lxml库来实现,但是可以使用使用样式表属性的pandas read_xml()方法来实现相同的目标。此属性指定用于转换原始xml的XSL样式表。在您的情况下,需要将其扁平化,使所有感兴趣的元素都包含在作为行的父元素中。

示例如下:

xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pbsCrew>
<schedulePeriod>
<startDate>2022-06-02</startDate>
<endDate>2022-07-01</endDate>
</schedulePeriod>
<crewMembers>
<crewMember>
<id>12345</id>
<firstName>John</firstName>
<middleName>S</middleName>
<lastName>Doe</lastName>
<seniorityNumber>2</seniorityNumber>
<base>DFW</base>
<division>I</division>
<seat>CA</seat>
<flightQualification>
<hitcities/>
<monthlyMax>9983</monthlyMax>
<volMax>0</volMax>
<restricted75HrCities>
<city>
<code>GUC</code>
</city>
<city>
<code>MSO</code>
</city>
</restricted75HrCities>
<equipmentTypes>
<equipment>
<type>777</type>
</equipment>
</equipmentTypes>
<green>false</green>
<isVaccinated>true</isVaccinated>
</flightQualification>
<bidQualification>
<canBid>true</canBid>
<canBeAwarded>true</canBeAwarded>
</bidQualification>
</crewMember>
<crewMember>
<id>22222</id>
<firstName>JANE</firstName>
<middleName>R</middleName>
<lastName>DOE</lastName>
<seniorityNumber>8</seniorityNumber>
<base>DFW</base>
<division>I</division>
<seat>CA</seat>
<flightQualification>
<hitcities>
<hitcity>
<code>OAX</code>
</hitcity>
<hitcity>
<code>MSO</code>
</hitcity>
<hitcity>
<code>US</code>
</hitcity>
</hitcities>
<monthlyMax>7642</monthlyMax>
<volMax>0</volMax>
<restricted75HrCities/>
<equipmentTypes>
<equipment>
<type>787</type>
</equipment>
</equipmentTypes>
<green>false</green>
<trainerLineCheck>false</trainerLineCheck>
<trainerLineIndoctrination>false</trainerLineIndoctrination>
<isVaccinated>true</isVaccinated>
</flightQualification>
<bidQualification>
<canBid>true</canBid>
<canBeAwarded>true</canBeAwarded>
</bidQualification>
</crewMember>
</crewMembers>
</pbsCrew>
'''
import pandas as pd
stylesheet = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:template match="/">
<members>
<xsl:apply-templates select="//crewMember"/>
</members>
</xsl:template>
<xsl:template match="//crewMember">
<crewMember>
<ID>
<xsl:value-of select="./id"/>
</ID>
<firstname>
<xsl:value-of select="./firstName"/>
</firstname>
<lastname>
<xsl:value-of select="./lastName"/>
</lastname>
<isVaccinated>
<xsl:value-of select="./flightQualification/isVaccinated"/>
</isVaccinated>
</crewMember>
</xsl:template>
</xsl:stylesheet>'''
df = pd.read_xml(xml, xpath="//crewMember", stylesheet = stylesheet)
print(df)

这里样式表将原始XML转换为以下形式:

<?xml version="1.0" encoding="UTF-8"?>
<members>
<crewMember>
<ID>12345</ID>
<firstname>John</firstname>
<lastname>Doe</lastname>
<isVaccinated>true</isVaccinated>
</crewMember>
<crewMember>
<ID>22222</ID>
<firstname>JANE</firstname>
<lastname>DOE</lastname>
<isVaccinated>true</isVaccinated>
</crewMember>
</members>

这允许我们使用xpath//crewMember",将crewMember元素的子元素作为数据框的行字段。

最新更新