How do I extract multiple children/grandchildren from XML when one child has a specific value?



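I'm working with an XML file that stores every "version" of a chatbot we created. We currently have 18 versions, and I only care about the latest one. I'm trying to find a way to extract all botDialogGroup elements, together with their associated label elements, for "v18" only. There is a one-to-many relationship between 'botDialogGroup' and 'label'.

Below is a snippet of the XML, in which the botDialogGroup is "Transfer" and the label is "Transfer with a question". The bot does not have just this one version; there are 18 in total.

Link to a sample XML file: https://pastebin.com/aaDfBPUm

Also note that fullName is a child of botVersions, while botDialogGroup and label are grandchildren of botVersions whose parent is botDialogs.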

<Bot>
    <botVersions>
        <fullName>v18</fullName>
        <botDialogs>
            <botDialogGroup>Transfer</botDialogGroup>
            <botSteps>
                <botVariableOperation>
                    <askCollectIfSet>false</askCollectIfSet>
                    <botMessages>
                        <message>Would you like to chat with an agent?</message>
                    </botMessages>
                    <botQuickReplyOptions>
                        <literalValue>Yes</literalValue>
                    </botQuickReplyOptions>
                    <botQuickReplyOptions>
                        <literalValue>No</literalValue>
                    </botQuickReplyOptions>
                    <botVariableOperands>
                        <disableAutoFill>true</disableAutoFill>
                        <sourceName>YesOrNoChoices</sourceName>
                        <sourceType>MlSlotClass</sourceType>
                        <targetName>Transfer_To_Agent</targetName>
                        <targetType>ConversationVariable</targetType>
                    </botVariableOperands>
                    <optionalCollect>false</optionalCollect>
                    <quickReplyType>Static</quickReplyType>
                    <quickReplyWidgetType>Buttons</quickReplyWidgetType>
                    <retryMessages>
                        <message>I&apos;m sorry, I didn&apos;t understand that. You have to select an option to proceed.</message>
                    </retryMessages>
                    <type>Collect</type>
                </botVariableOperation>
                <type>VariableOperation</type>
            </botSteps>
            <botSteps>
                <botStepConditions>
                    <leftOperandName>Transfer_To_Agent</leftOperandName>
                    <leftOperandType>ConversationVariable</leftOperandType>
                    <operatorType>Equals</operatorType>
                    <rightOperandValue>No</rightOperandValue>
                </botStepConditions>
                <botSteps>
                    <botVariableOperation>
                        <botVariableOperands>
                            <targetName>Transfer_To_Agent</targetName>
                            <targetType>ConversationVariable</targetType>
                        </botVariableOperands>
                        <type>Unset</type>
                    </botVariableOperation>
                    <type>VariableOperation</type>
                </botSteps>
                <botSteps>
                    <botNavigation>
                        <botNavigationLinks>
                            <targetBotDialog>Main_Menu</targetBotDialog>
                        </botNavigationLinks>
                        <type>Redirect</type>
                    </botNavigation>
                    <type>Navigation</type>
                </botSteps>
                <type>Group</type>
            </botSteps>
            <botSteps>
                <botStepConditions>
                    <leftOperandName>Transfer_To_Agent</leftOperandName>
                    <leftOperandType>ConversationVariable</leftOperandType>
                    <operatorType>Equals</operatorType>
                    <rightOperandValue>Yes</rightOperandValue>
                </botStepConditions>
                <botStepConditions>
                    <leftOperandName>Online_Product</leftOperandName>
                    <leftOperandType>ConversationVariable</leftOperandType>
                    <operatorType>NotEquals</operatorType>
                    <rightOperandValue>OTP</rightOperandValue>
                </botStepConditions>
                <botStepConditions>
                    <leftOperandName>Online_Product</leftOperandName>
                    <leftOperandType>ConversationVariable</leftOperandType>
                    <operatorType>NotEquals</operatorType>
                    <rightOperandValue>TCF</rightOperandValue>
                </botStepConditions>
                <botSteps>
                    <botVariableOperation>
                        <botVariableOperands>
                            <targetName>Transfer_To_Agent</targetName>
                            <targetType>ConversationVariable</targetType>
                        </botVariableOperands>
                        <type>Unset</type>
                    </botVariableOperation>
                    <type>VariableOperation</type>
                </botSteps>
                <botSteps>
                    <botNavigation>
                        <botNavigationLinks>
                            <targetBotDialog>Find_Business_Hours</targetBotDialog>
                        </botNavigationLinks>
                        <type>Call</type>
                    </botNavigation>
                    <type>Navigation</type>
                </botSteps>
                <type>Group</type>
            </botSteps>
            <botSteps>
                <botNavigation>
                    <botNavigationLinks>
                        <targetBotDialog>Direct_Transfer</targetBotDialog>
                    </botNavigationLinks>
                    <type>Redirect</type>
                </botNavigation>
                <type>Navigation</type>
            </botSteps>
            <developerName>Transfer_To_Agent</developerName>
            <label>Transfer with a question</label>
            <mlIntent>Transfer_To_Agent</mlIntent>
            <mlIntentTrainingEnabled>true</mlIntentTrainingEnabled>
            <showInFooterMenu>false</showInFooterMenu>
        </botDialogs>
    </botVersions>
</Bot>
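
The nesting in the snippet above can be double-checked with a few lines of Python. This is only a sketch: SAMPLE is a trimmed-down stand-in for the real file (most elements omitted), and ancestry is a hypothetical helper name, not something ElementTree provides.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the snippet above (most elements omitted).
SAMPLE = """
<Bot>
  <botVersions>
    <fullName>v18</fullName>
    <botDialogs>
      <botDialogGroup>Transfer</botDialogGroup>
      <label>Transfer with a question</label>
    </botDialogs>
  </botVersions>
</Bot>
"""

root = ET.fromstring(SAMPLE)

# ElementTree elements keep no link to their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestry(element):
    """Return the tag names from the element up to the root (hypothetical helper)."""
    chain = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain


paths = {
    tag: ancestry(root.find(f".//{tag}"))
    for tag in ("fullName", "botDialogGroup", "label")
}
for tag, chain in paths.items():
    print(tag, "<-", " <- ".join(chain[1:]))
```

Here fullName sits directly under botVersions, while botDialogGroup and label both hang off botDialogs, which matches the description in the question.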

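Current script

My problem is that it searches the entire tree, all 18 versions, for botDialogGroup and label elements, because I'm using findall(). I only want it to search inside the botVersions whose fullName is the most recent one, in this case "v18".

Typing "v18" in by hand is not a problem, since I always know which version I'm looking for. It is actually useful, because different bots are at different versions.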

import xml.etree.ElementTree as ET
import pandas as pd

cols = ["BotVersion", "DialogGroup", "Dialog"]
rows = []
tree = ET.parse('ChattyBot.xml')
root = tree.getroot()
for fullName in root.findall(".//fullName[.='v18']"):
    for botDialogGroup in root.findall(".//botDialogGroup"):
        for label in root.findall(".//label"):
            print(fullName.text, botDialogGroup.text, label.text)
            rows.append({"BotVersion": fullName.text,
                         "DialogGroup": botDialogGroup.text,
                         "Dialog": label.text})
df = pd.DataFrame(rows, columns=cols)
df.to_csv("botcsvfile.csv")
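
The reason the script above picks up every version is that each inner loop starts again from root, so the fullName matched by the outer loop never narrows the search. A minimal sketch of the difference, using a made-up two-version SAMPLE (the v17 content is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical two-version file, reduced to the elements that matter here.
SAMPLE = """
<Bot>
  <botVersions>
    <fullName>v17</fullName>
    <botDialogs>
      <botDialogGroup>Old_Group</botDialogGroup>
      <label>Old dialog</label>
    </botDialogs>
  </botVersions>
  <botVersions>
    <fullName>v18</fullName>
    <botDialogs>
      <botDialogGroup>Transfer</botDialogGroup>
      <label>Transfer with a question</label>
    </botDialogs>
  </botVersions>
</Bot>
"""

root = ET.fromstring(SAMPLE)

# Searching from root looks at every version, whatever the outer loop matched.
labels_from_root = [e.text for e in root.findall(".//label")]

# Selecting the botVersions element itself and searching from it
# keeps the search inside that one version.
v18 = root.find("./botVersions[fullName='v18']")
labels_from_v18 = [e.text for e in v18.findall(".//label")]

print(labels_from_root)
print(labels_from_v18)
```

The predicate `[fullName='v18']` selects the botVersions parent rather than the fullName child, which is what makes the later relative searches possible.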

The desired end result, saved to a CSV file with pandas:

BotVersion  DialogGroup  Dialog
v18         Transfer     Transfer with a question

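OK, this code assumes your XML follows the pattern version, dialog1, dialog2, dialog3, version2, dialog1, dialog2, etc. If that is not the case, let me know and I will re-evaluate the code. Basically, it walks through the elements and builds a many-to-one relation between dialogs and versions, then sorts the groups by version number. Finally, it flattens the latest group into a nested list to create the pandas DataFrame.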

import xml.etree.ElementTree as ET
import pandas as pd

cols = ["BotVersion", "DialogGroup", "Dialog"]
tree = ET.parse('test.xml')
root = tree.getroot()

pandas_rows = []
for bot_version in root.findall(".//botVersions"):
    children = list(bot_version)
    # Build the many-to-one relation between bot dialogs and versions:
    # each group is [fullName, botDialogs, botDialogs, ...].
    grouping = []
    relations = []
    for i, tag in enumerate(children):
        if i == 0:
            relations.append(tag)
        elif tag.tag == 'fullName':
            # A new fullName starts the next group.
            grouping.append(relations)
            relations = [tag]
        else:
            relations.append(tag)
        # Edge case: close the last group at the end of the list.
        if i == len(children) - 1:
            grouping.append(relations)
    # Sort by version number so the latest version is last. This assumes
    # names like 'v18'; a plain string sort would rank 'v9' above 'v18'.
    grouping.sort(key=lambda group: int(group[0].text.lstrip('v')))
    latest = grouping[-1]
    # Flatten the text into rows for the pandas DataFrame.
    version_number = latest[0].text
    pandas_rows = []
    for r in latest[1:]:
        pandas_row = [version_number]
        for child in r.iter():
            if child.tag in ['botDialogGroup', 'label']:
                pandas_row.append(child.text)
        pandas_rows.append(pandas_row)

df = pd.DataFrame(pandas_rows, columns=cols)
print(df)
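
One thing to watch when picking the latest version by sorting: version names compared as plain strings put "v9" after "v18". A small sketch of the difference, assuming names of the form v<number>; version_key is a hypothetical helper, not part of either library.

```python
import re


def version_key(name):
    """Sort key that compares the digits in a version name as an integer.

    Assumes names look like 'v18'; names without digits sort first.
    """
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


names = ["v1", "v2", "v9", "v10", "v18"]

latest_as_text = sorted(names)[-1]                     # string comparison
latest_as_number = sorted(names, key=version_key)[-1]  # numeric comparison

print(latest_as_text, latest_as_number)
```

With 18 versions in the file, the string comparison would pick v9 as the "latest", so a numeric key (or simply typing the version in by hand, as the question does) is the safer choice.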
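
from lxml import etree
import pandas as pd

bots = """your xml above"""
cols = ["BotVersion", "DialogGroup", "Dialog"]
rows = []
ver = 'v18'
root = etree.XML(bots)
# Relative paths keep the search inside the matching botVersions element;
# a leading // inside the predicate or the inner queries would search the
# whole document again and match every version.
for dialog in root.xpath(f"//botVersions[fullName='{ver}']/botDialogs"):
    rows.append([ver, dialog.findtext('botDialogGroup'), dialog.findtext('label')])
df = pd.DataFrame(rows, columns=cols)
print(df)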

The output should be the DataFrame you expect.
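
For anyone who would rather stay with the standard library, the same scoped search can be sketched with xml.etree.ElementTree alone. This assumes each version is its own botVersions element with one botDialogGroup and one label per botDialogs, as in the sample; SAMPLE and dialog_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

COLS = ["BotVersion", "DialogGroup", "Dialog"]

# Hypothetical file: two versions, the latest with two dialogs in one group.
SAMPLE = """
<Bot>
  <botVersions>
    <fullName>v17</fullName>
    <botDialogs>
      <botDialogGroup>Old_Group</botDialogGroup>
      <label>Old dialog</label>
    </botDialogs>
  </botVersions>
  <botVersions>
    <fullName>v18</fullName>
    <botDialogs>
      <botDialogGroup>Transfer</botDialogGroup>
      <label>Transfer with a question</label>
    </botDialogs>
    <botDialogs>
      <botDialogGroup>Transfer</botDialogGroup>
      <label>Direct transfer</label>
    </botDialogs>
  </botVersions>
</Bot>
"""


def dialog_rows(root, version):
    """Collect one row per botDialogs element inside the requested version."""
    rows = []
    for bot_version in root.findall(f".//botVersions[fullName='{version}']"):
        # No leading // here: only direct botDialogs children of this version.
        for dialog in bot_version.findall("botDialogs"):
            rows.append({
                "BotVersion": version,
                # findtext returns None when the child element is missing.
                "DialogGroup": dialog.findtext("botDialogGroup"),
                "Dialog": dialog.findtext("label"),
            })
    return rows


rows = dialog_rows(ET.fromstring(SAMPLE), "v18")
for row in rows:
    print([row[col] for col in COLS])
```

The resulting list of dicts can be handed to pd.DataFrame(rows, columns=COLS) and written out with to_csv, as in the question's script.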
