Python XML to DataFrame: including the text after a br tag



I am trying to convert an XML file to a DataFrame. The original XML file is: https://www.assemblee-nationale.fr/dyn/opendata/CRSANR5L15S2017E1N001.xml. Here is an excerpt:

<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
<contenu>
<point nivpoint="1" valeur_ptsodj="2" ordinal_prise="1" id_preparation="819547" ordre_absolu_seance="8" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="981344" valeur="">
<orateurs/>
<texte>Déclaration de...</texte>
<paragraphe valeur_ptsodj="2" ordinal_prise="1" id_preparation="819550" ordre_absolu_seance="11" id_acteur="PA345619" id_mandat="-1" id_nomination_oe="PM725692" id_nomination_op="-1" code_grammaire="DEBAT_1_10" code_style="NORMAL" code_parole="PAROLE_1_2" sommaire="1" id_syceron="981347" valeur="">
<orateurs>
<orateur>
<nom>M. President</nom>
</orateur>
</orateurs>
<texte>Today we are...
<exposant>er</exposant>
Prime-minister will 
<br/>
speak.
</texte>
</paragraphe>
</point>
</contenu>
</compteRendu>

My code:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
d = {'contenu': ['nom', 'texte']}
cols, data = list(), list()

# loop through d.items
for k, v in d.items():
    # find child (the `{*}` namespace wildcard needs Python 3.8+)
    child = root.find(f'{{*}}{k}')
    # use iter to check each descendant (`elem`)
    for elem in child.iter():
        # get `tag_end` for each descendant
        tag_end = elem.tag.split('}')[-1]
        # check if `tag_end` in `v(alue)`
        if tag_end in v:
            # add `tag_end` and `elem.text` to appropriate list
            cols.append(tag_end)
            data.append(elem.text)

df = pd.DataFrame(data).T

# Obtain column names: number repeated tags (texte, texte2, texte3, ...)
def f(lst):
    d = {}
    out = []
    for i in lst:
        if i not in d:
            out.append(i)
            d[i] = 2
        else:
            out.append(i + str(d[i]))
            d[i] += 1
    return out

df.columns = f(cols)
df = df.rename(columns={"nom": "nom1"})
df.rename(columns={"texte" + str(i): "texte" + str(i - 1) for i in range(2, 10000)}, inplace=True)
df = df.rename(columns={"texte": "texte0"})
df.drop([col for col in df.columns if col.startswith("nom") and df[col].isnull().all()], axis=1, inplace=True)

What I get:

texte0             nom1          texte1
Déclaration de...  M. President  Today we are...\n

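The second text column is missing the text "Prime-minister will speak." Because of the <br/> and <exposant> tags, only the first line is returned. How should I modify my code?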

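(In the end, I will convert my DataFrame from wide to long, so that I have one column 'nom' and another column 'texte', with each person and their respective text.)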

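You can use a recursive function to collect the text of every element with one of the desired tags, together with the tail of each of its child elements: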

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
tags = ['nom', 'texte']

def get_content_recursively(element, tags, get_tail=False):
    data = list()
    # strip the namespace prefix: '{uri}texte' -> 'texte'
    _, _, tag = element.tag.rpartition('}')
    if tag in tags and element.text and element.text.strip():
        data.append(element.text.strip())
    # children of a wanted element also contribute their tail
    for el in element:
        data += get_content_recursively(el, tags, get_tail=(tag in tags))

    if get_tail and element.tail and element.tail.strip():
        data.append(element.tail.strip())
    return data

df = pd.DataFrame(get_content_recursively(root, tags)).T

Output:

                   0             1                2                    3       4
0  Déclaration de...  M. President  Today we are...  Prime-minister will  speak.

Note that data.append(element.text.strip()) removes whitespace (including newlines) from the result. Drop the strip() call to keep it.
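The reason the original code loses text is the way ElementTree stores mixed content: an element's text attribute holds only the text before its first child, and whatever follows a child's closing tag is stored on that child as tail. The sketch below illustrates this on a hypothetical one-line fragment modelled on the <texte> element from the question (the fragment and the variable names are assumptions for illustration, not part of the original file):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the <texte> element in the question.
snippet = (
    "<texte>Today we are...<exposant>er</exposant>"
    " Prime-minister will <br/>speak.</texte>"
)
texte = ET.fromstring(snippet)

# .text only covers the text before the first child element.
print(repr(texte.text))

# Text that follows a child's closing tag lives in that child's .tail.
for child in texte:
    print(child.tag, repr(child.text), repr(child.tail))

# itertext() yields text and tails in document order, but it also
# includes the children's own text (here the 'er' of <exposant>).
full = "".join(texte.itertext())
print(full)
```

If the content of nested elements such as <exposant> should be kept, itertext() may be the simpler route; the recursive function above skips it because it only reads the tails of the children.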

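Edit: if you want to concatenate all the strings of one element, you can process its text and the tail of each of its child elements in a single loop: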

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
tags = ['nom', 'texte']

def get_content_recursively(element, tags):
    data = []
    # strip the namespace prefix: '{uri}texte' -> 'texte'
    _, _, tag = element.tag.rpartition('}')
    if tag in tags:
        # gather the element's own text plus the tail of each direct child
        tag_str_lst = []
        if element.text and element.text.strip():
            tag_str_lst.append(element.text.strip())
        for el in element:
            if el.tail and el.tail.strip():
                tag_str_lst.append(el.tail.strip())
        data.append(" ".join(tag_str_lst))
    for el in element:
        data += get_content_recursively(el, tags)

    return data

df = pd.DataFrame(get_content_recursively(root, tags)).T

Output:

                   0             1                                           2
0  Déclaration de...  M. President  Today we are... Prime-minister will speak.
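For the long format mentioned in the question (one 'nom' column and one 'texte' column), one possible approach is to build one row per <paragraphe> instead of reshaping the wide frame afterwards. The following is a minimal sketch under stated assumptions: the helper names own_text and paragraphs_to_rows are hypothetical, the XML string is a trimmed stand-in for the real file with attributes omitted, and the {*} namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Trimmed-down stand-in for the file in the question (attributes omitted).
XML = """<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
<contenu>
<point>
<orateurs/>
<texte>Déclaration de...</texte>
<paragraphe>
<orateurs><orateur><nom>M. President</nom></orateur></orateurs>
<texte>Today we are...
<exposant>er</exposant>
Prime-minister will
<br/>
speak.
</texte>
</paragraphe>
</point>
</contenu>
</compteRendu>"""

def own_text(element):
    """Join an element's text with the tails of its direct children."""
    parts = [element.text] + [child.tail for child in element]
    return " ".join(p.strip() for p in parts if p and p.strip())

def paragraphs_to_rows(root):
    """Return one {'nom', 'texte'} dict per <paragraphe> element."""
    rows = []
    for para in root.findall(".//{*}paragraphe"):
        nom = para.find(".//{*}nom")    # speaker name, nested under <orateurs>
        texte = para.find("{*}texte")   # direct child holding the speech
        rows.append({
            "nom": own_text(nom) if nom is not None else None,
            "texte": own_text(texte) if texte is not None else None,
        })
    return rows

root = ET.fromstring(XML)
rows = paragraphs_to_rows(root)
df_long = pd.DataFrame(rows, columns=["nom", "texte"])
print(df_long)
```

Note that this sketch only picks up text inside <paragraphe> elements, so a title such as "Déclaration de...", which sits directly under <point>, is left out; whether that is desirable depends on how titles should be attributed.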
