How to get two lists of tag contents from XML when some values are missing



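I have an XML file with several thousand records, from which I want to extract:

  • City: tag 110, code c (e.g. Berlin)
  • Library code: tag 110, code g (e.g. D-Bbbf)

I want a DataFrame that lists every city next to its library code, but if the library code (code="g") is not present, I want NaN or some other value that indicates there is no value. For example: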

df = {'Cities': ['Berlin', 'London'], 'Codes': ['D-Bbbf', NaN]}

This is part of the XML:

<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a|||              a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>

This is what I tried:

# Import BeautifulSoup
from bs4 import BeautifulSoup
Data = {'Cities': [],
        'Code': []}
# Read the XML file
with open('oefen.xml', 'r', encoding="utf8") as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

for record in soup.find_all(tag="110"):
    find = record.find_all('[code="g"]')
    for code in record:
        if find is not None:
            City = record.select_one('[code="c"]') # select city
            Code = record.select_one('[code="g"]') # select code
            Data['Cities'].append(City.get_text(strip=True))
            Data['Code'].append(Code.get_text(strip=True))
        else:
            print(NaN)
print(Data)

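I don't think these separate lists are necessary; a list of dicts is easier to work with. While iterating over the records, check whether the element you are looking for exists, and append either its text or None: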

data = []  # one dict per record
for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None,  # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None   # select code
    })

Example

xml='''
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a|||              a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
'''
# Import BeautifulSoup and pandas
from bs4 import BeautifulSoup
import pandas as pd

data = []
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(xml, 'lxml')
for record in soup.find_all('marc:record'):
    data.append({
        # The walrus operator (:=) requires Python 3.8 or later
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None,  # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None   # select code
    })
pd.DataFrame(data)

Output

City    Code
Berlin  D-Bbbf
London  None
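
If the two parallel lists from the question are still wanted, they can be derived from the list of dicts afterwards. The following is only a minimal sketch: the `data` literal stands in for the result of the loop above, and the `columns` name and the use of `math.nan` as the missing-value marker are assumptions made for illustration.

```python
import math

# Stand-in for the list of dicts built by the loop above
data = [
    {"City": "Berlin", "Code": "D-Bbbf"},
    {"City": "London", "Code": None},
]

# Build two lists of equal length; a missing code becomes NaN
columns = {
    "Cities": [row["City"] for row in data],
    "Codes": [row["Code"] if row["Code"] is not None else math.nan for row in data],
}

print(columns)
```

Because every record contributes exactly one entry to each list, the cities and codes stay aligned even when a code is missing.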

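The same "text or None" check can also be sketched with the standard library alone, which may be useful when BeautifulSoup is not available. This is not the approach used in the answer above. It assumes that the full file wraps the records in a root element that declares the `marc` prefix with the MARC21 slim namespace; the `marc:collection` wrapper, the shortened sample and the `subfield_text` helper are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the real file declares this namespace on its root element
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Shortened sample; the wrapping root element is assumed, not taken from the question
xml_text = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>"""


def subfield_text(field, code):
    """Return the stripped text of the first subfield with this code, or None."""
    node = field.find(f"marc:subfield[@code='{code}']", NS)
    return node.text.strip() if node is not None and node.text else None


root = ET.fromstring(xml_text)
rows = []
# Visit every datafield with tag 110 and record city and code (None if absent)
for field in root.iterfind(".//marc:datafield[@tag='110']", NS):
    rows.append({"City": subfield_text(field, "c"), "Code": subfield_text(field, "g")})

print(rows)
```

Unlike the lenient HTML parsers, `xml.etree.ElementTree` rejects input with an undeclared namespace prefix or more than one root element, so the fragment shown in the question would need such a wrapper before it could be parsed this way.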