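I have an XML file with several thousand records, from which I want to extract:
- City: tag 110, code c (e.g. Berlin)
- Library code: tag 110, code g (e.g. D-Bbbf)
I want a DataFrame that lists every city next to its library code, but if the library code (code="g") is missing, I want NaN or some other value that indicates there is no value. For example:
df = {'Cities': ['Berlin', 'London'], 'Codes': ['D-Bbbf', NaN]}
This is part of the XML: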
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a||| a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
This is what I tried:
# Import BeautifulSoup
from bs4 import BeautifulSoup
Data = {'Cities': [],
        'Code': []}
# Read the XML file
with open('oefen.xml', 'r', encoding="utf8") as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')
for record in soup.find_all(tag="110"):
    find = record.find_all('[code="g"]')
    for code in record:
        if find is not None:
            City = record.select_one('[code="c"]') # select city
            Code = record.select_one('[code="g"]') # select code
            Data['Cities'].append(City.get_text(strip=True))
            Data['Code'].append(Code.get_text(strip=True))
        else:
            print(NaN)
print(Data)
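I don't think those separate lists are necessary; a list of dicts is easier. While iterating over the records, check whether the element you are looking for exists and append its text, or None if it does not: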
for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None, # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None # select code
    })
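The conditional expression above relies on the := operator, which needs Python 3.8 or later. As a rough sketch, the same check can be moved into a small helper. The name subfield_text and its default parameter are made up for this illustration and are not part of BeautifulSoup. The selector is also narrowed to subfields inside tag 110, as the question asks, and math.nan is passed as the default because the question prefers NaN over None. The shortened XML string and the 'html.parser' choice are assumptions made only to keep the sketch self-contained.

```python
import math

from bs4 import BeautifulSoup

# Shortened sample modelled on the records in the question.
XML = """
<marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
"""


def subfield_text(record, code, default=None):
    """Hypothetical helper: text of the first tag-110 subfield with this code, else default."""
    element = record.select_one(f'[tag="110"] [code="{code}"]')
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup(XML, "html.parser")
rows = [
    {
        "City": subfield_text(record, "c", math.nan),
        "Code": subfield_text(record, "g", math.nan),
    }
    for record in soup.find_all("marc:record")
]
print(rows)
```

Passing rows to pd.DataFrame should then show NaN rather than None for the London record.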
Example
xml='''
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a||| a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
'''
# Import BeautifulSoup and pandas
import pandas as pd
from bs4 import BeautifulSoup

data = []
soup = BeautifulSoup(xml, 'lxml')
# One dict per record; a missing subfield becomes None (the := operator needs Python 3.8+)
for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None, # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None # select code
    })
pd.DataFrame(data)
Output
City | Code |
---|---|
Berlin | D-Bbbf |
London | None |
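If BeautifulSoup is not available, roughly the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is a different technique from the one above, not a variant of it. It assumes the full file has a single root element that declares the marc prefix with the MARC21 slim namespace; the excerpt in the question does not show that root, so the marc:collection wrapper below is an assumption. The names extract_rows and subfield_text are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace binding; the excerpt in the question does not show it.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Shortened sample wrapped in an assumed root element that declares the prefix.
XML = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>"""


def subfield_text(record, code):
    """Text of the first tag-110 subfield with this code, or None if it is absent."""
    path = f"marc:datafield[@tag='110']/marc:subfield[@code='{code}']"
    element = record.find(path, NS)
    if element is None or element.text is None:
        return None
    return element.text.strip()


def extract_rows(xml_text):
    """Build one dict per marc:record, with None for missing subfields."""
    root = ET.fromstring(xml_text)
    return [
        {"City": subfield_text(record, "c"), "Code": subfield_text(record, "g")}
        for record in root.findall(".//marc:record", NS)
    ]


print(extract_rows(XML))
```

The resulting list of dicts has the same shape as data above, so it can be handed to pd.DataFrame in the same way.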