Scraping Wikipedia information (table)

I need to collect the Elenco dei comuni information for each region from Wikipedia. I would like to create an array that lets me associate each comune with its region, for example:

'Abbateggio': 'Pescara' -> Abruzzo

I tried to fetch the information with BeautifulSoup and requests, as follows:

from bs4 import BeautifulSoup as bs
import requests

with requests.Session() as s:  # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia')
    soup = bs(r.text, 'html.parser')
    for ele in soup.find_all('h3')[:6]:
        tx = bs(str(ele), 'html.parser').find('span', attrs={'class': "mw-headline"})
        if tx is not None:
            print(tx['id'])

However, it does not work (it returns an empty list). The markup I see with Google Chrome's Inspect tool is the following:

<span class="mw-headline" id="Elenco_dei_comuni_per_regione">Elenco dei comuni per regione</span> (table)
<a href="/wiki/Comuni_dell%27Abruzzo" title="Comuni dell'Abruzzo">Comuni dell'Abruzzo</a> 

(this field should change for each region)

<table class="wikitable sortable query-tablesortes">

Can you give me some advice on how to achieve this result? Any help or suggestion would be appreciated.
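One possible starting point is a minimal sketch like the one below. It is untested against the live page and assumes the heading id and table class shown in the Inspect snippet above still hold; it locates the section heading by its id and collects the per-region links from the first wikitable that follows it:

import re
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia')
    soup = BeautifulSoup(r.text, 'html.parser')

# Locate the heading by the id seen in Inspect, then take the first
# wikitable after it (assumes the page still uses that structure).
heading = soup.find('span', id='Elenco_dei_comuni_per_regione')
table = heading.find_next('table', class_='wikitable') if heading else None
if table is not None:
    # Region links look like <a title="Comuni dell'Abruzzo" href="...">
    for a in table.find_all('a', title=re.compile(r"^Comuni del")):
        print(a['title'], '->', a['href'])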

EDIT:

Example:

I have a word: comunediabbateggio. This word contains Abbateggio. I would like to know which region can be associated with that city, if it exists. The information from Wikipedia is needed to build a dataset that lets me check the field and associate a commune/city with its region. What I would expect is:

WORD                         REGION/STATE
comunediabbateggio           Pescara
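
For the lookup itself, here is a minimal sketch assuming a hypothetical comune_to_region mapping built from the scraped data:

comune_to_region = {
    # Hypothetical entries; in practice this comes from the scraped tables.
    'Abbateggio': ('Pescara', 'Abruzzo'),
}

def find_region(word):
    # Case-insensitive substring match of each known comune in the word.
    for comune, (province, region) in comune_to_region.items():
        if comune.lower() in word.lower():
            return comune, province, region
    return None

print(find_region('comunediabbateggio'))
# -> ('Abbateggio', 'Pescara', 'Abruzzo')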

I hope this helps. I am sorry if it is not clear. Another example, which may be easier for an English speaker to follow, is the following:

In addition to the Italian link above, you can also consider this page: https://en.wikipedia.org/wiki/List_of_comuni_of_Italy. For each region (Lombardy, Veneto, Sicily, ...) I need to collect the information about the list of communes of the Provinces. If you click the List of Communes of ... link, there is a table listing the comuni, e.g. https://en.wikipedia.org/wiki/List_of_communes_of_the_Province_of_Agrigento.

import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

target = "https://en.wikipedia.org/wiki/List_of_comuni_of_Italy"

def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Region headings appear in the TOC with numbers like "1.1", "1.2", ...
        provinces = [item.find_next("span").text for item in soup.findAll(
            "span", class_="tocnumber", text=re.compile(r"\d[.]\d"))]
        # Heading ids use underscores instead of spaces
        search = [item.replace(
            " ", "_") if " " in item else item for item in provinces]
        nested = []
        for item in search:
            for a in soup.findAll("span", id=item):
                # Link text is e.g. "List of communes of the Province of
                # Agrigento"; keep only the trailing province name.
                goes = [b.text.split("of ")[-1]
                        for b in a.find_next("ul").findAll("a")]
                nested.append(goes)
        dictionary = dict(zip(provinces, nested))
        # url[:24] == "https://en.wikipedia.org", so this rebuilds absolute links
        urls = [f'{url[:24]}{b.get("href")}' for item in search for a in soup.findAll(
            "span", id=item) for b in a.find_next("ul").findAll("a")]
        return urls, dictionary

def parser():
    links, dics = main(target)
    com = []
    for link in tqdm(links):
        try:
            # Second column of the first table holds the comune names;
            # drop the trailing totals row.
            df = pd.read_html(link)[0]
            com.append(df[df.columns[1]].to_list()[:-1])
        except ValueError:
            com.append(["N/A"])
    com = iter(com)
    for x in dics:
        b = dics[x]
        # Pair each province with its scraped list of comuni, in order
        dics[x] = dict(zip(b, com))
    print(dics)

parser()
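
If it runs as intended, the dictionary printed at the end nests region -> province -> list of comuni. The shape below is illustrative only; the actual keys and values depend on the live pages:

# {'Abruzzo': {'Pescara': ['Abbateggio', ...], 'Chieti': [...], ...},
#  'Lombardy': {...},
#  ...}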
