How do I scrape a Wikipedia table whose data is in lists rather than rows?



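I am trying to get data from the Localities table on the Wikipedia page https://en.wikipedia.org/wiki/Districts_of_Warsaw.

I want to collect this data and put it into a DataFrame with two columns, ["District"] and ["Neighbourhood"].

My code so far looks like this: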

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html")
table = soup.find_all('table')[2]
A = []
B = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 2:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
df = pd.DataFrame(A, columns=['Neighbourhood'])
df['District'] = B
print(df)

This gives the following DataFrame:

[image: the resulting DataFrame]

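Of course, the Neighbourhood column is not scraped correctly, because the neighbourhoods are contained in lists, but I don't know how it should be done, so I would be glad of any tips.

Apart from that, I would appreciate any hint as to why the scraping gives me only 10 districts instead of 18.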

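Are you sure you are scraping the right table? As I understand it, you need the second table, the one with the 18 districts and their listed neighbourhoods.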
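One way to check which index to use is to print a short summary of every table before picking one. This is only a sketch: the HTML below is a made-up stand-in for the page, and summarise_tables is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: three tables, only the second one holds lists.
html = """
<table><tr><td>infobox</td></tr></table>
<table>
  <tr><th>Alpha</th><th>Beta</th></tr>
  <tr><td><ul><li>A1</li><li>A2</li></ul></td><td><ul><li>B1</li></ul></td></tr>
</table>
<table><tr><td>x</td><td>y</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def summarise_tables(soup):
    """Return (index, rows, header cells, list items) for every table."""
    summary = []
    for index, table in enumerate(soup.find_all("table")):
        summary.append((
            index,
            len(table.find_all("tr")),
            len(table.find_all("th")),
            len(table.find_all("li")),
        ))
    return summary


table_summary = summarise_tables(soup)
for entry in table_summary:
    print(entry)
```

The table you want should be the one whose header and list-item counts match what you see on the page.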

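Also, I am not sure how you want to arrange the districts and neighbourhoods in the DataFrame, so I have set the districts as columns and the neighbourhoods as rows. You can change that however you like.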

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find_all("table")[1]


def process_list(tr):
    # One list of neighbourhood names per <td> cell in the row
    result = []
    for td in tr.findAll("td"):
        result.append([x.string for x in td.findAll("li")])
    return result


districts = []
neighbourhoods = []
for row in table.findAll("tr"):
    if row.find("ul"):
        # Rows containing lists hold the neighbourhoods
        neighbourhoods.extend(process_list(row))
    else:
        # The remaining rows hold the district names in <th> cells
        districts.extend([x.string.strip() for x in row.findAll("th")])

# Check and arrange as you wish
for i in range(len(districts)):
    print(f'District {districts[i]} has neighbourhoods: {", ".join(neighbourhoods[i])}')

# One column per district, one row per neighbourhood
df = pd.DataFrame()
for i in range(len(districts)):
    df[districts[i]] = pd.Series(neighbourhoods[i])
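If you would rather have the two-column layout from the question, one option is to build a frame from the two lists and let pandas expand the inner lists with DataFrame.explode. A minimal sketch, assuming districts and neighbourhoods have the shape produced above; the values here are made-up placeholders, not the scraped data.

```python
import pandas as pd

# Placeholder values standing in for the scraped lists.
districts = ["Alpha", "Beta"]
neighbourhoods = [["A1", "A2"], ["B1"]]

# One row per district with a list-valued column, then one row per list element.
long_df = (
    pd.DataFrame({"District": districts, "Neighbourhood": neighbourhoods})
    .explode("Neighbourhood")
    .reset_index(drop=True)
)
print(long_df)
```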

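Some tips:

  • Use element.string to get the text from an element
  • Use string.strip() to remove leading characters (at the beginning) and trailing characters (at the end), i.e. to clean up the text; whitespace is what gets removed by default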
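A small illustration of both tips, with one caveat worth knowing: .string returns None when an element has more than one child, in which case get_text() is the usual fallback. The HTML snippet is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented snippet: the first item is plain text, the second mixes a link and text.
html = "<ul><li>  Plain  </li><li><a href='#'>Linked</a> (note)</li></ul>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("li")

plain = first.string.strip()                   # single text child, so .string works
mixed = second.string                          # several children, so .string is None
mixed_text = second.get_text(" ", strip=True)  # joins all the text pieces instead

print(repr(plain), repr(mixed), repr(mixed_text))
```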

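You can use the fact that the odd rows are Districts and the even rows are Neighbourhoods: walk the odd rows, and use findNext to get the neighbourhoods from the row below while iterating over the District columns in each odd row: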

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from itertools import zip_longest

soup = bs(requests.get('https://en.wikipedia.org/wiki/Districts_of_Warsaw').content, 'lxml')
# Isolate the table of interest. Newer soupsieve releases prefer :-soup-contains() over :contains().
table = soup.select_one('h2:contains("Localities") ~ .wikitable')
results = []
for row in table.select('tr')[0::2]:  # walk the odd rows
    for i in row.select('th'):  # walk the districts
        # Zip the current district to the list of neighbourhoods in the row below.
        # fillvalue repeats the district name so both lists end up the same length.
        r = list(zip_longest([i.text.strip()], [i.text for i in row.findNext('tr').select('li')], fillvalue=i.text.strip()))
        results.append(r)

results = [i for j in results for i in j]  # flatten list of lists
df = pd.DataFrame(results, columns=['District', 'Neighbourhood'])
print(df)
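Note that the loop above pairs every th in a header row with all of the li items in the row below. If a header row holds several districts, it may be safer to pair each th with the td in the same position. A sketch of that variant, assuming the alternating header/list layout described above; the HTML is a made-up stand-in, not the real table.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table with the same alternating layout: header row, then list row.
html = """<table class="wikitable">
<tr><th>Alpha</th><th>Beta</th></tr>
<tr><td><ul><li>A1</li><li>A2</li></ul></td><td><ul><li>B1</li></ul></td></tr>
<tr><th>Gamma</th></tr>
<tr><td><ul><li>G1</li></ul></td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

records = []
# Pair each header row with the list row directly below it.
for header_row, list_row in zip(rows[0::2], rows[1::2]):
    # Pair each district cell with the list cell in the same column.
    for th, td in zip(header_row.find_all("th"), list_row.find_all("td")):
        for li in td.find_all("li"):
            records.append((th.get_text(strip=True), li.get_text(strip=True)))

paired_df = pd.DataFrame(records, columns=["District", "Neighbourhood"])
print(paired_df)
```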
