我需要从 https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 中刮一个表并将此数据存储在 Python 数据帧中。我已经拉了表格,但无法选择列(邮政编码、自治市镇、邻里(
我的表如下所示:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
...
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
columns = row.find_all('td')
Postcode = row.columns[1].get_text()
Borough = row.columns[2].get_text()
Neighbourhood = row.column[3].get_text()
df.append([Postcode,Borough,Neighbourhood])
使用上面的代码,我得到了类型错误:"NoneType"对象不可下标
我用谷歌搜索了一下,知道我做不到邮政编码 = 行列[1].get_text((因为函数的内联属性。
我也尝试了其他方法,但收到一些"索引错误消息"。
很简单。我需要遍历该行并继续为每行选择三列并将其存储在列表中。但是我无法用代码编写它。
预期输出为
Postcode Borough Neighbourhood
M1A Not assigned Not assigned
M2A Not assigned Not assigned
M3A North York Parkwoods
以下部分中的抓取代码错误。
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
columns = row.find_all('td') # the first row is returning <th> tags, but since you queried <td> tags, it's returning empty list.
if len(columns)>0: #In order to skip first row or in general, empty rows, you need to put an if check.
#Use the indices properly to get different values.
Postcode = columns[0].get_text()
Borough =columns[1].get_text()
Neighbourhood = columns[2].get_text()
df.append([Postcode,Borough,Neighbourhood])
然后,请注意,使用 get_text 也会完整地返回链接和锚标签。您可能希望更改代码以避免这种情况。快乐的网络抓取:)
认识熊猫,但我使用这个脚本来抓取表。希望对您有所帮助。
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
tbl= soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
"head": [th.text.strip() for th in tbl.find_all('th')],
"rows": [
[td.text.strip() for td in tr.find_all("td")]
for tr in tbl.find_all("tr")
if not tr.find("th")
]
}
如果你想从网络上抓取表格,你可以使用熊猫库。
import pandas as pd
url = 'valid_url'
df = pd.read_html(url)
print(df[0].head())