Trying to extract a Wikipedia list from https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes using BeautifulSoup.
Here is my code:
import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page)
table = soup.find('table', class_="wikitable sortable")  # The class of the list in wikipedia

Data = [[] for _ in range(9)]  # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 9:  # The start and end don't include a <td> tag
        for i in range(9):
            Data[i].append(cells[i].find(text=True))
This works fine except for a single value in the Name column: the hurricane "New England". Here is the HTML containing that element:
<td><span data-sort-value="New England !"> <a href="/wiki/1938_New_England_hurricane" title="1938 New England hurricane">"New England"</a></span></td>
The entry extracted for that hurricane's name is ' ', and I think the whitespace between the <span> and the <a> is what causes the problem. Is there a way to work around this in .find? Is there a smarter way to access lists on Wikipedia, and how can I avoid this kind of issue in the future?
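For reference, here is a minimal reproduction using just that snippet (the href is shortened to a placeholder):

from bs4 import BeautifulSoup

# The whitespace between <span> and <a> is itself a text node, so
# find(text=True) returns it before ever reaching the linked name.
html = '<td><span data-sort-value="New England !"> <a href="#">"New England"</a></span></td>'
cell = BeautifulSoup(html, 'html.parser').find('td')
print(repr(cell.find(text=True)))  # prints ' ' instead of '"New England"'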
The simplest way to read the table into a DataFrame is read_html():
import pandas as pd

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
pd.read_html(wiki)[1]
Output:
                         Name  Dates as a Category 5 Duration as a Category 5 Sustained wind speeds              Pressure                                     Areas affected  Deaths Damage (USD)  Refs
0                      "Cuba"       October 19, 1924                 12 hours    165 mph (270 km/h)  910 hPa (26.87 inHg)  Central America, Mexico, Cuba, Florida, The B...      90          NaN  [12]
1  "San Felipe II Okeechobee"  September 13–14, 1928                 12 hours    160 mph (260 km/h)  929 hPa (27.43 inHg)  Lesser Antilles, The Bahamas, United States E...    4000          NaN   NaN
…
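If the table's position on the page ever changes, read_html can also pick the table out by its HTML attributes rather than by index; a sketch using the attrs parameter:

import pandas as pd

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
# Match the table by its class attribute instead of relying on [1], so the
# code keeps working if Wikipedia adds or reorders tables on the page.
df = pd.read_html(wiki, attrs={"class": "wikitable sortable"})[0]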
To improve your example, you could do the following:
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = requests.get(wiki).content
soup = BeautifulSoup(page, 'lxml')
table = soup.find('table', class_="wikitable sortable")  # The class of the list in wikipedia

data = []
for row in table.select('tr')[1:-1]:
    cells = []
    for cell in row.select('td'):
        cells.append(cell.get_text('', strip=True))
    data.append(cells)
get_text('', strip=True) gets the text from each td and strips leading/trailing whitespace.
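From there, one way to finish the DataFrame step (a sketch that assumes the header row has one <th> per column and that every sliced row has the same number of <td> cells, which holds for this particular wikitable):

import pandas as pd

# Column names come from the header row's <th> cells; this assumes a
# simple one-<th>-per-column header with no colspans.
header = [th.get_text(' ', strip=True) for th in table.select('tr')[0].select('th')]
df = pd.DataFrame(data, columns=header)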
This will normalize the text and hopefully give you what you're looking for:
import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')

# The class of the list in wikipedia
table = soup.find('table', class_="wikitable sortable")

Data = [[] for _ in range(9)]  # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 9:  # The start and end don't include a <td> tag
        for i, cell in enumerate(cells):
            Data[i].append(cell.text.strip().replace('"', ''))
print(Data)
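Since Data holds one list per column, the final DataFrame step could look like this (a small sketch; pandas builds rows from the outer list, so transpose):

import pandas as pd

# Data is a list of nine column lists; DataFrame(Data) would make each
# list a row, so transpose to turn each inner list into a column instead.
df = pd.DataFrame(Data).T
print(df.head())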