Scraping table data from a Wikipedia page



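I'm learning how to use the BeautifulSoup library with Python, and for practice I'm trying to scrape the genre titles from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_jazz_genres

This is as far as I've gotten with my code: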

from bs4 import BeautifulSoup
html = open("wiki-jazz.html", encoding="utf-8")
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[1]
td = table.find_all("td")
print(td)

The table at index 1 contains the data I want to access. More specifically, I really only need the data in this title attribute:

</td>, <td><a href="/wiki/West_Coast_jazz" title="West Coast jazz">West Coast jazz</a>

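I've been racking my brain over how to extract this information. I've looked at other posts here but haven't quite been able to get there. Any help is much appreciated.

To print the first column of the table, you can iterate over the rows (<tr>) and then get all the cells (<td>) of each row. The first cell of each row is your jazz genre: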

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_jazz_genres'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.find_all("table")[1]
for row in table.find_all('tr')[1:]:    # <-- [1:] because we don't want the header
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells[0])

Prints:

Acid jazz
Afro-Cuban jazz
Avant-garde jazz
Bebop
Bossa nova
British dance band
Cape jazz
Chamber jazz
Continental jazz
Cool jazz
Crossover jazz
Dark jazz/Doomjazz[1][2][3]
Dixieland
Electro Swing
Ethio jazz
Ethno jazz
European free jazz
Free funk
Free jazz
Frevo
Gypsy jazz
Hard bop
Hot club
Indo jazz
Jazz blues
Jazz-funk
Jazz fusion
Jazz rap
Jazz rock
Kansas City blues
Kansas City jazz
Latin jazz
M-Base
Mainstream jazz
Modal jazz
Neo-bop jazz
Neo-swing
Neo-bop jazz
Novelty ragtime
Nu jazz
Orchestral jazz
Post-bop
Punk jazz
Ragtime
Ska jazz
Smooth jazz
Soul jazz
Straight-ahead jazz
Stride jazz
Swing
Third stream
Trad jazz
Vocal jazz
West Coast jazz

You should read the BeautifulSoup documentation to learn how to get attributes such as href, src, etc. from tags.

Here you can use:

item[1].get('title')
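
To tie this back to the question, here is a minimal sketch of reading the title attribute from the link in the first cell of each row. It runs against a small inline HTML sample rather than the live page, so SAMPLE_HTML, genre_titles and table_index are hypothetical names made up for illustration, and the real Wikipedia markup may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the genre table; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Genre</th><th>Era</th></tr>
  <tr><td><a href="/wiki/Acid_jazz" title="Acid jazz">Acid jazz</a></td><td>1980s</td></tr>
  <tr><td><a href="/wiki/Dark_jazz" title="Dark jazz">Dark jazz</a><sup>[1]</sup></td><td>1990s</td></tr>
  <tr><td>No link here</td><td>n/a</td></tr>
</table>
"""

def genre_titles(html, table_index=0):
    """Collect the title attribute of the first linked cell in each data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[table_index]
    titles = []
    for row in table.find_all("tr")[1:]:   # skip the header row
        first_cell = row.find("td")
        if first_cell is None:             # row without data cells
            continue
        # title=True matches only <a> tags that actually have a title attribute
        link = first_cell.find("a", title=True)
        if link is not None:
            titles.append(link.get("title"))
    return titles

print(genre_titles(SAMPLE_HTML))
```

For the page in the question you would presumably pass the downloaded HTML and table_index=1. Reading the attribute instead of the cell text also sidesteps footnote markers such as the [1][2][3] in the output above, since those sit outside the link.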
