Error when scraping a website with BeautifulSoup



I am trying to scrape some song lyrics from Genius. I wrote the following function:

import requests
from bs4 import BeautifulSoup

def get_song_lyrics(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    lyrics = soup.find("div", attrs={'class': 'lyrics'}).find("p").get_text()
    return [i for i in lyrics.splitlines()]

I don't understand why this call

get_song_lyrics('https://genius.com/Kanye-west-black-skinhead-lyrics')

returns:

AttributeError: 'NoneType' object has no attribute 'find'

while this one:

get_song_lyrics('https://genius.com/Kanye-west-hold-my-liquor-lyrics')

correctly returns the song's lyrics. Both pages have the same layout. Can anyone help me figure this out?

The page is served in two different versions of HTML. You can use this script to handle both of them:

import requests
from bs4 import BeautifulSoup

url = 'https://genius.com/Kanye-west-black-skinhead-lyrics'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
# Match the lyrics container in either version of the page.
for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
    # Unwrap <i> tags and merge adjacent strings so italic text stays on one line.
    for i in tag.select('i'):
        i.unwrap()
    tag.smooth()
    t = tag.get_text(strip=True, separator='\n')
    if t:
        print(t)

Prints:

[Produced By Daft Punk & Kanye West]
[Verse 1]
For my theme song (Black)
My leather black jeans on (Black)
My by-any-means on
...and so on.
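
The same idea can be wrapped in a function that takes HTML and returns a list of lines, as the function in the question does. This is only a minimal sketch: the names `extract_lyrics` and `LYRICS_SELECTOR` are made up for illustration, it uses the built-in "html.parser" so that lxml is not required, and it leaves out the `<i>` unwrapping step from the script above for brevity. It assumes the two selectors from that script still match the page.

```python
from bs4 import BeautifulSoup

# Selector taken from the script above; it covers both page versions.
LYRICS_SELECTOR = 'div[class^="Lyrics__Container"], .song_body-lyrics p'


def extract_lyrics(html):
    """Return the lyric lines found in html, or [] if neither layout matches."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.select(LYRICS_SELECTOR):
        # separator="\n" puts each text fragment on its own line.
        text = tag.get_text(separator="\n", strip=True)
        lines.extend(line for line in text.splitlines() if line)
    return lines
```

Because it returns an empty list instead of calling `.find` on `None`, a page in an unexpected layout no longer raises AttributeError. It could be called as `extract_lyrics(requests.get(url).text)`.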

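I'm not sure what causes it, but it looks like BeautifulSoup sometimes succeeds and sometimes fails, and this has nothing to do with your code. When the call fails, one workaround is to run the function again: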

import requests
from bs4 import BeautifulSoup

def get_song_lyrics(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    try:
        lyrics = soup.find("div", attrs={'class': 'lyrics'}).find("p").get_text()
        return [i for i in lyrics.splitlines()]
    except AttributeError:
        # Retry on failure; note that this recurses without a limit.
        return get_song_lyrics(link)

get_song_lyrics('https://genius.com/Kanye-west-black-skinhead-lyrics')
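
One caveat with retrying this way is that the recursion has no limit, so a page that never returns the expected layout would eventually raise RecursionError. A bounded retry might look like the sketch below. The helper name `retry` and its parameters (`attempts`, `delay`, `exceptions`) are hypothetical and not part of requests or BeautifulSoup.

```python
import time


def retry(func, attempts=3, delay=0.0, exceptions=(AttributeError,)):
    """Call func() up to `attempts` times and re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)  # brief pause before trying again
```

It would be used with the version of `get_song_lyrics` from the question (the one without its own try/except), for example `retry(lambda: get_song_lyrics(url), attempts=5, delay=1.0)`.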
