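I'm new to web scraping, and I'm scraping a Billboard page that compiles the top 10 summer songs for each year from 1958 to 2021. My main goal is to end up with a dictionary that has the year as the key and a list of that year's 10 songs as the value.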
{"1958": ["NEL BLU DIPINTO DI BLU (VOLARÉ)", ...], "1959": ["LONELY BOY", ...]}
What I have so far is a list with one entry per year, where each entry is a multi-line string like this:
1958Rank, Title, Artist
1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno
2, POOR LITTLE FOOL, Ricky Nelson
3, PATRICIA, Perez Prado And His Orchestra
4, LITTLE STAR, The Elegants
5, MY TRUE LOVE, Jack Scott
6, JUST A DREAM, Jimmy Clanton And His Rockets
7, WHEN, Kalin Twins
8, BIRD DOG, The Everly Brothers
9, SPLISH SPLASH, Bobby Darin
10, REBEL-‘ROUSER, Duane Eddy His Twangy Guitar And The Rebels
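Is there a way to extract the song titles and add them to a separate list? I think it could be done either by checking whether a substring is entirely uppercase, since the song titles are all in capitals, or by taking the substring between the two commas, since each title sits between the comma after its rank and the comma that follows the title.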
The link to the Billboard page is here: https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/
No regex is needed. To get the expected output, select only the <p> elements that contain a <strong>, then iterate over their text nodes with [s.split(', ')[1] for s in p.find_all(text=True)[2:]]:
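from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/'
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
doc = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []
# Each year is a <p> that contains a <strong> holding the year
for p in doc.select('.pmc-paywall p:has(strong)'):
    # Text nodes: [0] is the year, [1] the "Rank, Title, Artist" header, [2:] the ten rows
    data.append({
        p.strong.text: [s.split(', ')[1] for s in p.find_all(text=True)[2:]]
    })
print(data)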
Output:
[{'1958': ['NEL BLU DIPINTO DI BLU (VOLARÉ)', 'POOR LITTLE FOOL', 'PATRICIA', 'LITTLE STAR', 'MY TRUE LOVE', 'JUST A DREAM', 'WHEN', 'BIRD DOG', 'SPLISH SPLASH', 'REBEL-‘ROUSER']}, {'1959': ['LONELY BOY', 'THE BATTLE OF NEW ORLEANS', 'A BIG HUNK O’ LOVE', 'MY HEART IS AN OPEN BOOK', 'THE THREE BELLS', 'PERSONALITY', 'THERE GOES MY BABY', 'LAVENDER-BLUE', 'WATERLOO', 'TIGER']}, {'1960': ['I’M SORRY', 'IT’S NOW OR NEVER', 'EVERYBODY’S SOMEBODY’S FOOL', 'ALLEY-OOP', 'ITSY BITSY TEENIE WEENIE YELLOW POLKADOT BIKINI', 'ONLY THE LONELY (KNOW HOW I FEEL)', 'WALK — DON’T RUN', 'CATHY’S CLOWN', 'MULE SKINNER BLUES', 'BECAUSE THEY’RE YOUNG']},...]
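The output above is a list of one-key dicts, while the question asks for a single dict keyed by year. A minimal sketch of merging them, assuming `data` has the shape printed above (shortened here to two titles per year; `songs_by_year` is just an illustrative name):

```python
# Shape of `data` as printed above, shortened to two titles per year
data = [
    {'1958': ['NEL BLU DIPINTO DI BLU (VOLARÉ)', 'POOR LITTLE FOOL']},
    {'1959': ['LONELY BOY', 'THE BATTLE OF NEW ORLEANS']},
]

# Flatten the list of one-key dicts into one dict keyed by year
songs_by_year = {year: titles for d in data for year, titles in d.items()}
print(songs_by_year['1959'])
```

If the same year appeared twice, the later entry would overwrite the earlier one, which should not matter here since each year has its own paragraph.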
One way to get more structured data (including rank and artist) would be:
...
data = []
for p in doc.select('.pmc-paywall p:has(strong)'):
    texts = p.find_all(text=True)
    # The header row "Rank, Title, Artist" supplies the dict keys
    keys = texts[1].strip().split(', ')
    for s in texts[2:]:
        row = dict(zip(keys, s.strip().split(', ')))
        row['year'] = p.strong.text
        data.append(row)
pd.DataFrame(data)
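If the per-year multi-line strings from the question are already in hand, the same split-based idea should work without scraping again. This is only a sketch: it assumes each string starts with a four-digit year fused to the header line (as in "1958Rank, Title, Artist") and that no title contains ", ". The function name `titles_by_year` is hypothetical.

```python
def titles_by_year(blocks):
    """Map each year to its list of titles, given one multi-line string per year."""
    result = {}
    for block in blocks:
        # Drop blank lines and surrounding whitespace
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Assumption: the first line looks like "1958Rank, Title, Artist"
        year = lines[0][:4]
        # Each remaining line is "rank, TITLE, artist"; the title is field 1
        result[year] = [ln.split(', ', 2)[1] for ln in lines[1:]]
    return result


sample = [
    "1958Rank, Title, Artist\n"
    "1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno\n"
    "2, POOR LITTLE FOOL, Ricky Nelson\n"
    "3, PATRICIA, Perez Prado And His Orchestra\n"
]
print(titles_by_year(sample))
```

Splitting with a limit of 2 keeps an artist name that itself contains ", " in one piece, but a title containing ", " would still be cut short.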