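I am new to the world of web scraping, so I am looking for some guidance on a problem I have been struggling with for a few hours.
I am trying to loop through a structure that looks like a table (although it is not an actual table) and use findAll to bring back all the details of a certain tag.
The challenge I face is that every cell in the "table" has the same class name, "final-leaderboard__content", so I am left with one huge list. I would like to iterate over it and retrieve the details so that I can create a CSV/Excel file containing them. This is the code below: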
from bs4 import BeautifulSoup
import requests
TournamentURL = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/"
TournamentResponse = requests.get(TournamentURL)
TournamentData = TournamentResponse.text
TournamentSoup = BeautifulSoup(TournamentData, 'html.parser')
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
for RowContent in RowContents:
    # Print the text of each cell, one cell per line
    print(RowContent.text.strip())
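The result looks like this. Without any explicit labels or ids, I cannot find the best way to know that items 0, 8, 16, etc. are player names, items 1, 9, 17, etc. are the finish, and so on: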
[0] - Name
[1] - Finish
[2] - R1
[3] - R2
[4] - R3
[5] - R4
[6] - Total
[7] - Par
[8] - Name (The second Name)
[9] - Finish (The second Finish)
etc
etc
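I have tried slicing, modulo, and various other variations of the same, but I cannot seem to work it out.
You can take advantage of the fact that this is effectively tabular data: get all the cell divs, split them into chunks of one row each (eight columns), and there is your data: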
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
page = requests.get(url).content
leaderboard = BeautifulSoup(page, "html.parser").find_all("div", {"class": "final-leaderboard__content"})
column_count = 8
split_by_columns = [
    leaderboard[i:i+column_count] for i in range(0, len(leaderboard), column_count)
]
table = [[i.getText(strip=True) for i in row] for row in split_by_columns]
print(tabulate(table[1:], headers=table[0]))
Output:
Name                             Finish    R1    R2  R3    R4      Total  Par
-----------------------------  --------  ----  ----  ----  ----  -------  -----
Jamie ANDERSONChampion Golfer         1    84    85  -     -         169  M/C
Andrew KIRKALDY                       2    86    86  -     -         172  M/C
Jamie ALLAN                           2    88    84  -     -         172  M/C
George PAXTON                         4    89    85  -     -         174  M/C
Tom KIDD                              5    87    88  -     -         175  M/C
Bob FERGUSON                          6    89    87  -     -         176  M/C
J.O.F. MORRIS                         7    92    87  -     -         179  M/C
Jack KIRKALDY                         8    92    89  -     -         181  M/C
James RENNIE                          8    93    88  -     -         181  M/C
Willie FERNIE                         8    92    89  -     -         181  M/C
David AYTON                          11    95    89  -     -         184  M/C
Henry LAMB                           11    91    93  -     -         184  M/C
Tom ARUNDEL                          11    95    89  -     -         184  M/C
Tom MORRIS SR                        14    92    93  -     -         185  M/C
William DOLEMAN                      14    91    94  -     -         185  M/C
Robert KINSMAN                       14    88    97  -     -         185  M/C
Bob MARTIN                           17    93    93  -     -         186  M/C
Ben SAYERS                           18    92    95  -     -         187  M/C
David ANDERSON SR                    19    94    94  -     -         188  M/C
David CORSTORPHINE                   20    93    96  -     -         189  M/C
Tom DUNN                             20    90    99  -     -         189  M/C
Peter PAXTON                         20    99    90  -     -         189  M/C
[A] SMITH                            20    94    95  -     -         189  M/C
D. GRANT                             20    95    94  -     -         189  M/C
Bob DOW                              20    95    94  -     -         189  M/C
Walter GOURLAY                       20    92    97  -     -         189  M/C
A.W. SMITH                           27    91    99  -     -         190  M/C
Douglas Argyll ROBERTSON             27    97    93  -     -         190  M/C
Robert ARMIT                         29    95    96  -     -         191  M/C
George STRATH                        29    97    94  -     -         191  M/C
J.H. BLACKWELL                       31    96    96  -     -         192  M/C
Tom MANZIE                           32    96    97  -     -         193  M/C
George LOWE                          33    94   100  -     -         194  M/C
G. HONEYMAN                          33    97    97  -     -         194  M/C
James FENTON                         35    99    97  -     -         196  M/C
Robert TAIT                          35    99    97  -     -         196  M/C
Bob KIRK                             37    99    98  -     -         197  M/C
Rev. D. LUNDIE                       37    98    99  -     -         197  M/C
Fitz BOOTHBY                         39    96   102  -     -         198  M/C
J. Thomson WHITE                     40   102    99  -     -         201  M/C
James KIRK                           41   105    97  -     -         202  M/C
W.H. GOFF                            42   105    99  -     -         204  M/C
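Since the goal in the question is a CSV file, the chunked rows could be passed straight to the standard csv module. This is only a minimal sketch, assuming the cell texts have already been extracted into a flat list; `chunk_rows` is a hypothetical helper name, and the sample values are copied from the output above.

```python
import csv
import io


def chunk_rows(cells, column_count=8):
    """Split a flat list of cell texts into rows of column_count cells."""
    return [cells[i:i + column_count] for i in range(0, len(cells), column_count)]


# Flat list of cell texts, header cells first (sample data, not scraped live).
cells = [
    "Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par",
    "Andrew KIRKALDY", "2", "86", "86", "-", "-", "172", "M/C",
    "Jamie ALLAN", "2", "88", "84", "-", "-", "172", "M/C",
]
rows = chunk_rows(cells)

# An in-memory buffer keeps the sketch self-contained; for a real file,
# open("leaderboard.csv", "w", newline="") could be used instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The fixed column count of 8 is an assumption about the page layout; if the site adds or removes a column, the chunks will silently shift.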
import requests
from bs4 import BeautifulSoup
def parse_row(row):
    # Yield the cleaned text of every cell in one leaderboard row
    for div in row.find_all("div", {"class": "final-leaderboard__content"}):
        yield div.text.strip().replace('\n', ' ')
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find("div", {"class": "final-leaderboard__table"})
rows = table.find_all('div', {'class':"final-leaderboard__row"})
header = list(parse_row(rows[0]))
for row in rows[1:]:
    print(dict(zip(header, list(parse_row(row)))))
Output:
{'Name': 'Jamie ANDERSON Champion Golfer', 'Finish': '1', 'R1': '84', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '169', 'Par': 'M/C'}
{'Name': 'Andrew KIRKALDY', 'Finish': '2', 'R1': '86', 'R2': '86', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'Jamie ALLAN', 'Finish': '2', 'R1': '88', 'R2': '84', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'George PAXTON', 'Finish': '4', 'R1': '89', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '174', 'Par': 'M/C'}
{'Name': 'Tom KIDD', 'Finish': '5', 'R1': '87', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '175', 'Par': 'M/C'}
{'Name': 'Bob FERGUSON', 'Finish': '6', 'R1': '89', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '176', 'Par': 'M/C'}
{'Name': 'J.O.F. MORRIS', 'Finish': '7', 'R1': '92', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '179', 'Par': 'M/C'}
{'Name': 'Jack KIRKALDY', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'James RENNIE', 'Finish': '8', 'R1': '93', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'Willie FERNIE', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'David AYTON', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Henry LAMB', 'Finish': '11', 'R1': '91', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom ARUNDEL', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom MORRIS SR', 'Finish': '14', 'R1': '92', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'William DOLEMAN', 'Finish': '14', 'R1': '91', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Robert KINSMAN', 'Finish': '14', 'R1': '88', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Bob MARTIN', 'Finish': '17', 'R1': '93', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '186', 'Par': 'M/C'}
{'Name': 'Ben SAYERS', 'Finish': '18', 'R1': '92', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '187', 'Par': 'M/C'}
{'Name': 'David ANDERSON SR', 'Finish': '19', 'R1': '94', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '188', 'Par': 'M/C'}
{'Name': 'David CORSTORPHINE', 'Finish': '20', 'R1': '93', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Tom DUNN', 'Finish': '20', 'R1': '90', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Peter PAXTON', 'Finish': '20', 'R1': '99', 'R2': '90', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': '[A] SMITH', 'Finish': '20', 'R1': '94', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'D. GRANT', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Bob DOW', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Walter GOURLAY', 'Finish': '20', 'R1': '92', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'A.W. SMITH', 'Finish': '27', 'R1': '91', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Douglas Argyll ROBERTSON', 'Finish': '27', 'R1': '97', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Robert ARMIT', 'Finish': '29', 'R1': '95', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'George STRATH', 'Finish': '29', 'R1': '97', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'J.H. BLACKWELL', 'Finish': '31', 'R1': '96', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '192', 'Par': 'M/C'}
{'Name': 'Tom MANZIE', 'Finish': '32', 'R1': '96', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '193', 'Par': 'M/C'}
{'Name': 'George LOWE', 'Finish': '33', 'R1': '94', 'R2': '100', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'G. HONEYMAN', 'Finish': '33', 'R1': '97', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'James FENTON', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Robert TAIT', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Bob KIRK', 'Finish': '37', 'R1': '99', 'R2': '98', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Rev. D. LUNDIE', 'Finish': '37', 'R1': '98', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Fitz BOOTHBY', 'Finish': '39', 'R1': '96', 'R2': '102', 'R3': '-', 'R4': '-', 'Total': '198', 'Par': 'M/C'}
{'Name': 'J. Thomson WHITE', 'Finish': '40', 'R1': '102', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '201', 'Par': 'M/C'}
{'Name': 'James KIRK', 'Finish': '41', 'R1': '105', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '202', 'Par': 'M/C'}
{'Name': 'W.H. GOFF', 'Finish': '42', 'R1': '105', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '204', 'Par': 'M/C'}
Of course, you can use other data structures, such as namedtuple, instead of dict.
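As a rough illustration of the namedtuple variant, the sketch below builds the tuple type from the header cells. It assumes the header texts are valid Python identifiers (they are for this leaderboard); `Result` and `sample_rows` are hypothetical names, and the rows are copied from the output above rather than scraped.

```python
from collections import namedtuple

# Header cells as they appear in the first leaderboard row.
header = ["Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par"]

# "Result" is a hypothetical type name; its fields come from the header.
Result = namedtuple("Result", header)

# Sample rows standing in for list(parse_row(row)) on the real page.
sample_rows = [
    ["Andrew KIRKALDY", "2", "86", "86", "-", "-", "172", "M/C"],
    ["Jamie ALLAN", "2", "88", "84", "-", "-", "172", "M/C"],
]
results = [Result(*row) for row in sample_rows]

for result in results:
    # Fields are available by attribute instead of by dictionary key.
    print(result.Name, result.Total)
```

Compared with dict, this gives attribute access and immutability, but it breaks if a header ever contains a space or other character that is not allowed in an identifier.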
Another approach is to create a dictionary by enumerating over your RowContents, using the enumerated index modulo 8 (i % 8) as the key and the cell text as the value:
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
d={}
for i, RowContent in enumerate(RowContents):
    key = i % 8
    d.setdefault(key, []).append(' '.join(RowContent.text.strip().split()))
>>> d
{
0: ['Name','Jamie ANDERSON Champion Golfer','Andrew KIRKALDY','Jamie ALLAN',....]
1: ['Finish','1','2','2','4',....]
2: ['R1','84','86','88','89','87',....]
.......
7: ['Par','M/C','M/C','M/C','M/C','M/C',.....]
}
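The column-keyed dictionary can also be turned back into per-player records with plain Python by transposing it with zip. This is a small sketch under the assumption that `d` has the shape shown above (keys 0 to 7, header text as the first item of each list); the sample values are a shortened copy of that output.

```python
# Shortened stand-in for the dictionary built by the loop above.
d = {
    0: ["Name", "Andrew KIRKALDY", "Jamie ALLAN"],
    1: ["Finish", "2", "2"],
    2: ["R1", "86", "88"],
    3: ["R2", "86", "84"],
    4: ["R3", "-", "-"],
    5: ["R4", "-", "-"],
    6: ["Total", "172", "172"],
    7: ["Par", "M/C", "M/C"],
}

# Order the columns by key, then transpose columns into rows.
columns = [d[key] for key in sorted(d)]
header, *rows = zip(*columns)

# One dictionary per player, keyed by the header texts.
records = [dict(zip(header, row)) for row in rows]
for record in records:
    print(record)
```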
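If you can use pandas (imported as pd), build a DataFrame from the dictionary, promote the first row to column headers, and drop that row:
df = pd.DataFrame(d)
df = df.rename(columns=df.iloc[0]).drop(df.index[0])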
>>> print(df)
                             Name  Finish  R1  R2  R3  R4  Total  Par
1  Jamie ANDERSON Champion Golfer       1  84  85   -   -    169  M/C
2                 Andrew KIRKALDY       2  86  86   -   -    172  M/C
3                     Jamie ALLAN       2  88  84   -   -    172  M/C
4                   George PAXTON       4  89  85   -   -    174  M/C
5                        Tom KIDD       5  87  88   -   -    175  M/C
6                    Bob FERGUSON       6  89  87   -   -    176  M/C
7                   J.O.F. MORRIS       7  92  87   -   -    179  M/C
To save the DataFrame to a CSV file, use DataFrame.to_csv():
df.to_csv('yourfile.csv', index=False)