Python - BeautifulSoup - iterate over findAll results by specific element in the list



I'm new to the world of web scraping, so I'm looking for some guidance on a problem I've been struggling with for a few hours.

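I'm trying to loop over a structure that looks like a table (although it isn't an actual table) and use findAll to bring back all the details of a certain tag.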

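The challenge I'm facing is that everything in the "table" has the same class name, "final-leaderboard__content", so I'm left with one giant list. I want to iterate over it and retrieve the details so that I can create a CSV/Excel file containing them. Here is the code: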


from bs4 import BeautifulSoup
import requests
TournamentURL = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/"
TournamentResponse = requests.get(TournamentURL)
TournamentData = TournamentResponse.text
TournamentSoup = BeautifulSoup(TournamentData, 'html.parser')
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
for RowContent in RowContents:
    print(RowContent.text.strip())  # assumed loop body: print the text of each cell

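The result looks like this. Without any distinct tags/ids, I can't work out the best way to know that items 0, 8, 16, etc. are the player name, items 1, 9, 17, etc. are the finish, and so on: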

[0] - Name
[1] - Finish
[2] - R1
[3] - R2
[4] - R3
[5] - R4
[6] - Total
[7] - Par
[8] - Name (The second Name)
[9] - Finish (The second Finish) 
etc
etc

I have tried slicing, modulo, and various other variations of the same, but can't seem to solve it.

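You can take advantage of the fact that this is effectively tabular data: grab all the divs that make up the rows, split the flat list into chunks of one row each by column count, and you have your data: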

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
page = requests.get(url).content
leaderboard = BeautifulSoup(page, "html.parser").find_all("div", {"class": "final-leaderboard__content"})
column_count = 8
split_by_columns = [
    leaderboard[i:i+column_count] for i in range(0, len(leaderboard), column_count)
]
table = [[i.getText(strip=True) for i in row] for row in split_by_columns]
print(tabulate(table[1:], headers=table[0]))

Output:

Name                             Finish    R1    R2  R3    R4      Total  Par
-----------------------------  --------  ----  ----  ----  ----  -------  -----
Jamie ANDERSONChampion Golfer         1    84    85  -     -         169  M/C
Andrew KIRKALDY                       2    86    86  -     -         172  M/C
Jamie ALLAN                           2    88    84  -     -         172  M/C
George  PAXTON                        4    89    85  -     -         174  M/C
Tom KIDD                              5    87    88  -     -         175  M/C
Bob FERGUSON                          6    89    87  -     -         176  M/C
J.O.F. MORRIS                         7    92    87  -     -         179  M/C
Jack KIRKALDY                         8    92    89  -     -         181  M/C
James RENNIE                          8    93    88  -     -         181  M/C
Willie FERNIE                         8    92    89  -     -         181  M/C
David AYTON                          11    95    89  -     -         184  M/C
Henry LAMB                           11    91    93  -     -         184  M/C
Tom ARUNDEL                          11    95    89  -     -         184  M/C
Tom MORRIS SR                        14    92    93  -     -         185  M/C
William DOLEMAN                      14    91    94  -     -         185  M/C
Robert KINSMAN                       14    88    97  -     -         185  M/C
Bob MARTIN                           17    93    93  -     -         186  M/C
Ben SAYERS                           18    92    95  -     -         187  M/C
David ANDERSON SR                    19    94    94  -     -         188  M/C
David CORSTORPHINE                   20    93    96  -     -         189  M/C
Tom DUNN                             20    90    99  -     -         189  M/C
Peter PAXTON                         20    99    90  -     -         189  M/C
[A] SMITH                            20    94    95  -     -         189  M/C
D. GRANT                             20    95    94  -     -         189  M/C
Bob DOW                              20    95    94  -     -         189  M/C
Walter GOURLAY                       20    92    97  -     -         189  M/C
A.W. SMITH                           27    91    99  -     -         190  M/C
Douglas Argyll ROBERTSON             27    97    93  -     -         190  M/C
Robert ARMIT                         29    95    96  -     -         191  M/C
George  STRATH                       29    97    94  -     -         191  M/C
J.H. BLACKWELL                       31    96    96  -     -         192  M/C
Tom MANZIE                           32    96    97  -     -         193  M/C
George LOWE                          33    94   100  -     -         194  M/C
G. HONEYMAN                          33    97    97  -     -         194  M/C
James FENTON                         35    99    97  -     -         196  M/C
Robert TAIT                          35    99    97  -     -         196  M/C
Bob KIRK                             37    99    98  -     -         197  M/C
Rev. D. LUNDIE                       37    98    99  -     -         197  M/C
Fitz BOOTHBY                         39    96   102  -     -         198  M/C
J. Thomson WHITE                     40   102    99  -     -         201  M/C
James KIRK                           41   105    97  -     -         202  M/C
W.H. GOFF                            42   105    99  -     -         204  M/C
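
Since the goal in the question is a CSV file, the same chunking can also feed Python's built-in csv module instead of tabulate. The following is a minimal sketch, assuming the cell texts have already been extracted into a flat list. The `cells` sample, the `chunk` helper, and the `leaderboard.csv` filename mentioned in the comment are illustrative names, not part of the answer above; the values are copied from the output shown.

```python
import csv
import io

COLUMN_COUNT = 8  # Name, Finish, R1, R2, R3, R4, Total, Par


def chunk(items, size):
    """Split a flat list into consecutive sublists of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Stand-in for [div.get_text(strip=True) for div in leaderboard];
# values copied from the first rows of the output above.
cells = [
    "Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par",
    "Andrew KIRKALDY", "2", "86", "86", "-", "-", "172", "M/C",
    "Jamie ALLAN", "2", "88", "84", "-", "-", "172", "M/C",
]

rows = chunk(cells, COLUMN_COUNT)

# Write to an in-memory buffer here; to write a real file, use
# open("leaderboard.csv", "w", newline="") instead (hypothetical filename).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

This only works while every row has exactly eight cells; if the page ever adds or drops a column, the chunks will silently shift, so checking `len(cells) % COLUMN_COUNT == 0` first may be worthwhile.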

import requests
from bs4 import BeautifulSoup

def parse_row(row):
    # Yield the text of each cell in a row, with newlines collapsed to spaces
    for div in row.find_all("div", {"class": "final-leaderboard__content"}):
        yield div.text.strip().replace('\n', ' ')

url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find("div", {"class": "final-leaderboard__table"})
rows = table.find_all('div', {'class': "final-leaderboard__row"})
header = list(parse_row(rows[0]))
for row in rows[1:]:
    print(dict(zip(header, list(parse_row(row)))))

Output:

{'Name': 'Jamie ANDERSON     Champion Golfer', 'Finish': '1', 'R1': '84', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '169', 'Par': 'M/C'}
{'Name': 'Andrew KIRKALDY', 'Finish': '2', 'R1': '86', 'R2': '86', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'Jamie ALLAN', 'Finish': '2', 'R1': '88', 'R2': '84', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'George  PAXTON', 'Finish': '4', 'R1': '89', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '174', 'Par': 'M/C'}
{'Name': 'Tom KIDD', 'Finish': '5', 'R1': '87', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '175', 'Par': 'M/C'}
{'Name': 'Bob FERGUSON', 'Finish': '6', 'R1': '89', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '176', 'Par': 'M/C'}
{'Name': 'J.O.F. MORRIS', 'Finish': '7', 'R1': '92', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '179', 'Par': 'M/C'}
{'Name': 'Jack KIRKALDY', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'James RENNIE', 'Finish': '8', 'R1': '93', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'Willie FERNIE', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'David AYTON', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Henry LAMB', 'Finish': '11', 'R1': '91', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom ARUNDEL', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom MORRIS SR', 'Finish': '14', 'R1': '92', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'William DOLEMAN', 'Finish': '14', 'R1': '91', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Robert KINSMAN', 'Finish': '14', 'R1': '88', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Bob MARTIN', 'Finish': '17', 'R1': '93', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '186', 'Par': 'M/C'}
{'Name': 'Ben SAYERS', 'Finish': '18', 'R1': '92', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '187', 'Par': 'M/C'}
{'Name': 'David ANDERSON SR', 'Finish': '19', 'R1': '94', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '188', 'Par': 'M/C'}
{'Name': 'David CORSTORPHINE', 'Finish': '20', 'R1': '93', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Tom DUNN', 'Finish': '20', 'R1': '90', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Peter PAXTON', 'Finish': '20', 'R1': '99', 'R2': '90', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': '[A] SMITH', 'Finish': '20', 'R1': '94', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'D. GRANT', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Bob DOW', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Walter GOURLAY', 'Finish': '20', 'R1': '92', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'A.W. SMITH', 'Finish': '27', 'R1': '91', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Douglas Argyll ROBERTSON', 'Finish': '27', 'R1': '97', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Robert ARMIT', 'Finish': '29', 'R1': '95', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'George  STRATH', 'Finish': '29', 'R1': '97', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'J.H. BLACKWELL', 'Finish': '31', 'R1': '96', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '192', 'Par': 'M/C'}
{'Name': 'Tom MANZIE', 'Finish': '32', 'R1': '96', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '193', 'Par': 'M/C'}
{'Name': 'George LOWE', 'Finish': '33', 'R1': '94', 'R2': '100', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'G. HONEYMAN', 'Finish': '33', 'R1': '97', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'James FENTON', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Robert TAIT', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Bob KIRK', 'Finish': '37', 'R1': '99', 'R2': '98', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Rev. D. LUNDIE', 'Finish': '37', 'R1': '98', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Fitz BOOTHBY', 'Finish': '39', 'R1': '96', 'R2': '102', 'R3': '-', 'R4': '-', 'Total': '198', 'Par': 'M/C'}
{'Name': 'J. Thomson WHITE', 'Finish': '40', 'R1': '102', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '201', 'Par': 'M/C'}
{'Name': 'James KIRK', 'Finish': '41', 'R1': '105', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '202', 'Par': 'M/C'}
{'Name': 'W.H. GOFF', 'Finish': '42', 'R1': '105', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '204', 'Par': 'M/C'}

Of course, you can use another data structure, such as a namedtuple, instead of a dict.
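
As a rough illustration of the namedtuple variant, here is a small sketch. It assumes the header and rows have already been parsed into lists of strings as above; `LeaderboardRow`, `raw_rows`, and `players` are made-up names for this example, and the sample values are copied from the output shown.

```python
from collections import namedtuple

# Header and rows in the shape parse_row() produces;
# values copied from the output above.
header = ["Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par"]
raw_rows = [
    ["Andrew KIRKALDY", "2", "86", "86", "-", "-", "172", "M/C"],
    ["Tom KIDD", "5", "87", "88", "-", "-", "175", "M/C"],
]

# Lowercase the header so fields read like ordinary attributes
# ("Name" -> "name", "R1" -> "r1").
LeaderboardRow = namedtuple("LeaderboardRow", [h.lower() for h in header])

players = [LeaderboardRow(*row) for row in raw_rows]

for player in players:
    print(player.name, player.total)
```

Compared with a dict, each row is immutable and its fields are accessed as attributes (`player.finish`), while `player._asdict()` still gives a mapping when one is needed.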

Another way is to create a dictionary by enumerating over your RowContents, using the enumerated index modulo 8 (i % 8) as the key and the text as the value:

RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
d = {}
for i, RowContent in enumerate(RowContents):
    key = i % 8  # column index: 0 = Name, 1 = Finish, ..., 7 = Par
    # collapse runs of whitespace inside the cell text to single spaces
    d.setdefault(key, []).append(' '.join(RowContent.text.strip().split()))
>>> d
{
0: ['Name','Jamie ANDERSON Champion Golfer','Andrew KIRKALDY','Jamie ALLAN',....]
1: ['Finish','1','2','2','4',....]
2: ['R1','84','86','88','89','87',....]
.......
7: ['Par','M/C','M/C','M/C','M/C','M/C',.....]
}
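
The modulo gives the column of each item; integer division of the same index gives its row. If one record per player is more convenient than one list per column, `divmod` returns both at once. This is only a sketch of that idea: the `cells` list is sample data copied from the output above, standing in for the cleaned text of each RowContent, and `records` is an illustrative name.

```python
COLUMN_COUNT = 8

# Stand-in for the cleaned text of every RowContent, in page order.
cells = [
    "Name", "Finish", "R1", "R2", "R3", "R4", "Total", "Par",
    "Andrew KIRKALDY", "2", "86", "86", "-", "-", "172", "M/C",
    "Jamie ALLAN", "2", "88", "84", "-", "-", "172", "M/C",
]

header = cells[:COLUMN_COUNT]
records = []
for i, text in enumerate(cells[COLUMN_COUNT:]):
    row, col = divmod(i, COLUMN_COUNT)  # row = i // 8, col = i % 8
    if col == 0:
        records.append({})  # first cell of a new row starts a new record
    records[row][header[col]] = text

print(records[0]["Name"], records[0]["Total"])
```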

If you can use pandas:

import pandas as pd
df = pd.DataFrame(d)
# promote the first row to column names, then drop it from the data
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
>>> print(df)
                              Name Finish   R1   R2 R3 R4 Total  Par
1   Jamie ANDERSON Champion Golfer      1   84   85  -  -   169  M/C
2                  Andrew KIRKALDY      2   86   86  -  -   172  M/C
3                      Jamie ALLAN      2   88   84  -  -   172  M/C
4                    George PAXTON      4   89   85  -  -   174  M/C
5                         Tom KIDD      5   87   88  -  -   175  M/C
6                     Bob FERGUSON      6   89   87  -  -   176  M/C
7                    J.O.F. MORRIS      7   92   87  -  -   179  M/C

To save the DataFrame to CSV, use pandas' DataFrame.to_csv():

df.to_csv('yourfile.csv', index=False)
