Appending the information contained in the th tags to the td rows



I am an economist struggling with coding and data scraping. I am scraping data from the main and only table on this web page (https://www.oddsportal.com/basketball/europe/euroleague-2013-2014/results/). By referring to the class elements I can retrieve all of the information in the td HTML tags with Python Selenium. The same goes for the th tag, which stores the match date and stage. In my final dataset I would like the information contained in the th tag (date and stage of the game) to sit next to the other fields of each match row. Basically, for every match I want the date and the stage to be part of that match's row, rather than acting as a header for each group of matches. The only solution I came up with is to index all the rows (both th and td tags) and build a while loop that appends the info of a th tag to every td row whose index is lower than the index of the next th tag. I hope I made myself clear (if not, I will try to give a more graphical explanation). However, given my poor coding skills, I cannot write such a logic structure: I don't know whether I need two loops to iterate over the different tags (td and th), nor how to do it. If you have a simpler solution, it is more than welcome! Thanks in advance for your precious help!
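Roughly, the logic I have in mind is something like this (just a sketch of the idea, not working code; it assumes I add "date" and "stage" lists to my data dictionary):

current_date, current_stage = None, None
for row in driver.find_elements_by_tag_name("tr"):           # walk every table row in order
    header = row.find_elements_by_class_name("first2.tl")    # th header row: "date - stage"
    if header:
        parts = header[0].text.strip().split(" - ")
        if len(parts) == 2:
            current_date, current_stage = parts               # remember until the next header
    elif row.find_elements_by_class_name("table-time"):       # td match row: attach the header info
        data["date"].append(current_date)
        data["stage"].append(current_stage)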

My full code so far:

from selenium import webdriver
import time
import pandas as pd

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019']

# Define empty data
data_keys = ["Season", "Match_Time", "Home_Team", "Away_Team", "Home_Odd", "Away_Odd", "Home_Score",
             "Away_Score", "OT", "N_Bookmakers"]
data = dict()
for key in data_keys:
    data[key] = list()
del data_keys

# Define 'driver' variable and launch browser
#path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
#path office pc
path = "C:/Users/aldi/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)

# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1

        # Get url and navigate it
        page_str = (1 - len(str(page_num))) * '0' + str(page_num)
        url = "https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)

        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break

        try:
            # Teams
            for el in driver.find_elements_by_class_name('name.table-participant'):
                el = el.text.strip().split(" - ")
                data["Home_Team"].append(el[0])
                data["Away_Team"].append(el[1])
                data["Season"].append(season_filt)

            # Scores
            for el in driver.find_elements_by_class_name('center.bold.table-odds.table-score'):
                el = el.text.split(":")
                if el[1][-3:] == " OT":
                    data["OT"].append(True)
                    el[1] = el[1][:-3]
                else:
                    data["OT"].append(False)
                data["Home_Score"].append(el[0])
                data["Away_Score"].append(el[1])

            # Match times
            for el in driver.find_elements_by_class_name("table-time"):
                data["Match_Time"].append(el.text)

            # Odds
            i = 0
            for el in driver.find_elements_by_class_name("odds-nowrp"):
                i += 1
                if i % 2 == 0:
                    data["Away_Odd"].append(el.text)
                else:
                    data["Home_Odd"].append(el.text)

            # N_Bookmakers
            for el in driver.find_elements_by_class_name("center.info-value"):
                data["N_Bookmakers"].append(el.text)

            # TODO think of inserting the dates list in the dataframe even if it has a different size (19 rows and not 50)
        except:
            pass

driver.quit()
data = pd.DataFrame(data)
data.to_csv("data_odds.csv", index=False)

This is the information I would like to add to my dataset as two extra columns:

for el in driver.find_elements_by_class_name("first2.tl")[1:]:
    el = el.text.strip().split(" - ")
    data["date"].append(el[0])
    data["stage"].append(el[1])

There are a few things I would change here.

  1. Don't overwrite your variables. You store the element in the el variable and then overwrite it with a string. It may work for you here, but it can get you into trouble later on, especially as you iterate over those elements, and it makes debugging harder.

  2. I know Selenium has methods to parse the html, but I personally find BeautifulSoup a little easier and more intuitive if you just want to pull data out of the html. So I used BeautifulSoup's .find_previous() to get the th tag that sits above each game row, which essentially gives you your date and stage content (a toy example appears just before the full script below).

  3. Lastly, I like to construct a list of dictionaries to make up the dataframe. Each item in the list is a dictionary of key:value pairs, where the key is the column name and the value is the data. You sort of do the opposite by building a dictionary of lists. There's nothing wrong with that, but if the lists don't all have the same length you'll get an error when you try to create the dataframe. With the list-of-dictionaries approach, if a value is missing for whatever reason, the dataframe is still created and simply holds a null/NaN for the missing data (see the short example right after this list).
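To make that last point concrete, here's a tiny sketch with made-up values (not scraper output) showing the difference:

import pandas as pd

# Dict of lists: every column must have the same length, otherwise pandas raises
try:
    pd.DataFrame({"Home_Team": ["Real Madrid", "Barcelona"], "Date": ["18 May 2014"]})
except ValueError as err:
    print("dict of lists failed:", err)

# List of dicts: a missing key simply becomes NaN in that row
rows = [
    {"Home_Team": "Real Madrid", "Date": "18 May 2014"},
    {"Home_Team": "Barcelona"},   # no Date for this row
]
print(pd.DataFrame(rows))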

You may need to do some more work on the code that loops through the pages, but this gives you the data in the form you want.
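And here is the core of the .find_previous() idea from point 2 in isolation, on a made-up table that mimics the oddsportal layout (the HTML below is invented purely for illustration):

from bs4 import BeautifulSoup

# Made-up HTML: a th header row ("date - stage") followed by two game rows
html = """
<table id="tournamentTable">
  <tr><th class="first2 tl">18 May 2014 - Final Four</th></tr>
  <tr class="deactivate"><td class="name table-participant">Real Madrid - Maccabi Tel Aviv</td></tr>
  <tr class="deactivate"><td class="name table-participant">Barcelona - CSKA Moscow</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr', {'class': 'deactivate'}):
    # .find_previous() walks backwards through the document to the nearest matching th
    date_stage = tr.find_previous('th', {'class': 'first2 tl'}).text.split(' - ')
    print(date_stage, '->', tr.td.text)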

Code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup
import re

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019']

# Define 'driver' variable and launch browser
path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(path)

rows = []
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1

        # Get url and navigate it
        page_str = (1 - len(str(page_num))) * '0' + str(page_num)
        url = "https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)

        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break

        try:
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            table = soup.find('table', {'id': 'tournamentTable'})

            # Each game sits in a <tr> whose class contains "deactivate"
            trs = table.find_all('tr', {'class': re.compile('.*deactivate.*')})
            for each in trs:
                teams = each.find('td', {'class': 'name table-participant'}).text.split(' - ')
                scores = each.find('td', {'class': re.compile('.*table-score.*')}).text.split(':')
                ot = False
                for score in scores:
                    if 'OT' in score:
                        ot = True
                scores = [x.replace('\xa0OT', '') for x in scores]
                matchTime = each.find('td', {'class': re.compile('.*table-time.*')}).text

                # Odds
                i = 0
                for each_odd in each.find_all('td', {'class': "odds-nowrp"}):
                    i += 1
                    if i % 2 == 0:
                        away_odd = each_odd.text
                    else:
                        home_odd = each_odd.text

                n_bookmakers = soup.find('td', {'class': 'center info-value'}).text

                # The th header directly above this group of games holds "date - stage"
                date_stage = each.find_previous('th', {'class': 'first2 tl'}).text.split(' - ')
                date = date_stage[0]
                stage = date_stage[1]

                row = {'Season': season_filt,
                       'Home_Team': teams[0],
                       'Away_Team': teams[1],
                       'Home_Score': scores[0],
                       'Away_Score': scores[1],
                       'OT': ot,
                       'Match_Time': matchTime,
                       'Home_Odd': home_odd,
                       'Away_Odd': away_odd,
                       'N_Bookmakers': n_bookmakers,
                       'Date': date,
                       'Stage': stage}

                rows.append(row)

        except:
            pass

driver.quit()
data = pd.DataFrame(rows)
data.to_csv("data_odds.csv", index=False)

Output:

print(data.head(15).to_string())
Season         Home_Team          Away_Team Home_Score Away_Score     OT Match_Time Home_Odd Away_Odd N_Bookmakers         Date       Stage
0   2013-2014       Real Madrid   Maccabi Tel Aviv         86         98  False      18:00     -667     +493            7  18 May 2014  Final Four
1   2013-2014         Barcelona        CSKA Moscow         93         78  False      15:00     -135     +112            7  18 May 2014  Final Four
2   2013-2014         Barcelona        Real Madrid         62        100  False      19:00     +134     -161            7  16 May 2014  Final Four
3   2013-2014       CSKA Moscow   Maccabi Tel Aviv         67         68  False      16:00     -278     +224            7  16 May 2014  Final Four
4   2013-2014       Real Madrid        Olympiacos          83         69  False      18:45     -500     +374            7  25 Apr 2014   Play Offs
5   2013-2014       CSKA Moscow     Panathinaikos          74         44  False      16:00     -370     +295            7  25 Apr 2014   Play Offs
6   2013-2014        Olympiacos       Real Madrid          71         62  False      18:45     +127     -152            7  23 Apr 2014   Play Offs
7   2013-2014  Maccabi Tel Aviv    Olimpia Milano          86         66  False      17:45     -217     +179            7  23 Apr 2014   Play Offs
8   2013-2014     Panathinaikos       CSKA Moscow          73         72  False      16:30     -106     -112            7  23 Apr 2014   Play Offs
9   2013-2014     Panathinaikos       CSKA Moscow          65         59  False      18:45     -125     +104            7  21 Apr 2014   Play Offs
10  2013-2014  Maccabi Tel Aviv    Olimpia Milano          75         63  False      18:15     -189     +156            7  21 Apr 2014   Play Offs
11  2013-2014        Olympiacos       Real Madrid          78         76  False      17:00     +104     -125            7  21 Apr 2014   Play Offs
12  2013-2014       Galatasaray         Barcelona          75         78  False      17:00     +264     -333            7  20 Apr 2014   Play Offs
13  2013-2014    Olimpia Milano  Maccabi Tel Aviv          91         77  False      18:45     -286     +227            7  18 Apr 2014   Play Offs
14  2013-2014       CSKA Moscow     Panathinaikos          77         51  False      16:15     -303     +247            7  18 Apr 2014   Play Offs

LATEST UPDATE