Handling data that appears in an inconsistent order when web scraping with Selenium

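The three URLs below are examples of the data I am trying to scrape. The information is on the left side of the page and includes athletic information along with some other statistics. The data is pulled in as one big element. I tried to separate the individual pieces by index number, but the information is in a different order for each athlete, or is not available at all. This causes index errors, or picks up the wrong information (for example, getting the 40 yard dash under the squat number):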

https://www.hudl.com/profile/7670389/GaQuincy-McKinstry Jersey #: 1 Positions: CB, WR Height & Weight: 6'1" 189lbs 40 Yard Dash: 4.55 Bench: 190 Squat(LBS): 370 Clean(LBS): 225 Class of: 2021
https://www.hudl.com/profile/10316846/Dylan-Rosiek Jersey #: 6 Positions: MLB, RB Height & Weight: 6'1" 210lbs Class of: 2021
https://www.hudl.com/profile/10015742/Donovan-Jackson Jersey #: 77 Positions: T, G Height & Weight: 6'4" 310lbs 40 Yard Dash: 5.1 Vertical: 29 Powerball: 35 Bench: 365 Squat(LBS): 415 Deadlift(LBS): 435 Class of: 2021

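How can I make sure I am writing to the correct column in a pandas DataFrame? Below is the code I tried for the first URL. It is indexed specifically for that page and does not work on the others. I have put in print statements for now to see what data I am pulling, but I will eventually build a pandas DataFrame. I am also not sure whether I should be getting the information by CSS selector or by class name.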

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
import time

TIMEOUT = 5
driver = webdriver.Firefox()
driver.set_page_load_timeout(TIMEOUT)
url = 'https://www.hudl.com/profile/7670389/GaQuincy-McKinstry'

# Load the page; carry on with whatever has loaded if the timeout is hit.
try:
    driver.get(url)
except TimeoutException:
    pass
time.sleep(3)

# Click the button on the profile if it is present.
try:
    isPresent = driver.find_element_by_xpath('//*[@id="app"]/div/div[2]/div/div/div[2]/div[3]/div/div[1]/div[1]/div[1]/button')
    isPresent.click()
except Exception:
    pass
time.sleep(3)

# First attempt: select the stats list by CSS selector.
skills = driver.find_elements_by_css_selector('#app > div > div.prof-flex-height > div > div > div.parallax-layer.front > div.profile-tab > div > div.left-column > div.stats > ul')
skills = [one.text for one in skills]
print(skills)

# Second attempt: select by class name, then pick each value by position.
try:
    athletic_skills = driver.find_elements_by_class_name('stats-list')
    athletic_skills = [skill.text for skill in athletic_skills]
    # One statistic per line of the element's text.
    athletic_skills = athletic_skills[-1].split('\n')
    jersey = athletic_skills[0].replace('Jersey #: ', '')
    position = athletic_skills[1].replace('Positions: ', '')
    height_weight = athletic_skills[2].replace('Height & Weight: ', '')
    height_weight = height_weight.split()
    height = height_weight[0]
    weight = height_weight[-1]
    yard_dash = athletic_skills[3].replace('40 Yard Dash: ', '')
    bench = athletic_skills[4].replace('Bench: ', '')
    squat = athletic_skills[5].replace('Squat(LBS): ', '')
    clean = athletic_skills[6].replace('Clean(LBS): ', '')
    grad_year = athletic_skills[7].replace('Class of: ', '')
    print(athletic_skills)
    print(jersey)
    print(position)
    print(height_weight)
    print(height)
    print(weight)
    print(yard_dash)
    print(bench)
    print(squat)
    print(clean)
    print(grad_year)
except Exception:
    pass
driver.close()

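Short answer: first load the raw data for each player into a Python dictionary.

Longer answer:

A dictionary lets you map a key (for example 40 Yard Dash) to its related statistic (for example 4.55).

You can use the data you have already captured in athletic_skills as a starting point.

For example: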

# new empty dictionary:
mckinstry_skills = {}
for skill_stats in athletic_skills:
    # separate the skill name from the related statistic:
    skill_stats = skill_stats.split(': ', 1)
    # add this as a new entry into the dictionary:
    mckinstry_skills[skill_stats[0]] = skill_stats[1]
# print the full dictionary:
print(mckinstry_skills)
# print the results of retrieving one item:
print(mckinstry_skills['40 Yard Dash'])

The first print statement gives the following output (formatted here for clarity):

{ 
'Jersey #'       : '1', 
'Positions'      : 'CB, WR', 
'Height & Weight': '6\'1" 189lbs', 
'40 Yard Dash'   : '4.55', 
'Bench'          : '190', 
'Squat(LBS)'     : '370', 
'Clean(LBS)'     : '225', 
'Class of'       : '2021'
}

The second print statement returns just the following:

4.55

Now you can always reliably get the correct statistic for the pandas column you need.
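To reuse the loop above for every athlete, it could be wrapped in a small function. The following is only a sketch and not part of the original answer: parse_stats is a hypothetical name, and it assumes each statistic sits on its own line in "Label: value" form, as in the output shown above. The sample text is typed in by hand from the second profile in the question rather than scraped.

```python
def parse_stats(raw_text):
    """Turn a block of 'Label: value' lines into a dictionary."""
    stats = {}
    for line in raw_text.split('\n'):
        # Split only on the first ': ' so the value keeps any later colons.
        label, separator, value = line.partition(': ')
        if not separator:
            # Skip lines that do not look like 'Label: value'
            # instead of raising an IndexError.
            continue
        stats[label.strip()] = value.strip()
    return stats


# Hand-typed sample (assumed layout), based on the second URL in the question.
rosiek_text = (
    "Jersey #: 6\n"
    "Positions: MLB, RB\n"
    "Height & Weight: 6'1\" 210lbs\n"
    "Class of: 2021"
)
rosiek_skills = parse_stats(rosiek_text)
print(rosiek_skills)
```

With the scraped element, the call would presumably look like parse_stats(athletic_skills_text), where athletic_skills_text is the element's text before it is split into lines.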

Because not every player has every statistic, you may want to check that a key exists before you try to retrieve its value:

if '40 Yard Dash' in mckinstry_skills:
    print(mckinstry_skills['40 Yard Dash'])
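An alternative to the membership check above is dict.get, which returns None when a key is missing. The sketch below uses it to build one row per athlete with the same set of columns every time. It is an illustration under assumptions, not the original answer's code: COLUMNS and to_row are hypothetical names, and the two dictionaries are typed in by hand from the values listed in the question.

```python
# Column labels as they appear in the parsed dictionaries (assumed fixed list).
COLUMNS = ['Jersey #', 'Positions', 'Height & Weight', '40 Yard Dash',
           'Bench', 'Squat(LBS)', 'Clean(LBS)', 'Class of']


def to_row(name, skills):
    """Build a row with every column present; missing stats become None."""
    row = {'Name': name}
    for column in COLUMNS:
        # dict.get returns None instead of raising KeyError.
        row[column] = skills.get(column)
    return row


# Hand-typed sample data taken from the first two profiles in the question.
mckinstry_skills = {
    'Jersey #': '1', 'Positions': 'CB, WR',
    'Height & Weight': '6\'1" 189lbs', '40 Yard Dash': '4.55',
    'Bench': '190', 'Squat(LBS)': '370', 'Clean(LBS)': '225',
    'Class of': '2021',
}
rosiek_skills = {
    'Jersey #': '6', 'Positions': 'MLB, RB',
    'Height & Weight': '6\'1" 210lbs', 'Class of': '2021',
}

rows = [
    to_row('GaQuincy McKinstry', mckinstry_skills),
    to_row('Dylan Rosiek', rosiek_skills),
]
for row in rows:
    print(row['Name'], row['40 Yard Dash'])
```

A list of dictionaries like rows can be passed directly to pandas.DataFrame, so each statistic lands in its own column regardless of the order it appeared on the page.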

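If you are not familiar with dictionaries, there are plenty of overviews available. If you already know them well, please forgive the over-explanation.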
