Scraping with CSS Returns NULL in R & Python



I am trying to scrape all the football players who committed in a given year (in this case 2005) from a given state (in this case Alabama). Here is the link to the site with that information: https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6

To start, I just want to scrape the names. I have scraped successfully with this code before (on another site/page), but this time the value I got from Selector Gadget is ".name a", and when I put it into my code, in both R and Python, I get nothing back.

R code

library(rvest)
library(dplyr)

# pull in website by year and state, testing with Alabama in 2005
web_link2 <- "https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
web247_in2 <- read_html(web_link2)

# pull the body of the html site
web_body2 <- web247_in2 %>%
  html_node("body") %>%
  html_children()

# pull out all data from website by variable & clean up
commit_names2 <- html_nodes(web_body2, '.name a') %>%
  html_text() %>%
  as.data.frame()

Python code

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://247sports.com/Season/2005-Football/Recruits/?&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/573.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
commit_names = soup.select(".name a")
print(commit_names)

To make matters worse, this is an infinite-scroll page that reveals more entries as you scroll. I plan to tackle that once I have a successful pull.

Here is an example from another page on the same site that I was able to scrape successfully with the same code.

Successful R scrape example

library(rvest)
library(dplyr)

web_link <- "https://247sports.com/Season/2005-Football/Commits/?RecruitState=AL"
web247_in <- read_html(web_link)

# pull the body of the html site
web_body <- web247_in %>%
  html_node("body") %>%
  html_children()

# pull out all data from website by variable & clean up
commit_names <- html_nodes(web_body, '.ri-page__name-link') %>%
  html_text() %>%
  as.data.frame()

Successful Python scrape example

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://247sports.com/Season/2010-Football/Commits/?RecruitState=AL"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/573.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
commit_names = soup.select(".ri-page__name-link")
print(commit_names)

My preference is R, but at this point I will take whatever I can get. Can anyone shed some light on what I am missing here? The only things that change between the working and failing versions are the CSS value used for the scrape and the actual page — yet no data comes out.

Thanks for any help!!

The contents of the table are loaded dynamically; that is why nothing is found when you scrape the page this way.

If you right-click the page, choose "Inspect Element", go to the "Network" tab, and refresh the page, you will see a request being made to

https://247sports.com/Season/2005-Football/Recruits.json?&Items=15&Page=1&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6

This request returns JSON containing the information you want.
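Since either language will do here, a Python sketch of parsing that JSON may help. The hard-coded sample below is an assumption standing in for the live response (in practice you would fetch it with `requests.get(url, headers=headers).json()`); the nested `Player` object and its `FullName` field are inferred from the structure this endpoint returns.

```python
import json

# Hypothetical sample mirroring the endpoint's structure: a JSON array of
# recruit records, each with a nested "Player" object.
sample = json.loads("""
[
  {"Key": 27063, "Player": {"Key": 25689, "FullName": "Chris Keys",
                            "Height": "6-2", "Weight": 215}},
  {"Key": 44079, "Player": {"Key": 41761, "FullName": "Tommy Trott",
                            "Height": "6-4", "Weight": 235}}
]
""")

# Pull each player's name out of the nested "Player" object
commit_names = [rec["Player"]["FullName"] for rec in sample]
print(commit_names)  # ['Chris Keys', 'Tommy Trott']
```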

Some R code that loads it with jsonlite and parses it with tidyr::unnest_wider (see this vignette for help with that function):

library(jsonlite)
library(dplyr)  # for the pipe and tibble()
library(tidyr)  # for unnest_wider()

url <- "https://247sports.com/Season/2005-Football/Recruits.json?&Items=15&Page=1&Player.Hometown.State.Key=sa_2&RecruitInterestEvents.Type=6"
res <- read_json(url)
tibble(res = res) %>%
  unnest_wider(res) %>%
  unnest_wider(Player, names_sep = "_")

This gives a tibble with the player information:

# A tibble: 15 x 50
Key Player_Key Player_Hometown Player_FirstName Player_LastName Player_FullName Player_Height Player_Weight Player_Bio
<int>      <int> <list>          <chr>            <chr>           <chr>           <chr>                 <dbl> <chr>     
1 27063      25689 <named list [2… Chris            Keys            Chris Keys      6-2                     215 Chris Key…
2 44079      41761 <named list [2… Tommy            Trott           Tommy Trott     6-4                     235 Tommy Tro…
3 44073      41755 <named list [2… Rex              Sharpe          Rex Sharpe      6-3                     215 Rex Sharp…
4 44053      41735 <named list [2… Gabe             McKenzie        Gabe McKenzie   6-3                     218 Gabe McKe…
5 44015      41697 <named list [2… Montez           Billings        Montez Billings 6-2                     175 Montez Bi…
6 44241      41921 <named list [2… Bobby            Greenwood       Bobby Greenwood 6-4                     239 Bobby Gre…
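As for the infinite scroll: the `Items` and `Page` query parameters in the JSON URL suggest the endpoint is paginated, so scrolling reduces to requesting successive pages. A minimal sketch of building those per-page URLs (the stopping condition — an empty response — is an assumption, not something confirmed by the site):

```python
from urllib.parse import urlencode

BASE = "https://247sports.com/Season/2005-Football/Recruits.json"

def page_url(page, items=15):
    """Build the JSON endpoint URL for one page of results."""
    params = {
        "Items": items,
        "Page": page,
        "Player.Hometown.State.Key": "sa_2",
        "RecruitInterestEvents.Type": 6,
    }
    return BASE + "?" + urlencode(params)

print(page_url(1))
# In practice you would fetch page_url(1), page_url(2), ... with
# requests.get(...).json() and stop when a page comes back empty.
```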
