Scraping a website in R when the URL does not change



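I want to scrape a series of tables from a website whose URL does not change when I click through the tables in the browser. Each table corresponds to a unique date, and the default table is the one for today's date. I can scroll through past dates in the browser, but I cannot find a way to do the same in R.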

Using library(rvest), this code reliably downloads the table for today's date (I am only interested in the first of the three tables).

webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html()  %>%
html_table()
off <- off[[1]]

How can I download the tables corresponding to, for example, "2022-10-04" through "2022-10-06", or to yesterday?

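I tried to do this by identifying the node that holds the table, hoping I could manipulate it to reflect an earlier date. However, the following returns the same table as above: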

webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
html_table()
off <- off[[1]]

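Scrolling through past dates in the browser, I identified several places in the HTML that reference the previous date, but I cannot seem to change them from R, let alone download a table that reflects the change: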

webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > header > div")

I have also tinkered with html_form(), follow_link(), and set_values(), but to no avail.

Is there a good way to navigate this particular URL in R?

You can consider the following approach:

library(RSelenium)
library(rvest)

# Start a Selenium server and a Chrome session on a random port.
# chromever must match a chromedriver version available for the installed Chrome.
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)

# Open the date filter menu.
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

# Clear the date field and type the desired date (YYYY-MM-DD).
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()

# Click the filter button again, then submit the date filter form.
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()

# Parse every table found in the rendered page.
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
<chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S     M     T     W     T     F     S
<int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
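
The code above fetches a single date. For the range asked about in the question (2022-10-04 through 2022-10-06, or yesterday), the date strings can be built with base R and then typed into the date field one at a time. This is only a minimal sketch: the names `date_strings` and `yesterday` are introduced here for illustration, and the YYYY-MM-DD format is assumed from the value sent to the 'ref-date' field above.

```r
# Build one "YYYY-MM-DD" string per day in the requested range (base R only).
date_strings <- format(
  seq(as.Date("2022-10-04"), as.Date("2022-10-06"), by = "day"),
  "%Y-%m-%d"
)

# Yesterday's date in the same format, relative to the system clock.
yesterday <- format(Sys.Date() - 1, "%Y-%m-%d")

print(date_strings)
```

Each element of `date_strings` can then replace the hard-coded "2022-10-05" in the sendKeysToElement() call.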

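Here is another approach you could consider (Windows only, since it drives Internet Explorer through RDCOMClient):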

library(RDCOMClient)
library(rvest)
url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)
web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)
web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"
web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)
html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
<chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       
[[2]]
# A tibble: 8 x 7
Game   `Official 1` `Official 2` `Official 3` Alternate   ``    ``   
<chr>  <chr>        <chr>        <chr>        <chr>       <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate"  NA    NA  
2 "S"    "M"          "T"          "W"          "T"         "F"   "S"  
3 ""     ""           ""           ""           ""          ""    "1"  
4 "2"    "3"          "4"          "5"          "6"         "7"   "8"  
5 "9"    "10"         "11"         "12"         "13"        "14"  "15" 
6 "16"   "17"         "18"         "19"         "20"        "21"  "22" 
7 "23"   "24"         "25"         "26"         "27"        "28"  "29" 
8 "30"   "31"         ""           ""           ""          ""    ""   
[[3]]
# A tibble: 7 x 7
Game  `Official 1` `Official 2` `Official 3` Alternate ``    ``   
<chr> <chr>        <chr>        <chr>        <chr>     <chr> <chr>
1 "S"   "M"          "T"          "W"          "T"       "F"   "S"  
2 ""    ""           ""           ""           ""        ""    "1"  
3 "2"   "3"          "4"          "5"          "6"       "7"   "8"  
4 "9"   "10"         "11"         "12"         "13"      "14"  "15" 
5 "16"  "17"         "18"         "19"         "20"      "21"  "22" 
6 "23"  "24"         "25"         "26"         "27"      "28"  "29" 
7 "30"  "31"         ""           ""           ""        ""    ""   
[[4]]
# A tibble: 6 x 7
S     M     T     W     T     F     S
<int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA

If you install Docker (see https://docs.docker.com/engine/install/), you can consider the following approach with Firefox:

library(RSelenium)
library(rvest)
# shell() exists only on Windows; on other platforms, system() can run the same command.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
<chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S     M     T     W     T     F     S
<int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
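
Since the question only needs the first table, and needs it for several dates, the Selenium steps above could be wrapped in a function. The sketch below is not part of the original answers: `scrape_assignments` and its `wait` argument are hypothetical names, the fixed pause after submitting is an assumption about how long the page needs to refresh, and the selectors are copied unchanged from the code above.

```r
# Hypothetical helper: repeats the Selenium steps shown above for one date
# and returns only the first table on the page.
# `remDr` is an already opened RSelenium session; `date` is "YYYY-MM-DD".
scrape_assignments <- function(remDr, date, wait = 3,
                               url = "https://official.nba.com/referee-assignments/") {
  remDr$navigate(url)

  # Open the date filter menu.
  menu_button <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
  menu_button$clickElement()

  # Replace the contents of the date field with the requested date.
  date_input <- remDr$findElement("id", "ref-date")
  date_input$clearElement()
  date_input$sendKeysToElement(list(date))
  date_input$doubleclick()

  # Click the filter button again, then submit the date filter form.
  menu_button <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
  menu_button$clickElement()
  go_button <- remDr$findElement("css selector", "#date-filter")
  go_button$submitElement()

  # Assumption: a fixed pause is enough for the page to refresh.
  Sys.sleep(wait)

  # Parse the rendered page and keep the first table only.
  page <- rvest::read_html(remDr$getPageSource()[[1]])
  rvest::html_table(page)[[1]]
}
```

With an open session `remDr` as above, calling `lapply(c("2022-10-04", "2022-10-05", "2022-10-06"), function(d) scrape_assignments(remDr, d))` would return one table per date, assuming the page behaves the same way for each of them.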
