R - Web scraping not working, table not detected



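I am trying to scrape data from https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps to get all the players' names and their stats (overall rating, position, pac, sho, pas, dri, def, phy), but rvest cannot detect the table information.

I tried: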

for(i in 1:10) {
page <- read_html(paste("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep=""))
}
StatsTable <- page %>%
html_table(fill=TRUE)
head(StatsTable)

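This outputs a list() instead of a table. How can I edit my for loop so that read_html and html_table detect the data on the website, so that I can create a data frame with the players' stats?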


I also tried this for the first page:

first <- read_html("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps")
first
tab <- first %>%
html_nodes(".padding-0") %>%
html_text()
tab
### Deletes spaces and \n
tab <- gsub("  ", "", tab)
tab <- gsub("\n", " ", tab)
tab

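This way I got all the data from the first page, but everything comes back as character strings. Is it possible to extract the names and stats from these strings and turn them into a data frame? How could that be done?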

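I don't think you can do what you want with html_table. The table on the page you are trying to scrape is not an HTML table element.

You'll notice that what looks like a table is actually a <ul class="list-group list-group-table player-group-table">. You therefore need different html_nodes() calls to get the information you want. For example: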

page %>% 
html_nodes("[class='list-group list-group-table player-group-table']") %>% 
html_nodes("[class='player-info']") %>% html_nodes("[class='player-image']") %>% 
html_attr("alt") -> player_names 

page %>% 
html_nodes("[class='player-right text-center hidden-xs']") %>% 
html_nodes("[class='value']") %>% 
html_text() %>% 
matrix(nrow=length(player_names), ncol=6, byrow=T) -> player_stats
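
To see the difference without hitting the site, here is a small sketch on inline HTML. The markup is made up and only imitates the structure described above (a ul with the same class names), so treat it as an assumption about the real page rather than a copy of it. It shows html_table() returning an empty list when there is no table element, html_nodes() still finding the data, and the byrow reshape turning the flat vector of values into one row per player.

```r
library(rvest)

# Hypothetical markup: a <ul> styled as a table, no <table> element anywhere
html <- '
<ul class="list-group list-group-table player-group-table">
  <li>
    <span class="player-info"><img class="player-image" alt="Player A 90"></span>
    <span class="player-right text-center hidden-xs">
      <span class="value">80</span><span class="value">81</span>
      <span class="value">82</span><span class="value">83</span>
      <span class="value">84</span><span class="value">85</span>
    </span>
  </li>
  <li>
    <span class="player-info"><img class="player-image" alt="Player B 88"></span>
    <span class="player-right text-center hidden-xs">
      <span class="value">70</span><span class="value">71</span>
      <span class="value">72</span><span class="value">73</span>
      <span class="value">74</span><span class="value">75</span>
    </span>
  </li>
</ul>'

doc <- read_html(html)

# html_table() only looks for <table> elements, so nothing is found
tables <- html_table(doc)

# html_nodes() selects by attribute, so the names are found
player_names <- doc %>%
  html_nodes("[class='player-image']") %>%
  html_attr("alt")

# The values come back as one flat vector; byrow = TRUE fills row by row
player_stats <- doc %>%
  html_nodes("[class='value']") %>%
  html_text() %>%
  matrix(nrow = length(player_names), ncol = 6, byrow = TRUE)
```

Here tables is an empty list, while player_names has two entries and player_stats is a 2 x 6 character matrix.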

One way to get the player positions is to use gsub() to find the string between <strong></strong> in the player-club-league-name class.

page %>% 
html_nodes("[class='list-group list-group-table player-group-table']") %>% 
html_nodes("[class='player-club-league-name']") %>% 
gsub(".*<strong>(.+)</strong>.*", "\\1", .) -> positions
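
A quick way to check the gsub() pattern is to run it on a single string. In an R string literal the backreference to the first capture group has to be written with a doubled backslash, "\\1". The snippet below is a made-up example of what one player-club-league-name node might look like, not the site's actual markup. Selecting the strong element directly is shown as an alternative to the regex.

```r
library(rvest)

# Hypothetical node content; the real markup may differ
snippet <- '<span class="player-club-league-name"><strong>ST</strong> | Club | League</span>'

# Keep only what sits between <strong> and </strong>
position_regex <- gsub(".*<strong>(.+)</strong>.*", "\\1", snippet)

# Alternative: select the <strong> element and read its text
position_node <- read_html(snippet) %>%
  html_nodes("strong") %>%
  html_text()
```

Both position_regex and position_node end up as "ST" for this snippet.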

Finally, put everything into a data.frame:

# make player_names into a tibble and extract overall score
library(tidyverse)
player_names %>% as_tibble() -> player_names 
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall

# stat names for player_stats
as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')
#bind everything together
bind_cols(player_names, player_stats) -> players
rm(player_names); rm(player_stats) 

Result:

> players
# A tibble: 48 x 8
player                          overall pac   sho   pas   dri   def   phy
<chr>                           <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
1 Lionel Messi 93                 93      85    92    91    95    34    65
2 Robert Lewandowski 92           92      78    92    79    86    44    82
3 C. Ronaldo dos Santos Aveiro 91 91      87    93    82    88    34    75
4 Kevin De Bruyne 91              91      76    86    93    88    64    78
5 Neymar da Silva Santos Jr. 91   91      91    83    86    94    37    63
6 Kylian Mbappé 91                91      97    88    80    92    36    77
7 Harry Kane 90                   90      70    91    83    83    47    83
8 N'Golo Kanté 90                 90      78    66    75    82    87    83
9 Mohamed Salah 89                89      90    87    81    90    45    75
10 Karim Benzema 89                89      76    86    81    87    39    77
# … with 38 more rows
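
The player column in the result above still carries the rating at the end of the name, and every column is character. As a possible refinement, the alt text can be split with a regular expression anchored at the end of the string instead of substr(). This is only a sketch in base R: the first two alt texts are taken from the output above, the third is made up to show that the pattern does not depend on the number of digits.

```r
# Alt texts in the "<name> <overall>" form
alt <- c("Lionel Messi 93", "Harry Kane 90", "Player X 100")

# Trailing digits become the numeric overall rating
overall <- as.numeric(regmatches(alt, regexpr("[0-9]+$", alt)))

# Everything before the trailing digits (and the space) is the name
player <- sub("[[:space:]]*[0-9]+$", "", alt)

players <- data.frame(player, overall, stringsAsFactors = FALSE)
```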

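I updated the code so that you can scrape the first ten subpages into a single data frame in one go. Note that the scraping code comes from @Otto_Kässi's answer, so all the credit should go to him!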

library(rvest)
library(stringr)
library(tidyverse)
url <- "https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps"
p1 <- str_c("https://www.futhead.com/22/players/",'?page=', 1:10)
pages <- paste0(p1,"&level=gold_nif&bin_platform=ps")
df <- tibble(player = character(),
overall= character(),
pac = character(),
sho = character(),
pas = character(),
dri = character(),
def = character(),
phy = character())
for (i in pages) {
i %>% read_html() %>% 
html_nodes("[class='list-group list-group-table player-group-table']") %>% 
html_nodes("[class='player-info']") %>% html_nodes("[class='player-image']") %>% 
html_attr("alt") -> player_names 

i %>% read_html() %>% 
html_nodes("[class='player-right text-center hidden-xs']") %>% 
html_nodes("[class='value']") %>% 
html_text() %>% 
matrix(nrow=length(player_names), ncol=6, byrow=T) -> player_stats

player_names %>% as_tibble() -> player_names 
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall

as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')

#bind everything together
bind_cols(player_names, player_stats) -> players
df <- rbind(df, players)
rm(player_names); rm(player_stats); rm(players)
} 
# strip the rating from the name and convert all eight stat columns (2:8) to numeric
df <- df %>% mutate(player = str_trim(str_replace_all(player, "[:digit:]", ""))) %>%  mutate_at(vars(2:8), as.numeric)

If you run the whole code at once, it should work!
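
The loop above downloads each page twice and grows df with rbind on every iteration. One possible variation is to parse each page once in a helper and bind all the pieces at the end. The sketch below assumes the same class names as the code above. parse_players and fake_page are hypothetical names introduced here, and the pages are faked with inline HTML so the example runs without network access.

```r
library(rvest)

# Hypothetical helper: turn one already-parsed page into a data frame
parse_players <- function(doc) {
  alt <- doc %>%
    html_nodes("[class='player-image']") %>%
    html_attr("alt")
  stats <- doc %>%
    html_nodes("[class='player-right text-center hidden-xs']") %>%
    html_nodes("[class='value']") %>%
    html_text() %>%
    as.numeric() %>%
    matrix(ncol = 6, byrow = TRUE)
  colnames(stats) <- c("pac", "sho", "pas", "dri", "def", "phy")
  data.frame(
    player  = sub("[[:space:]]*[0-9]+$", "", alt),
    overall = as.numeric(regmatches(alt, regexpr("[0-9]+$", alt))),
    stats,
    stringsAsFactors = FALSE
  )
}

# Hypothetical stand-in for a downloaded page (made-up markup, one player)
fake_page <- function(name, overall, stats) {
  values <- paste0('<span class="value">', stats, '</span>', collapse = "")
  read_html(paste0(
    '<ul><li><img class="player-image" alt="', name, " ", overall, '">',
    '<span class="player-right text-center hidden-xs">', values,
    '</span></li></ul>'
  ))
}

docs <- list(
  fake_page("Player A", 90, c(80, 81, 82, 83, 84, 85)),
  fake_page("Player B", 88, c(70, 71, 72, 73, 74, 75))
)

# Parse every page once, then bind the per-page data frames in one step
df <- do.call(rbind, lapply(docs, parse_players))
```

With the real site, docs would instead come from lapply(pages, read_html), using the pages vector built above.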