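I'm trying to scrape data from https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps to get every player's name and stats (overall rating, position, pac, sho, pas, dri, def, phy), but my attempts fail to detect the table's contents.
I tried: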
for(i in 1:10) {
page <- read_html(paste("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep=""))
}
StatsTable <- page %>%
html_table(fill=TRUE)
head(StatsTable)
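This outputs a list() instead of a table. How can I edit my for loop so that read_html and html_table detect the data on the site and I can build a data frame of the players' stats?
I also tried this for the first page: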
first <- read_html("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep="")
first
tab <- first %>%
html_nodes(".padding-0") %>%
html_text()
tab
### Deletes spaces and \n
tab <- gsub(" ", "", tab)
tab <- gsub("\n", " ", tab)
tab
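This way I get all the data from the first page, but everything comes back as character strings. Would it be possible to extract the names and stats from these strings and turn them into a data frame? How could that be done?
I don't think you can do what you want with html_table(). The table on the page you are trying to scrape is not an HTML table element.
You'll notice that what looks like a table is actually a <ul class="list-group list-group-table player-group-table">. You therefore need to use different html_nodes() calls to get the information you want. For example: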
page %>%
html_nodes("[class='list-group list-group-table player-group-table']") %>%
html_nodes("[class='player-info']") %>% html_nodes("[class='player-image']") %>%
html_attr("alt") -> player_names
and
page %>%
html_nodes("[class='player-right text-center hidden-xs']") %>%
html_nodes("[class='value']") %>%
html_text() %>%
matrix(nrow=length(player_names), ncol=6, byrow=T) -> player_stats
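To see why html_table() comes back empty while the node selectors work, here is a minimal offline sketch. The HTML fragment is a made-up, simplified stand-in that only mimics the class names mentioned above (the real page's markup is more complex and may differ), and the values are taken from the result shown further down. It also uses the shorter `.class` selector form instead of the exact `[class='...']` match, which is my assumption about what is sufficient here.

```r
library(rvest)

# Hypothetical, simplified stand-in for the page markup (an assumption, not the real HTML).
html <- '
<ul class="list-group list-group-table player-group-table">
  <li>
    <span class="player-info"><img class="player-image" alt="Lionel Messi 93"></span>
    <div class="player-right text-center hidden-xs">
      <span class="value">85</span><span class="value">92</span><span class="value">91</span>
      <span class="value">95</span><span class="value">34</span><span class="value">65</span>
    </div>
  </li>
  <li>
    <span class="player-info"><img class="player-image" alt="Robert Lewandowski 92"></span>
    <div class="player-right text-center hidden-xs">
      <span class="value">78</span><span class="value">92</span><span class="value">79</span>
      <span class="value">86</span><span class="value">44</span><span class="value">82</span>
    </div>
  </li>
</ul>'
page <- read_html(html)

# There is no <table> element in the document, so html_table() has nothing to return.
tables <- html_table(page)
length(tables)

# ".a .b" selects .b elements nested inside an .a element.
player_names <- page %>%
  html_nodes(".player-group-table .player-image") %>%
  html_attr("alt")

# Six stat values per player, filled row by row.
player_stats <- page %>%
  html_nodes(".player-right .value") %>%
  html_text() %>%
  as.integer() %>%
  matrix(ncol = 6, byrow = TRUE)
```

If the shorter selectors match extra elements on the real page, fall back to the exact attribute selectors used above.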
One way to get the player positions is to use gsub() to pull out the string between <strong> and </strong> inside the player-club-league-name class:
page %>%
html_nodes("[class='list-group list-group-table player-group-table']") %>%
html_nodes("[class='player-club-league-name']") %>%
gsub(".*<strong>(.+)</strong>.*", "\\1", .) -> positions
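As a hedged illustration of the position extraction, the sketch below runs on a made-up fragment. The club and league text and the exact layout inside player-club-league-name are assumptions, not the real markup. It shows the regex route (with the backreference written as "\\1", and "(?s)" so that "." also spans line breaks) next to a selector-only alternative that avoids regex on HTML altogether.

```r
library(rvest)

# Hypothetical fragment; only the class name and the <strong> tag come from the answer.
html <- '
<ul class="list-group list-group-table player-group-table">
  <li><span class="player-club-league-name"><strong>RW</strong> | Club A | League A</span></li>
  <li><span class="player-club-league-name"><strong>ST</strong> | Club B | League B</span></li>
</ul>'

nodes <- read_html(html) %>%
  html_nodes(".player-club-league-name")

# Regex route: serialise each node to HTML text, then keep what sits between the tags.
positions_regex <- gsub("(?s).*<strong>(.+?)</strong>.*", "\\1",
                        as.character(nodes), perl = TRUE)

# Selector route: select the <strong> children directly and read their text.
positions_css <- nodes %>%
  html_nodes("strong") %>%
  html_text()
```

The selector route tends to be less brittle, since it does not depend on how the node is serialised to a string.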
Finally, put everything into a data.frame:
# make player_names into a tibble and extract overall score
library(tidyverse)
player_names %>% as_tibble() -> player_names
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall
# stat names for player_stats
as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')
#bind everything together
bind_cols(player_names, player_stats) -> players
rm(player_names); rm(player_stats)
Result:
> players
# A tibble: 48 x 8
   player                          overall pac   sho   pas   dri   def   phy
   <chr>                           <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
 1 Lionel Messi 93                 93      85    92    91    95    34    65
 2 Robert Lewandowski 92           92      78    92    79    86    44    82
 3 C. Ronaldo dos Santos Aveiro 91 91      87    93    82    88    34    75
 4 Kevin De Bruyne 91              91      76    86    93    88    64    78
 5 Neymar da Silva Santos Jr. 91   91      91    83    86    94    37    63
 6 Kylian Mbappé 91                91      97    88    80    92    36    77
 7 Harry Kane 90                   90      70    91    83    83    47    83
 8 N'Golo Kanté 90                 90      78    66    75    82    87    83
 9 Mohamed Salah 89                89      90    87    81    90    45    75
10 Karim Benzema 89                89      76    86    81    87    39    77
# … with 38 more rows
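In the result above, the player column still carries the rating and every column is character. A small sketch of one way to clean that up follows; the two-row tibble is a hand-typed stand-in using values from the output above, and `cleaned` is just an illustrative name. Extracting the trailing digits with a regex, rather than taking the last two characters, is my own variation and would also cope with a rating that is not exactly two digits.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hand-typed stand-in for the scraped result (values copied from the output above).
players <- tibble(
  player = c("Lionel Messi 93", "Harry Kane 90"),
  pac = c("85", "70"), sho = c("92", "91"), pas = c("91", "83"),
  dri = c("95", "83"), def = c("34", "47"), phy = c("65", "83")
)

cleaned <- players %>%
  mutate(
    overall = as.integer(str_extract(player, "\\d+$")),  # trailing digits = rating
    player  = str_trim(str_remove(player, "\\d+$")),     # name without the rating
    across(pac:phy, as.integer)                          # stats as numbers
  ) %>%
  relocate(overall, .after = player)
```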
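I updated the code so that you can scrape the first ten subpages into one data frame in a single run. Note that the scraping code comes from @Otto_Kässi's answer, so all the credit should go to him!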
library(rvest)
library(stringr)
library(tidyverse)
url <- "https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps"
p1 <- str_c("https://www.futhead.com/22/players/",'?page=', 1:10)
pages <- paste0(p1,"&level=gold_nif&bin_platform=ps")
df <- tibble(player = character(),
overall= character(),
pac = character(),
sho = character(),
pas = character(),
dri = character(),
def = character(),
phy = character())
for (i in pages) {
i %>% read_html() %>%
html_nodes("[class='list-group list-group-table player-group-table']") %>%
html_nodes("[class='player-info']") %>% html_nodes("[class='player-image']") %>%
html_attr("alt") -> player_names
i %>% read_html() %>%
html_nodes("[class='player-right text-center hidden-xs']") %>%
html_nodes("[class='value']") %>%
html_text() %>%
matrix(nrow=length(player_names), ncol=6, byrow=T) -> player_stats
player_names %>% as_tibble() -> player_names
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall
as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')
#bind everything together
bind_cols(player_names, player_stats) -> players
df <- rbind(df, players)
rm(player_names); rm(player_stats); rm(players)
}
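# strip the rating from the name and convert all seven numeric columns (overall through phy)
df <- df %>% mutate(player = str_trim(str_replace_all(player, "[:digit:]", ""))) %>% mutate_at(vars(2:8), as.numeric)
If you run the whole code in one go, it should work!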
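For what it's worth, the loop can also be arranged so that each page is downloaded only once and the parsing lives in a function. The sketch below is one possible arrangement, not the original author's code: `parse_players()` and `scrape_pages()` are hypothetical helper names, the shorter class selectors and the one-second delay are my assumptions, and the inline HTML is a made-up stand-in so the parsing can be checked offline.

```r
library(rvest)
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical helper: turn one already-parsed page into a tibble of players.
parse_players <- function(page) {
  alt <- page %>%
    html_nodes(".player-group-table .player-image") %>%
    html_attr("alt")
  stats <- page %>%
    html_nodes(".player-right .value") %>%
    html_text() %>%
    as.numeric() %>%
    matrix(ncol = 6, byrow = TRUE,
           dimnames = list(NULL, c("pac", "sho", "pas", "dri", "def", "phy")))
  bind_cols(
    tibble(player  = str_trim(str_remove(alt, "\\d+$")),
           overall = as.numeric(str_extract(alt, "\\d+$"))),
    as_tibble(stats)
  )
}

# Hypothetical helper: download each URL once, pause between requests, stack the results.
scrape_pages <- function(urls, delay = 1) {
  lapply(urls, function(u) {
    Sys.sleep(delay)
    parse_players(read_html(u))
  }) %>%
    bind_rows()
}

# Offline check on a made-up fragment (an assumption, not the real page markup).
demo <- read_html('
<ul class="list-group list-group-table player-group-table">
  <li>
    <span class="player-info"><img class="player-image" alt="Karim Benzema 89"></span>
    <div class="player-right text-center hidden-xs">
      <span class="value">76</span><span class="value">86</span><span class="value">81</span>
      <span class="value">87</span><span class="value">39</span><span class="value">77</span>
    </div>
  </li>
</ul>')
demo_df <- parse_players(demo)

# With the `pages` vector built above, the live call would be:
# df <- scrape_pages(pages)
```

Keeping the download and the parsing separate makes it easier to test the selectors on a saved page before hitting the site ten times.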