I am trying to scrape the page https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads and can extract the text data using rvest:
library(plyr)
library(XML)
library(rvest)
library(dplyr)
library(magrittr)
library(data.table)
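# Download and parse the page once instead of on every iteration
html <- read_html("https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads")
tables <- html_nodes(html, "table")
for (i in 1:16) {
  # Build the variable name squad1, squad2, ...
  squad_name <- paste0("squad", i)
  print(squad_name)
  # Store the i-th table as a data frame under that name
  assign(squad_name, html_table(tables[[i]]))
}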
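but I would also like to add an extra column to each table holding the URL of each player's club. For example, for squad 1 (Poland on the page, truncated to show only the first few players):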
  0#0 Pos. Player              Date of birth (age)                    Caps Goals Club
1 1   1GK  Wojciech Szczęsny   (1990-04-18)18 April 1990 (aged 22)    11   0     Arsenal
2 2   2DF  Sebastian Boenisch  (1987-02-01)1 February 1987 (aged 25)  9    0     Werder Bremen
3 3   2DF  Grzegorz Wojtkowiak (1984-01-26)26 January 1984 (aged 28)  19   0     Lech Poznań
4 4   2DF  Marcin Kamiński     (1992-01-15)15 January 1992 (aged 20)  3    0     Lech Poznań
5 5   3MF  Dariusz Dudka       (1983-12-09)9 December 1983 (aged 28)  65   2     Auxerre
6 6   3MF  Adam Matuszczyk     (1989-02-14)14 February 1989 (aged 23) 20   1     Fortuna Düsseldorf
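I would like to add a column after "Club" called "clubURL" that shows the Wikipedia URL for that club. For example, the first player plays for Arsenal, so I want to take the link behind "Arsenal" in the table and create: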
  0#0 Pos. Player            Date of birth (age)                 Caps Goals Club
1 1   1GK  Wojciech Szczęsny (1990-04-18)18 April 1990 (aged 22) 11   0     Arsenal
  clubURL
1 https://en.wikipedia.org/wiki/Arsenal_F.C.
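and so on. I found the question "rvest table scraping including links", but I could not get that example to work or adapt it to do what I want. Apologies if this has been asked elsewhere.
Thanks,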
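I made an example using the first table on the page. You can extend it as needed.
First, I grab the first table node and convert it to a data frame with html_table. Then I define a helper function that, given a link's text, extracts the corresponding link from the table. Finally, I use sapply to populate a new column in the data frame.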
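library("rvest")
url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"
# Keep the first table as an HTML node so its links stay available
mytable <- read_html(url) %>% html_nodes("table") %>% .[[1]]
# Convert the same node to a data frame (cell text only, links are dropped)
df <- mytable %>% html_table()
# Return the href of the first link in table_node whose text equals team.
# The leading "." in the XPath keeps the search inside this table;
# a bare "//a" would search the whole document.
get_link <- function(table_node, team) {
  table_node %>%
    html_nodes(xpath = paste0(".//a[text()='", team, "']")) %>%
    .[[1]] %>%
    html_attr("href")
}
# Look up the link for every club name in the Club column
df$club_link <- sapply(df$Club, function(x) get_link(mytable, x))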
> head(df)
  0#0 Pos. Player
1 1   1GK  Wojciech Szczęsny
2 2   2DF  Sebastian Boenisch
3 3   2DF  Grzegorz Wojtkowiak
4 4   2DF  Marcin Kamiński
5 5   3MF  Dariusz Dudka
6 6   3MF  Adam Matuszczyk
  Date of birth (age)                    Caps Goals
1 (1990-04-18)18 April 1990 (aged 22)    11   0
2 (1987-02-01)1 February 1987 (aged 25)  9    0
3 (1984-01-26)26 January 1984 (aged 28)  19   0
4 (1992-01-15)15 January 1992 (aged 20)  3    0
5 (1983-12-09)9 December 1983 (aged 28)  65   2
6 (1989-02-14)14 February 1989 (aged 23) 20   1
  Club               club_link
1 Arsenal            /wiki/Arsenal_F.C.
2 Werder Bremen      /wiki/SV_Werder_Bremen
3 Lech Poznań        /wiki/Lech_Pozna%C5%84
4 Lech Poznań        /wiki/Lech_Pozna%C5%84
5 Auxerre            /wiki/AJ_Auxerre
6 Fortuna Düsseldorf /wiki/Fortuna_D%C3%BCsseldorf
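The links above are relative (/wiki/...), while the question asks for full URLs and for all of the squad tables. One possible way to extend the example is sketched below. It is only a sketch under some assumptions: the inline HTML is a small made-up stand-in for the Wikipedia page so that the code runs offline (the second table and its "Example" rows are invented), add_club_url is a hypothetical helper name, and the column is named clubURL as in the question. It uses xml2::url_absolute to resolve each href against the page URL and returns NA when a club has no link.

```r
library(rvest)
library(xml2)

base_url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"

# Made-up stand-in for the real page (assumption) so the sketch runs offline.
# With the real page, use: page <- read_html(base_url)
page <- read_html('
<html><body>
<table>
  <tr><th>Player</th><th>Club</th></tr>
  <tr><td>Sebastian Boenisch</td>
      <td><a href="/wiki/SV_Werder_Bremen">Werder Bremen</a></td></tr>
  <tr><td>Dariusz Dudka</td>
      <td><a href="/wiki/AJ_Auxerre">Auxerre</a></td></tr>
</table>
<table>
  <tr><th>Player</th><th>Club</th></tr>
  <tr><td>Example Player 1</td>
      <td><a href="/wiki/Example_Club">Example Club</a></td></tr>
  <tr><td>Example Player 2</td><td>Unlinked Club</td></tr>
</table>
</body></html>')

# Return the absolute URL of the first link in table_node whose text equals
# team, or NA if there is no such link. Club names containing an apostrophe
# would break this simple XPath string and need extra handling.
get_link <- function(table_node, team) {
  hrefs <- table_node %>%
    html_nodes(xpath = paste0(".//a[text()='", team, "']")) %>%
    html_attr("href")
  if (length(hrefs) == 0) {
    return(NA_character_)
  }
  url_absolute(hrefs[[1]], base_url)
}

# Hypothetical helper: convert one table node to a data frame and add clubURL
add_club_url <- function(table_node) {
  df <- html_table(table_node)
  df$clubURL <- vapply(
    df$Club,
    function(x) get_link(table_node, x),
    character(1),
    USE.NAMES = FALSE
  )
  df
}

# One data frame per table, kept in a list instead of separate variables
squads <- lapply(html_nodes(page, "table"), add_club_url)
print(squads[[1]])
```

On the real page you would probably restrict the list to the 16 squad tables, for example with html_nodes(page, "table")[1:16], since the page may contain other tables as well. Keeping the results in a list also avoids creating squad1, squad2, ... with assign.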