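I am trying to scrape data from an HTML table.
If a table cell contains an image, it is read as NA. I am able to read the image titles separately.
Here is the code I have tried: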
library(rvest)

# Page to scrape
weburl <- "https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/plus/?saison_id=2017&s_w=&leihe=0&leihe=1&intern=0&intern=1"
webcontent <- read_html(weburl)

# Convert every matching table to a data frame
table_text <- webcontent %>%
  html_nodes(".responsive-table table") %>%
  html_table()

# I am able to pull the nationality individually, but it cannot be joined back,
# because one player can have two possible values
nationality_text <- webcontent %>%
  html_nodes(".responsive-table table td.zentriert.nat-transfer-cell img") %>%
  html_attr("title")
Can anyone help me get the image titles into the table? I am currently using the rvest package.
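You can use the xml2 package for a hack. It is installed as a dependency of rvest; recent rvest versions import it without attaching it, so call library(xml2) if xml_find_all is not found.
I grab all the img nodes for the flags and replace their text with their title attribute, followed by a semicolon as a separator. Then, when you convert the table to a data.frame, html_text picks up that text.
(Note that this is not valid XHTML, but it works for rvest: the text is never even exported as HTML.)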
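The idea can be tried offline on a small inline table first. This is only a minimal sketch: the HTML, the player names, and the image file names below are made up for illustration, and only the nat-transfer-cell class and the rvest/xml2 calls are taken from the code in this thread.

```r
library(rvest)
library(xml2)

# Hypothetical two-row table; the second player has two flag images
html <- '<table>
  <tr><th>Player</th><th>Nat.</th></tr>
  <tr><td>Player A</td><td class="zentriert nat-transfer-cell"><img title="Spain" src="es.png"></td></tr>
  <tr><td>Player B</td><td class="zentriert nat-transfer-cell"><img title="France" src="fr.png"><img title="Mali" src="ml.png"></td></tr>
</table>'

doc <- read_html(html)

# Find every flag image and copy its title attribute into its text content
flags <- xml_find_all(doc, "//td[contains(@class, 'nat-transfer-cell')]/img")
xml_set_text(flags, paste0(xml_attr(flags, "title"), ";"))

# html_table() now finds text in the cells that used to hold only images
demo <- html_table(html_node(doc, "table"))
demo[["Nat."]]
```

A cell with two flags should come out as both titles joined together, each followed by a semicolon, which is why the separator is added.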
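library(xml2)

# `tables` is assumed to be the table nodes selected in the question
tables <- webcontent %>%
  html_nodes(".responsive-table table")

# Get flags using XPath
node_flags <- tables %>%
  xml_find_all(".//td[contains(@class, 'nat-transfer-cell')]/img")
# Write each flag's title into the img node as text, with ';' as separator
countries <- node_flags %>%
  xml_attr("title")
node_flags %>%
  xml_set_text(paste0(countries, ";"))

# Resume extraction
table_text <- tables %>%
  html_table()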
The nationalities will be in the Nat. column:
> table_text[[1]] %>% head
   Arrivals                        Age  Nat.               Position            Pos  Market value   Moving from  Moving from  Transfer fee
1  Álvaro MorataÁ. Morata          24   Germany;           Centre-Forward      CF   40,00 Mill. €  NA           Real Madrid  62,00 Mill. €
2  Tiemoué BakayokoT. Bakayoko     22   DR Congo;England;  Defensive Midfield  DM   16,00 Mill. €  NA           Monaco       40,00 Mill. €
3  Danny DrinkwaterD. Drinkwater   27   Albania;           Central Midfield    CM   9,00 Mill. €   NA           Leicester    37,90 Mill. €
4  Antonio RüdigerA. Rüdiger       24   England;England;   Centre-Back         CB   25,00 Mill. €  NA           AS Roma      35,00 Mill. €
5  Davide ZappacostaD. Zappacosta  25   Greece;            Right-Back          RB   8,50 Mill. €   NA           Torino       25,00 Mill. €
6  Ross BarkleyR. Barkley          24   Australia;         Attacking Midfield  AM   25,00 Mill. €  NA           Everton      16,90 Mill. €
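Since one player can have two nationalities, the semicolon-joined strings may need to be split again afterwards. A minimal sketch in base R, using a few values copied from the Nat. column above; the variable names are only illustrative:

```r
# Sample values copied from the Nat. column shown above
nat <- c("Germany;", "DR Congo;England;", "Albania;")

# Split on the separator; strsplit() drops the empty piece after the trailing ';'
nat_list <- strsplit(nat, ";", fixed = TRUE)

# First listed nationality and number of nationalities per player
first_nat <- vapply(nat_list, `[`, character(1), 1)
n_nat <- lengths(nat_list)

# Alternatively, keep one string per player and only drop the trailing ';'
nat_clean <- sub(";$", "", nat)
```

Keeping nat_list as a list column preserves every value per player, while first_nat gives a plain character vector if only one nationality is needed.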