I have a function that uses rvest to extract data from a web page. The function is below (its details are not very important):
processCardPackMinimalRealtorInfo = function(rowPosition){
  # collect realtor information
  realEstateInformation = bind_cols(
    realEstateCompanyName = CardPackMinimal[rowPosition] %>%
      html_elements('.re-CardPromotionLogo') %>%
      html_nodes("a") %>%
      html_children() %>%
      html_attr("title"),
    realEstatePageLink = CardPackMinimal[rowPosition] %>%
      html_elements('.re-CardPromotionLogo') %>%
      html_nodes("a") %>%
      html_attr('href') %>%
      paste("https://www.fotocasa.es", ., sep = "")
  )
  return(realEstateInformation)
}
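The function runs without an error, but when there is no information to extract it returns a tibble with 0 rows. I therefore tried wrapping it in purrr's possibly() so that it returns NA values when the tibble has 0 rows, but I cannot get possibly() to return a data frame of NA values when there is no information.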
possiblyProcessCardPackMinimalRealtorInfo = possibly(
  processCardPackMinimalRealtorInfo,
  otherwise = tibble(
    realEstateCompanyName = NA_character_,
    realEstatePageLink = NA_character_
  )
)
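My question is: how can I correct the possibly() call so that it returns NA values when the scraped data does not exist, i.e. when the tibble is 0 x 2 (the 2 columns being realEstateCompanyName and realEstatePageLink, generated in the original function)?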
Apologies in advance for not providing a dput or sample data; the data come from web scraping that takes several hours to run.
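possibly() only substitutes the otherwise value when the wrapped function throws an error; a 0-row tibble is a normal return value, so it is passed through unchanged. The function processCardPackMinimalRealtorInfo should therefore throw an error when its output has no rows, so that possibly() can handle it: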
library(tibble)
library(purrr)
data0 <- tibble(realEstateCompanyName = character(0),
                realEstatePageLink = character(0))
data1 <- tibble(realEstateCompanyName = "a",
                realEstatePageLink = "b")
processCardPackMinimalRealtorInfo <- function(data) {
  # turn an empty result into an error so that possibly() can catch it
  if (nrow(data) == 0) stop('no rows')
  data
}
processCardPackMinimalRealtorInfo(data0)
#> Error in processCardPackMinimalRealtorInfo(data0): no rows
list(data1, data0) %>% map(possibly(
  processCardPackMinimalRealtorInfo,
  otherwise = tibble(
    realEstateCompanyName = NA_character_,
    realEstatePageLink = NA_character_
  )
))
#> [[1]]
#> # A tibble: 1 × 2
#> realEstateCompanyName realEstatePageLink
#> <chr> <chr>
#> 1 a b
#>
#> [[2]]
#> # A tibble: 1 × 2
#> realEstateCompanyName realEstatePageLink
#> <chr> <chr>
#> 1 <NA> <NA>
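If you would rather not edit the scraping function itself, the same idea can be sketched as a small wrapper that turns an empty result into an error before possibly() sees it. This is only a sketch under assumptions: error_if_empty() and fake_scraper() are hypothetical names made up for this illustration (neither is part of purrr or rvest), and fake_scraper() merely stands in for the real scraping function, returning 0 rows for even positions.

```r
library(tibble)
library(purrr)

# Hypothetical helper (not part of purrr): wraps a function so that a
# 0-row result is converted into an error.
error_if_empty <- function(f) {
  function(...) {
    result <- f(...)
    if (nrow(result) == 0) stop("no rows")
    result
  }
}

# Hypothetical stand-in for the scraping function: even positions
# simulate a page element with no realtor information.
fake_scraper <- function(rowPosition) {
  if (rowPosition %% 2 == 0) {
    tibble(realEstateCompanyName = character(0),
           realEstatePageLink = character(0))
  } else {
    tibble(realEstateCompanyName = paste0("company", rowPosition),
           realEstatePageLink = paste0("/link", rowPosition))
  }
}

# possibly() now sees an error for empty results and returns the NA row.
safe_scraper <- possibly(
  error_if_empty(fake_scraper),
  otherwise = tibble(realEstateCompanyName = NA_character_,
                     realEstatePageLink = NA_character_)
)

results <- map(1:3, safe_scraper)
```

With the real function you would pass processCardPackMinimalRealtorInfo (the rvest version from the question) to error_if_empty() instead of the stand-in.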
Another option is to handle the 0-row case inside the function itself:
processCardPackMinimalRealtorInfo <- function(data) {
  # replace an empty result with a single row of NA values
  if (nrow(data) == 0) {
    data <- tibble(realEstateCompanyName = NA_character_,
                   realEstatePageLink = NA_character_)
  }
  data
}
list(data1, data0) %>% map(processCardPackMinimalRealtorInfo)
#> [[1]]
#> # A tibble: 1 × 2
#> realEstateCompanyName realEstatePageLink
#> <chr> <chr>
#> 1 a b
#>
#> [[2]]
#> # A tibble: 1 × 2
#> realEstateCompanyName realEstatePageLink
#> <chr> <chr>
#> 1 <NA> <NA>
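One reason to return a one-row NA tibble rather than a 0-row one is that each input then contributes exactly one row when the pieces are stacked. A minimal sketch, assuming you want a single combined tibble at the end and that dplyr is loaded (the question already uses bind_cols() from it); the name combined is made up for this example:

```r
library(tibble)
library(purrr)
library(dplyr)

data0 <- tibble(realEstateCompanyName = character(0),
                realEstatePageLink = character(0))
data1 <- tibble(realEstateCompanyName = "a",
                realEstatePageLink = "b")

# Same in-function handling as above: an empty result becomes one NA row.
processCardPackMinimalRealtorInfo <- function(data) {
  if (nrow(data) == 0) {
    data <- tibble(realEstateCompanyName = NA_character_,
                   realEstatePageLink = NA_character_)
  }
  data
}

# Stack the per-input results; the empty input is kept as a row of NA
# instead of silently disappearing.
combined <- list(data1, data0) %>%
  map(processCardPackMinimalRealtorInfo) %>%
  bind_rows()
```

Here combined has two rows, one per input, so it stays aligned with whatever you are iterating over.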