How to convert a character object (a parsed web page) into a tidy object in R



Using the following code:

library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
clear.text <- gettxt(url)

I get:

clear.text
[1] "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search\n\n\"Turing\" redirects here. For other uses, see Turing (disambiguation).\n\nmathematician and computer scientist\n\nAlan Turing\n\nOBE FRS\n\nTuring aged 16\n\nBorn (1912-06-23)23 June 1912\n\nM...

I would like to store this data in a tidy object, for example:

tidy.text <- tidy(clear.text)

But I get the warning:

'tidy.character' is deprecated.

and the result is a single row holding the whole page:

# A tibble: 1 x 1
           x
       <chr>
1 "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJum
> 

So how can I convert plain text like this into a tidy format?

Thanks in advance.

If you have a Wikipedia link or other HTML, the unnest_tokens() function from tidytext can parse and tidy it directly.

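library(tidytext)
library(tidyverse)

# Read the raw HTML line by line, then let unnest_tokens() strip the markup
# and return one lowercase word per row.
read_lines("https://en.wikipedia.org/wiki/Alan_Turing") %>%
  tibble(text = .) %>%   # tibble() replaces the deprecated data_frame()
  unnest_tokens(word, text, format = "html")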
#> # A tibble: 15,460 x 1
#>    word     
#>    <chr>    
#>  1 alan     
#>  2 turing   
#>  3 wikipedia
#>  4 this     
#>  5 is       
#>  6 a        
#>  7 good     
#>  8 article  
#>  9 follow   
#> 10 the      
#> # ... with 15,450 more rows

Created on 2018-12-18 by the reprex package (v0.2.1)
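
If you would rather start from the string you already got from gettxt(), the same idea should work on plain text: wrap the character vector in a one-column tibble and tokenize it. This is only a minimal sketch, assuming the tibble and tidytext packages are installed. The short string below is a made-up stand-in for the real clear.text so the example runs without a network call, and the column names text and word are arbitrary choices.

```r
library(tibble)
library(tidytext)

# Stand-in for the value returned by gettxt(url); the real text is much longer.
clear.text <- "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search"

# Put the character vector into a one-column tibble ...
text_df <- tibble(text = clear.text)

# ... then split it into one lowercase word per row (the default tokenizer
# drops punctuation and treats newlines and tabs as separators).
tidy.text <- unnest_tokens(text_df, word, text)

tidy.text
```

The default format is plain text, so format = "html" is not needed here because gettxt() has already removed the markup. Passing token = "sentences" or token = "lines" instead should give one sentence or one line per row if single words are too fine-grained.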
