With the following code:
library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
clear.text <- gettxt(url)
I get:
clear.text
[1] "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search\n\n\"Turing\" redirects here. For other uses, see Turing (disambiguation).\n\nmathematician and computer scientist\n\nAlan Turing\n\nOBE FRS\n\nTuring aged 16\n\nBorn (1912-06-23)23 June 1912\n\nM...
I would like to store this data in a tidy object, for example:
tidy.text <- tidy(clear.text)
But I get the warning
'tidy.character' is deprecated.
and the result is
# A tibble: 1 x 1
x
<chr>
1 "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJum
>
So how can I convert plain text like this into a tidy format?
Thanks in advance.
If you have a Wikipedia link or other HTML, the unnest_tokens() function in tidytext can parse and tidy it directly.
library(tidytext)
library(tidyverse)
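# Read the raw HTML line by line, wrap it in a one-column tibble,
# then let unnest_tokens() strip the markup and split it into words.
# tibble() replaces data_frame(), which is deprecated in current tibble releases.
read_lines("https://en.wikipedia.org/wiki/Alan_Turing") %>%
  tibble(text = .) %>%
  unnest_tokens(word, text, format = "html")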
#> # A tibble: 15,460 x 1
#> word
#> <chr>
#> 1 alan
#> 2 turing
#> 3 wikipedia
#> 4 this
#> 5 is
#> 6 a
#> 7 good
#> 8 article
#> 9 follow
#> 10 the
#> # ... with 15,450 more rows
Created on 2018-12-18 by the reprex package (v0.2.1)
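
If you would rather keep the plain text that gettxt() already returned, the same function should work on it too. The following is only a minimal sketch, not part of the original answer: it assumes the string contains "\n" line breaks as in the question's output, and it uses a short hard-coded sample in place of the real page so that it runs offline. The names lines and text_df are made up for illustration.

```r
library(tibble)
library(tidytext)

# Stand-in for the string returned by gettxt(url); shortened sample (assumption)
clear.text <- "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search"

# Split the single string into lines and drop the empty ones
lines <- strsplit(clear.text, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]

# One row per line of text, keeping the line number for later reference
text_df <- tibble(line = seq_along(lines), text = lines)

# One row per word; the default tokenizer lowercases and strips punctuation
tidy.text <- unnest_tokens(text_df, word, text)
tidy.text
```

Keeping a line column is optional, but it makes it possible to trace each word back to where it appeared in the page. With the real output of gettxt(), replace the sample string with clear.text from the question.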