R - quanteda - Creating a corpus from a data frame containing multiple documents



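This is my first question here, so apologies for any faux pas. I have a data frame in R with 657 observations of 4 variables. Each observation is a speech or interview by the Australian Prime Minister. The variables are: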

  • date
  • title
  • URL
  • txt (the full text of the speech/interview)

I am trying to turn this into a corpus in quanteda.

I first tried corp <- corpus(all_content), but it gave me an error message:

Error in corpus.data.frame(all_content) : 
text_field column not found or invalid

This does run, though: corp <- corpus(paste(all_content))

and then summary(corp) gives me

Corpus consisting of 4 documents, showing 4 documents:
Text Types  Tokens Sentences
text1   243    1316         1
text2  1095    6523         3
text3   661    2630         1
text4 25243 1867648     62572

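My understanding is that this has effectively turned each column into a document, rather than each row?

In case it matters, the txt variable is stored as a list. The code used to create each row is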

```{r new_function}
scrape_speech <- function(url) {
  speech_page <- read_html(url)

  date <- speech_page %>% html_nodes(".date-display-single") %>% html_text() %>% dmy()
  title <- speech_page %>% html_nodes(".pagetitle") %>% html_text()
  txt <- speech_page %>% html_nodes("#block-system-main p") %>% html_text() %>% list()

  tibble(date = date, title = title, URL = url, txt = txt)
}
```

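I then used the map_dfr function to go through and scrape the 657 individual URLs.

It was suggested to me that this is because txt is stored as a list. I tried leaving list() out of the function and got 21,904 observations, because each paragraph of each full-text document became a separate observation. I can turn that into a corpus with corp <- corpus(paste(all_content_not_list)) (again, without paste I get the same error as above). This also gives me 4 documents in the corpus! summary(corp) gives me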

Corpus consisting of 4 documents, showing 4 documents:
Text Types  Tokens Sentences
text1   243   43810         1
text2  1092  214970        25
text3   657   87618         1
text4 25243 1865687     62626

Thanks in advance, Daniel

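It is hard to address this precisely without a reproducible example of your data.frame object, but if its structure contains the variables you listed, then this should do it: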

corpus(all_content, text_field = "txt")

See ?corpus.data.frame for details. If that does not work, try adding to your question the output of

str(all_content)

so that we can see in more detail what is in your all_content object.
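As a rough illustration of what to look for, the base R sketch below uses a made-up two-row data frame (`toy` and everything in it is hypothetical, not your data) to show how a list column reveals itself, and why `paste()` on a whole data frame returns one string per column rather than one per row.

```r
# Hypothetical stand-in for all_content: two documents, txt stored as a list column
toy <- data.frame(
  title = c("Speech A", "Speech B"),
  URL = c("https://example.org/a", "https://example.org/b"),
  stringsAsFactors = FALSE
)
toy$txt <- list(c("First paragraph.", "Second paragraph."), "Only paragraph.")

# A list column holds one character vector per row, possibly of varying length
txt_is_list <- is.list(toy$txt)
paragraphs_per_doc <- lengths(toy$txt)

# paste() treats the data frame as a list of columns, so it returns
# one string per column (3 here), not one per row (2 here)
pasted <- paste(toy)
n_pasted <- length(pasted)
```

That would be consistent with a 4-column data frame producing a 4-document corpus, whatever the number of rows.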

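Edited after the OP added new data:

OK, so txt in the tibble is a list of character elements. You need to combine these into a single character value per row so that it can be used as input to corpus.data.frame(). Here's how: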

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dframe <- structure(list(
date = structure(18620, class = "Date"),
title = " Prime Minister's Christmas Message to the ADF",
URL = "https://www.pm.gov.au/media/prime-ministers-christmas-message-adf",
txt = list(c(
"G'day and Merry Christmas to everyone in our Australian Defence Force.",
"You know, throughout our history, successive Australian governments... And this year was no different.",
"God bless."
))
),
row.names = c(NA, -1L),
class = c("tbl_df", "tbl", "data.frame")
)
dframe$txt <- vapply(dframe$txt, paste, character(1), collapse = " ")
corp <- corpus(dframe, text_field = "txt")
print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 3 docvars.
## text1 :
## "G'day and Merry Christmas to everyone in our Australian Defence Force. You know, throughout our history, successive Australian governments... And this year was no different. God bless."

Created on 2021-04-08 by the reprex package (v1.0.0)
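If you would rather keep the version without `list()`, where each paragraph is its own row, one possible approach is to collapse the paragraphs back into one row per speech before building the corpus. The sketch below uses base R's `aggregate()` on a hypothetical data frame `paras`; it assumes that URL uniquely identifies a speech, which I have not checked against your data.

```r
# Hypothetical paragraph-per-row data: two speeches, three paragraphs in total
paras <- data.frame(
  URL = c("https://example.org/a", "https://example.org/a", "https://example.org/b"),
  txt = c("First paragraph.", "Second paragraph.", "Only paragraph."),
  stringsAsFactors = FALSE
)

# Join all paragraphs that share a URL into a single string per speech
docs <- aggregate(txt ~ URL, data = paras, FUN = paste, collapse = " ")
```

The resulting `docs` has one row per URL and a plain character txt column, so it should be usable with corpus(docs, text_field = "txt"). Columns such as date and title would need to be carried along, for example by adding them to the right-hand side of the formula.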
