Create a binary yes/no "animal" variable based on a match with any term in a dictionary (R)



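Following up on this question: "R: Create category column reflecting match between dictionaries and column in df". I have a large dataset, "df", with 30,000 rows, and two large dictionary data frames: (1) animal, 600k rows; (2) nature, 300k rows.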

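I am simply trying to figure out how to create two binary variables, "df$content_animal" and "df$content_nature", based on whether each row of df$content has any match in the "animal" or "nature" dictionary (1 = match, 0 = no match).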

Here is a sample of the data; I can't include the entire dataset here:

library(tibble)
df <- tibble(content = c("hello turkey feet blah blah blah", "i love rabbits haha", "wow this sunlight is amazing", "omg did u see the rainbow?!", "turtles like swimming in the water", "i love running across grassy lawns with my dog"))
animal <- c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature <- c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")

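I tried the following code, based on matching multiple patterns, but it did not work. I suspect this is because both my dataset and my dictionaries/patterns are very large: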

df$content_animal <- grepl(paste(animal,collapse="|"),df$content,ignore.case=TRUE)
df$content_nature <- grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)

This returned the errors:

Error in grepl(paste(animal,collapse="|"), df$content,  : 
invalid regular expression, reason 'Out of memory'
Error in grepl(paste(nature,collapse="|"), df$content,  : 
invalid regular expression, reason 'Out of memory'

I also tried:

df <- df %>%
  mutate(
    content_animal = case_when(grepl(animal, content) ~ "1")
  )
df <- df %>%
  mutate(
    content_nature = case_when(grepl(nature, content) ~ "1")
  )

This returned the errors:

Problem with `mutate()` input `content_animal`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_animal` is `case_when(grepl(animal, content) ~ "1")`.
argument 'pattern' has length > 1 and only the first element will be used
Problem with `mutate()` input `content_nature`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_nature` is `case_when(grepl(nature, content) ~ "1")`.
argument 'pattern' has length > 1 and only the first element will be used

I also tried:

bench::mark(
  basic = mutate(df,
    content_animal = 1L * map_lgl(content, ~ any(str_detect(.x, animal))),
    content_nature = 1L * map_lgl(content, ~ any(str_detect(.x, nature)))),
  fixed = mutate(df,
    content_animal = 1L * map_lgl(content, ~ any(str_detect(.x, fixed(animal)))),
    content_nature = 1L * map_lgl(content, ~ any(str_detect(.x, fixed(nature)))))
)

It ran for more than two hours without giving me any output.

I really don't know what to do. Does anyone have any ideas? Is there a better package or approach I could use for data of this size?

It may be better to loop over the terms with lapply and combine the results with Reduce:

Reduce(`|`, lapply(nature, function(x) grepl(x, df$content, ignore.case = TRUE)))
#[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

This gives the same result as:

grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
#[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
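The two forms above can also be combined: rather than one giant alternation (which is what ran out of memory) or one grepl call per term, the dictionary can be matched a chunk at a time. The following is only a sketch under stated assumptions: `any_match` is a hypothetical helper name, the chunk size of 1000 is an arbitrary starting point, the terms are assumed to contain no regex metacharacters, and it has not been benchmarked on the full 30,000-row dataset.

```r
# Hypothetical helper: 1 if a text contains any dictionary term, else 0.
# Assumes terms are plain words with no regex metacharacters.
any_match <- function(text, terms, chunk_size = 1000L) {
  terms <- unique(terms)
  # Split the dictionary into groups of at most chunk_size terms
  chunks <- split(terms, ceiling(seq_along(terms) / chunk_size))
  hit <- logical(length(text))
  for (chunk in chunks) {
    todo <- !hit                      # rows that have not matched yet
    if (!any(todo)) break             # every row already matched
    hit[todo] <- grepl(paste(chunk, collapse = "|"), text[todo],
                       ignore.case = TRUE)
  }
  as.integer(hit)
}

content <- c("hello turkey feet blah blah blah", "i love rabbits haha",
             "wow this sunlight is amazing", "omg did u see the rainbow?!",
             "turtles like swimming in the water",
             "i love running across grassy lawns with my dog")
animal <- c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature <- c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")

# A tiny chunk size is used here only to exercise the chunking on the sample
content_animal <- any_match(content, animal, chunk_size = 2L)
content_nature <- any_match(content, nature, chunk_size = 2L)
```

Skipping rows that have already matched means later chunks scan fewer strings, which may help when most rows match early; whether it is fast enough with 600k terms would need to be measured.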

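Here is an approach using the quanteda package, which has built-in functionality for what you want to do. (I have only tried it on the sample dataset; I would be curious to hear how it performs on the full dataset.)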

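library(quanteda)
corp <- corpus(df$content)   # renamed from `c` to avoid masking base::c
dict <- dictionary(list(animal = animal, nature = nature))
# Recent quanteda versions expect tokens as input to dfm(); dictionary matching is done with dfm_lookup()
counts <- dfm_lookup(dfm(tokens(corp)), dictionary = dict)
df <- cbind(df, convert(counts, to = "data.frame")[, -1])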
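Two details of the quanteda approach are worth checking against the goal of 0/1 columns. The lookup returns match counts per document rather than a binary flag, and dictionary entries are matched against whole tokens using glob patterns by default, so "rabbit" would not match "rabbits" the way grepl does. The sketch below assumes prefix matching is acceptable (a trailing "*" is appended to every term) and converts the counts to 0/1. Note that prefix matching is still not identical to substring matching: "ice*" would not match "nice", for example.

```r
library(quanteda)

content <- c("hello turkey feet blah blah blah", "i love rabbits haha",
             "wow this sunlight is amazing", "omg did u see the rainbow?!",
             "turtles like swimming in the water",
             "i love running across grassy lawns with my dog")
animal <- c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature <- c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")

# Assumption: prefix matching is wanted, so "rabbit*" also matches "rabbits"
dict <- dictionary(list(animal = paste0(animal, "*"),
                        nature = paste0(nature, "*")))

# Tokenise, build the document-feature matrix, count matches per dictionary key
counts <- dfm_lookup(dfm(tokens(content)), dictionary = dict)
counts_df <- convert(counts, to = "data.frame")

# Collapse counts to 1 = at least one match, 0 = no match
content_animal <- as.integer(counts_df$animal > 0)
content_nature <- as.integer(counts_df$nature > 0)
```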
