R: Applying a character-count method to a frequency-weighted word list



OK, I have a list of words and their frequencies. There are a lot of them, tens of thousands. Here is a small example:

w = c("abandon", "break", "fuzz", "when")
f = c(2, 10, 8, 200)
df = data.frame(cbind(w, f))
df
        w   f
1 abandon   2
2   break  10
3    fuzz   8
4    when 200

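What I want to do is count the characters in each word and then aggregate the results. The count_chars function in the ds4psy package does this for a given vector of strings. I managed to do it by building one giant vector of strings from the word list (which has tens of thousands of words), as shown below: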

library(ds4psy) # for count_chars function 
library(dplyr)
w = c("abandon", "break", "fuzz", "when")
f = c(2, 10, 8, 200)
df = data.frame(cbind(w, f))
df$w = as.character(df$w)
df$f = as.integer(df$f)
# repword will repeat wrd frq times with no spaces between
repword <- function(frq, wrd) paste(rep(times=frq, x=wrd), collapse="")
# now we create one giant vector of strings to do the counts on 
# CAUTION -- uses lots of memory when you have 10s of 1000s of words
mytext = paste(mapply(repword,  df$f, df$w))
# get a table of letter counts
mycounts = count_chars(mytext)
# convert to data frame sorted by character
mycounts.df <- mycounts[order(names(mycounts))] %>%
  as.data.frame()
# sort by Freq in descending order
mycounts.df %>%
  arrange(desc(Freq))

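However, a colleague does not have enough memory to use this brute-force solution. So I tried to work out a way to do it word by word with foreach or mapply, but I am really stuck.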

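One problem is that you need a vector containing every letter in order to combine the counts (as far as I can tell). So I created a dummy word containing all the letters, then made an adjustment so that the dummy letters are not counted each time.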

# create a dummy string that is a-z
dummy = paste0(letters, collapse="")
# now we create a count - it will be all 1s; we will subtract it every time
dummycount = count_chars(dummy)

countword <- function(frq, wrd) {
  myword = paste0(dummy, wrd, collapse="")
  # subtract 1 from each letter to correct for dummy
  mycount = count_chars(myword) - dummycount
  mycount = mycount * frq # multiply by frequency
  return(mycount)
}
totalcount = dummycount - 1 # set a table to zeroes

foreach(frq = df$f, wrd = df$w) %do% {
  totalcount = totalcount + countword(frq, wrd)
}

But this does not work at all... I get a strange result:


> totalcount
chars
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
16 12 10  6  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 

Any suggestions would be much appreciated!

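If we want the same output with foreach (assuming the OP wants to use foreach), just loop over the sequence of rows of df and add the counts by name: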

library(foreach)
library(parallel)
library(doSNOW)
no_of_cores = detectCores()
cl <- makeSOCKcluster(no_of_cores)
registerDoSNOW(cl)
out <- foreach(i = 1:nrow(df), .export = "count_chars",
               .combine = `+`) %dopar% {
  tmp <- countword(df$f[i], df$w[i])
  totalcount[names(tmp)] <- totalcount[names(tmp)] + tmp
  totalcount
}
stopCluster(cl)

-output

> out
a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z 
14  12   0   2 210   8   0 200   0   0  10   0   0 204   2   0   0  10   0   0   8   0 200   0   0  16 
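
The step that matters above is the name-based indexing, `totalcount[names(tmp)]`. A plausible reason the original loop went wrong is that R arithmetic on named vectors and tables is positional and ignores names, so two tables whose letters are in different orders get added letter-to-wrong-letter. Whether count_chars really returns its letters in a different order from call to call (for example, sorted by frequency) is an assumption here and worth checking in the ds4psy documentation. A minimal sketch of the difference, using made-up vectors `x` and `y` rather than real count_chars output:

```r
# Two named count vectors whose letters are in different orders
# (hypothetical values, chosen only to make the effect visible)
x <- c(a = 1, b = 2, c = 3)
y <- c(c = 10, a = 20, b = 30)

# Plain `+` ignores names: it adds element by element and keeps x's names
positional <- x + y

# Indexing by name first lines the letters up before adding
by_name <- x
by_name[names(y)] <- by_name[names(y)] + y

print(positional)
print(by_name)
```

In the positional sum, the value stored under "a" is really a + c, which is the kind of mix-up that would produce counts attached to the wrong letters.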

Can you simply multiply the output of count_chars() by f, and do this row by row?

library(data.table)
setDT(df)[, data.table(count_chars(w)*f), by=1:nrow(df)][, .(ct = sum(N)), chars][order(-ct)]

Output:

    chars  ct
 1:     e 210
 2:     n 204
 3:     h 200
 4:     w 200
 5:     z  16
 6:     a  14
 7:     b  12
 8:     k  10
 9:     r  10
10:     f   8
11:     u   8
12:     d   2
13:     o   2
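
The same idea, weighting each word's letters by its frequency instead of repeating the word, can also be sketched in base R without ds4psy or data.table. This is only an illustration under the assumption that every character should be counted as-is (no case folding or removal of special characters, which count_chars may handle differently); the names `chars_by_word`, `weights` and `ct` are made up for the example.

```r
# Example data from the question
w <- c("abandon", "break", "fuzz", "when")
f <- c(2L, 10L, 8L, 200L)

# Split every word into single characters (one character vector per word)
chars_by_word <- strsplit(w, "")

# Repeat each word's frequency once per character so weights line up with letters
weights <- rep(f, times = lengths(chars_by_word))

# Sum the weights per letter
ct <- tapply(weights, unlist(chars_by_word), sum)

# Sort in descending order of weighted count
ct <- sort(ct, decreasing = TRUE)
print(ct)
```

Memory use here grows with the total number of characters in the word list, not with the sum of the frequencies, which is what made the giant-string approach expensive.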
