OK, I have a list of words and their frequencies. There are a lot of them, tens of thousands. Here is a small example:
w = c("abandon", "break", "fuzz", "when")
f = c(2, 10, 8, 200)
df = data.frame(cbind(w, f))
df
        w   f
1 abandon   2
2   break  10
3    fuzz   8
4    when 200
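What I want to do is count the characters in each word and then aggregate the results. The count_chars function in the ds4psy package does this for a given vector of strings. I managed it by building one giant vector of strings from the word list (which has tens of thousands of words), as shown here: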
library(ds4psy) # for count_chars function
library(dplyr)
w = c("abandon", "break", "fuzz", "when")
f = c(2, 10, 8, 200)
df = data.frame(cbind(w, f))
df$w = as.character(df$w)
df$f = as.integer(df$f)
# repword will repeat wrd frq times with no spaces between
repword <- function(frq, wrd) paste(rep(times=frq, x=wrd), collapse="")
# now we create one giant vector of strings to do the counts on
# CAUTION -- uses lots of memory when you have 10s of 1000s of words
mytext = paste(mapply(repword, df$f, df$w))
# get a table of letter counts
mycounts = count_chars(mytext)
# convert to data frame sorted by character
mycounts.df <- mycounts[order(names(mycounts))] %>%
  as.data.frame()
# sort by Freq in descending order
mycounts.df %>%
  arrange(desc(Freq))
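The memory cost of this approach can be estimated before running it: the repeated strings hold one character per letter occurrence, so their total length is the sum of word length times frequency. A quick sketch on the sample data, using base R only:

```r
w <- c("abandon", "break", "fuzz", "when")
f <- c(2L, 10L, 8L, 200L)

# Total number of characters the repeated strings would contain
# (7*2 + 5*10 + 4*8 + 4*200 = 896 for the sample data)
total_chars <- sum(nchar(w) * f)
total_chars
```

With tens of thousands of words and large frequencies, this number grows quickly, which is where the memory pressure comes from.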
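However, a colleague does not have enough memory to use this brute-force solution. So I tried to find a way to do it word by word with foreach or mapply, but I am really stuck.
One problem is that you need a vector containing every letter in order to combine the counts (as far as I can tell). So I created a dummy word containing all the letters, then adjusted for it so that the dummy letters are not counted each time.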
# create a dummy string that is a-z
dummy = paste0(letters, collapse="")
# now we create a count - it will be all 1s; we will subtract it every time
dummycount = count_chars(dummy)
countword <- function(frq, wrd) {
  myword = paste0(dummy, wrd, collapse="")
  # subtract 1 from each letter to correct for dummy
  mycount = count_chars(myword) - dummycount
  mycount = mycount * frq # multiply by frequency
  return(mycount)
}
totalcount = dummycount - 1 # set a table to zeroes
foreach(frq = df$f, wrd = df$w) %do% {
  totalcount = totalcount + countword(frq, wrd)
}
But this does not work at all... I get a strange result:
> totalcount
chars
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
16 12 10  6  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Any suggestions would be much appreciated!
If we want the same output with foreach (assuming the OP wants to use foreach), simply loop over the sequence of rows:
library(foreach)
library(parallel)
library(doSNOW)
no_of_cores = detectCores()
cl <- makeSOCKcluster(no_of_cores)
registerDoSNOW(cl)
out <- foreach(i = 1:nrow(df), .export = "count_chars",
               .combine = `+`) %dopar% {
  tmp <- countword(df$f[i], df$w[i])
  totalcount[names(tmp)] <- totalcount[names(tmp)] + tmp
  totalcount
}
stopCluster(cl)
-output
> out
  a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
 14  12   0   2 210   8   0 200   0   0  10   0   0 204   2   0   0  10   0   0   8   0 200   0   0  16
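The important change in the loop is that counts are added by name rather than by position. A minimal base-R sketch of the difference, using small hand-built named vectors (the values are made up for illustration and are not output from count_chars):

```r
# Running total, in alphabetical order
total <- c(a = 0L, b = 0L, n = 0L)

# Hypothetical counts for one word: same letters, but in a different order
tmp <- c(n = 2L, a = 1L, b = 1L)

# Positional addition keeps the names of `total` but uses the order of `tmp`,
# so the count for "n" is credited to "a"
by_position <- total + tmp

# Name-based addition matches each count to its own letter
by_name <- total
by_name[names(tmp)] <- by_name[names(tmp)] + tmp

by_position
by_name
```

Here `by_position` reports two occurrences of "a", while `by_name` correctly reports two occurrences of "n".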
Could you simply multiply the output of count_chars() by f, and do this row by row?
library(data.table)
setDT(df)[, data.table(count_chars(w)*f), by=1:nrow(df)][, .(ct = sum(N)), chars][order(-ct)]
Output:
    chars  ct
 1:     e 210
 2:     n 204
 3:     h 200
 4:     w 200
 5:     z  16
 6:     a  14
 7:     b  12
 8:     k  10
 9:     r  10
10:     f   8
11:     u   8
12:     d   2
13:     o   2
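For comparison, the same weighted totals can be sketched in base R alone, without building any repeated strings. This is an alternative under stated assumptions rather than the method of either answer: it counts only the lowercase letters a-z (any other character is dropped), and `count_letters` is a hypothetical helper name.

```r
w <- c("abandon", "break", "fuzz", "when")
f <- c(2L, 10L, 8L, 200L)

# Hypothetical helper: letter counts for one word as a length-26 integer vector.
# factor(levels = letters) fixes the order and fills absent letters with 0.
count_letters <- function(wrd) {
  chars <- strsplit(wrd, "")[[1]]
  as.vector(table(factor(chars, levels = letters)))
}

# One row per letter, one column per word (26 x length(w))
per_word <- vapply(w, count_letters, integer(26))

# Weight each word's counts by its frequency and sum across words
total <- setNames(as.vector(per_word %*% f), letters)

# Letters that actually occur, most frequent first
sort(total[total > 0], decreasing = TRUE)
```

On the sample data this gives 210 for "e" and 204 for "n", in line with the outputs shown above. The intermediate matrix has only 26 rows per word, so its size depends on the number of words and not on their frequencies.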