My machine seems to be very slow at this summing:
library(plyr)
# Function for creating n random pseudowords of predefined length, needed for colnames.
# Proposed by: http://ryouready.wordpress.com/2008/12/18/generate-random-string-name/
colnamesString <- function(n=10000, length=12)
{
  randomString <- character(n) # initialize vector
  for (i in 1:n)
  {
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS),
                                    length, replace=TRUE),
                             collapse="")
  }
  return(randomString)
}
set.seed(1)
myColnames <- colnamesString(10000, 8) # vector of 10000 random colnames of length 8
datfra <- data.frame(matrix(data = sample(c(0,1),(10000*1500), replace= TRUE), nrow= 1500, ncol= 10000, dimnames= list(NULL, myColnames))) # creates a random dataframe with the colnames created before and binary values (not essential, just for readability)
datfra <- cbind(datfra, colID=(sample(c(1:150),1500, replace= TRUE))) # appends an ID column
datfra[1:5,c(1:3,10001)] # small section of the created dataframe, with corresponding IDs
coldatfra <- ddply(datfra[1:50,c(1:5,10001)], .(colID), numcolwise(sum)) # The solution for a small subset of the big dataframe.
# It works fine! But applied to the whole dataframe it never finishes computing.
# So the challenge is: how can this be computed efficiently with an ALTERNATIVE approach?
coldatfra <- ddply(datfra, .(colID), numcolwise(sum)) # stopped after 15 min of computing
EDIT:
The aim is to sum up, column by column, all entries in all columns for each unique colID. The expected result is:
coldatfra[1:10,c(1:5,10001)] # Small subset of rows, only for five columns + colID column:
gnzUcTWE D3caGnLu IZnMVdE7 gn0nRltB ubPFN6Ip colID
1 3 4 5 5 6 12
2 10 8 7 4 7 24
3 4 8 4 5 5 36
4 2 4 6 5 5 36
5 5 6 6 6 7 55
6 5 2 4 3 4 42
7 5 3 6 5 4 63
8 8 12 8 8 10 160
9 7 3 5 3 3 90
10 2 3 1 2 2 60
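For reference, base R's `rowsum()` computes exactly this kind of per-group column sum and is usually fast; a minimal sketch, assuming `datfra` is built as above:

```r
# Sum every column within each unique colID using base R's rowsum().
# Assumes datfra holds the binary columns plus a colID column, as built above.
coldatfra <- rowsum(datfra[, names(datfra) != "colID"], group = datfra$colID)
# Row names of coldatfra are the unique colID values; columns keep their names.
```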
EDIT:
library(data.table)
res <- data.table(datfra)[, lapply(.SD, sum), by=colID]
# user system elapsed
# 8.32 0.05 8.38
That's roughly 4.5x faster than the ddply version. Unfortunately, it's still somewhat slow.
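For comparison, the same per-group column sums can be sketched with dplyr (an assumption here: dplyr >= 1.0 for `across()`; `datfra` as built above):

```r
library(dplyr)

# Group by colID and sum every remaining column; across(everything(), sum)
# applies sum() to each non-grouping column within each group.
res <- datfra %>%
  group_by(colID) %>%
  summarise(across(everything(), sum))
```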
Old stuff:
If I understand correctly what you're trying to do, you can do this much faster by first taking the row sums across all columns and then aggregating by group:
datfrasum <-
data.frame(
sums=rowSums(datfra[, names(datfra) != "colID"]),
colID=datfra$colID
)
ddply(datfrasum, .(colID), colSums)
# user system elapsed
# 0.37 0.02 0.39
The very slow step in this case is trying to generate all the groups for so many columns, so collapsing first is much faster. In general, you want to use data.table or dplyr rather than plyr, since the latter now lags behind the other two in performance, but even with those you should consider collapsing the columns first.
Here is a data.table alternative, but because it doesn't do the row sums first, it's actually slower than the approach above:
library(data.table)
dattab <- data.table(datfra)
dattab[, sum(unlist(.SD)), by=colID]
If you do the row sums first and use data.table, it will be even faster.
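That combination can be sketched as follows (assuming, as in the `datfrasum` approach above, that a single combined sum per group is acceptable):

```r
library(data.table)

# Collapse the 10000 columns into one row-sum column first, then aggregate
# that single column by colID; grouping over one column is cheap.
dt <- data.table(sums  = rowSums(datfra[, names(datfra) != "colID"]),
                 colID = datfra$colID)
res <- dt[, .(sums = sum(sums)), by = colID]
```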