R - Efficient selective sums over a high-dimensional sparse data frame



My machine seems to be very slow at computing this sum:

library(plyr)
# Function that creates n random pseudowords of a given length, used for colnames.
# Adapted from: http://ryouready.wordpress.com/2008/12/18/generate-random-string-name/
colnamesString <- function(n = 10000, len = 12)
{
  randomString <- character(n)            # initialize character vector
  for (i in 1:n)
  {
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS),
                                    len, replace = TRUE),
                             collapse = "")
  }
  return(randomString)
}
set.seed(1)
myColnames <- colnamesString(10000, 8) # character vector with 10000 random colnames of length 8
datfra <- data.frame(matrix(data = sample(c(0,1), 10000*1500, replace = TRUE), nrow = 1500, ncol = 10000, dimnames = list(NULL, myColnames))) # creates a random data frame with the colnames created above; binary values (not essential, just for readability)
datfra <- cbind(datfra, colID=(sample(c(1:150),1500, replace= TRUE))) # creates IDs vector
datfra[1:5, c(1:3, 10001)] # small section of the created data frame, with corresponding IDs
coldatfra <- ddply(datfra[1:50, c(1:5, 10001)], .(colID), numcolwise(sum)) # the solution for a small subset of the big data frame
# It works fine! But applied to the whole data frame it never finishes computing.
# So the challenge is: how can this be computed efficiently with an ALTERNATIVE approach?
coldatfra <- ddply(datfra, .(colID), numcolwise(sum)) # stopped after 15 min of computing

EDIT:

The aim is to sum up, column by column, all entries in every column for each unique colID. The result should look like:

coldatfra[1:10, c(1:5, 10001)] # Small subset of rows, only five columns + the colID column:
gnzUcTWE D3caGnLu IZnMVdE7 gn0nRltB ubPFN6Ip colID
1         3        4        5        5        6    12
2        10        8        7        4        7    24
3         4        8        4        5        5    36
4         2        4        6        5        5    36
5         5        6        6        6        7    55
6         5        2        4        3        4    42
7         5        3        6        5        4    63
8         8       12        8        8       10   160
9         7        3        5        3        3    90
10        2        3        1        2        2    60


EDIT: I think I misunderstood the OP; here is my new understanding, which preserves the columns:
library(data.table)
res <- data.table(datfra)[, lapply(.SD, sum), by=colID]
# user  system elapsed 
# 8.32    0.05    8.38     

This is roughly 4.5x faster than the ddply version. Unfortunately, it is still fairly slow.
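Base R's `rowsum()` is purpose-built for exactly this operation (group-wise column sums computed in compiled code), and may well beat both versions above. A minimal sketch on a small stand-in for `datfra` (the object names here are illustrative, not from the original post):

```r
# Group-wise column sums with base R's rowsum(): sums every column
# within each level of the grouping vector in a single pass.
set.seed(1)
m     <- matrix(sample(0:1, 20 * 1000, replace = TRUE), nrow = 20)  # small stand-in for datfra
colID <- sample(1:4, 20, replace = TRUE)                            # group IDs, one per row
coldatfra <- rowsum(m, group = colID)   # one row per unique colID, sorted by group
```

`rowsum()` returns a matrix with one row per unique group, in `sort(unique(group))` order, so it can be converted back to a data frame with `colID` recovered from the row names if needed.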


Old stuff:

If I understand correctly what you are trying to do, you can make this much faster by first computing the row sums across all columns and only then aggregating by group:

datfrasum <- data.frame(
  sums  = rowSums(datfra[, names(datfra) != "colID"]),
  colID = datfra$colID
)
ddply(datfrasum, .(colID), colSums)
# user  system elapsed 
# 0.37    0.02    0.39 

The very slow step here is generating all the groups for this many columns, so this approach is much faster. In general, you will want to use data.table or dplyr rather than plyr, since plyr now lags the other two in performance; but even with those, you should consider collapsing the columns first.
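For completeness, the dplyr equivalent of the per-column `ddply` call looks like the sketch below (a toy data frame is used here for illustration; `across(everything(), sum)` requires dplyr >= 1.0):

```r
# dplyr version of "sum every column within each colID group":
# group_by() defines the groups, across(everything(), sum) applies
# sum() to every non-grouping column.
library(dplyr)

df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(10, 20, 30, 40),
                 colID = c(1, 1, 2, 2))
res <- df %>%
  group_by(colID) %>%
  summarise(across(everything(), sum))
```

Like the data.table version, this still touches every one of the 10000 columns per group, so it shares the same scaling problem; collapsing to row sums first remains the bigger win.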

Here is a data.table alternative, but because it does not do the row sums first, it is actually slower than the approach above:

library(data.table)
dattab <- data.table(datfra)
dattab[, sum(unlist(.SD)), by=colID]

If you do the row sums first and use data.table, it will be the fastest.
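That combination could be sketched as follows (again on a toy table; column names are illustrative):

```r
# Fast path: collapse all value columns to a single row sum with
# rowSums(), then aggregate that one column by group with data.table.
library(data.table)

dt <- data.table(a = c(1, 2, 3, 4),
                 b = c(10, 20, 30, 40),
                 colID = c(1, 1, 2, 2))
dt[, sums := rowSums(.SD), .SDcols = c("a", "b")]   # per-row sum of value columns
res <- dt[, .(total = sum(sums)), by = colID]       # one aggregate per group
```

This only aggregates a single column by group, which is why it sidesteps the cost of forming 10000 per-group column sums.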
