r - Is there an elegant way to calculate the value for composite key values from a combination of "primary key" values in a dataset?



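Note: this is not a time series problem that can easily be solved with zoo (or at least, I don't see how to zoo this problem :)

I have a dataset with a number of "key columns" and a value that is associated with the combinations in which only one of the key columns is set. The value for a row with multiple key columns set can be calculated from the "one key column set" rows.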

Using normal programming techniques this is fairly simple (but messy), as shown below. I am hoping there is a better, more elegant way to do this in R.

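In this example I have three keys. For a composite key value such as [1,1,0], I calculate the value from the two primary keys Val[1,0,0] and Val[0,1,0]. Here I use a simple average, which is mean(c(2, 5)) = 3.5.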
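A quick way to check that arithmetic in base R is sketched below. The variable names are only illustrative. Note that the two values have to be wrapped in c(): in a call like mean(2, 5) the 5 is matched to the trim argument of mean() and is not averaged.

```r
# Val for the single-key rows [1,0,0] and [0,1,0] from the example data
val_100 <- 2
val_010 <- 5

# Pass both values as ONE vector so that both are averaged
composite <- mean(c(val_100, val_010))
print(composite)  # 3.5

# Pitfall: the second argument is taken as `trim`, so only val_100 is used
pitfall <- mean(val_100, val_010)
print(pitfall)    # 2
```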

library(dplyr)   # filter(), select(), left_join(), union(), %>%
library(tibble)  # tribble()

myMatrix <- tribble(
~`1`, ~`2`, ~`3`, ~Val,
0,0,0,1,
1,0,0,2,
2,0,0,2,
0,1,0,5,
1,1,0,NA,
2,1,0,NA,
0,2,0,6,
1,2,0,NA,
2,2,0,NA,
0,0,1,1,
1,0,1,NA,
2,0,1,NA,
0,1,1,NA,
1,1,1,NA,
2,1,1,NA,
0,2,1,NA,
1,2,1,NA,
2,2,1,NA
)
#Filter for NA in the Val col
tmpNARows <- myMatrix %>% filter(is.na(Val)) %>% select(-Val)
#Take the
tmpFirstRow <- TRUE
for (myR in 1:nrow(tmpNARows)) {
  #For each row in the NA table
  tmpMyNARow <- tmpNARows[myR, ]
  tmpFirstElement <- TRUE
  for (myC in 1:ncol(tmpMyNARow)) {
    #find the records that make up this one's parts
    #ignore columns with value 0
    if (0 != tmpMyNARow[myC]) {
      #Make Base Record for lookup
      tmpMyBaseRow <- tmpMyNARow
      for (myC2 in 1:ncol(tmpMyNARow)) {
        if (myC2 != myC) { tmpMyBaseRow[myC2] <- 0 }
      }
      if (tmpFirstElement == TRUE) {
        #Make a new Base table
        tmpMyBaseTable <- tmpMyBaseRow
        tmpFirstElement <- FALSE
      } else {
        #Append the Base row to the Base table
        tmpMyBaseTable <- union(tmpMyBaseTable, tmpMyBaseRow)
      }
    }
  }
  #Calculate the mean and store in as Val
  tmpVal <- (left_join(tmpMyBaseTable, myMatrix) %>% summarise(mean(Val)))[[1]]
  tmpMyNARowWithVal <- tmpMyNARow %>% mutate(Val = tmpVal)
  if (tmpFirstRow == TRUE) {
    tmpMyResultMatrix <- tmpMyNARowWithVal
    tmpFirstRow <- FALSE
  } else {
    tmpMyResultMatrix <- union(tmpMyResultMatrix, tmpMyNARowWithVal)
  }
}
#filter for non NA
tmpNonNARows <- myMatrix %>% filter(!is.na(Val))
#Add the calculated rows
myCalculatedMatrix <- union(tmpNonNARows, tmpMyResultMatrix)
#lets have a look
myCalculatedMatrix
#the (1,1,0) element is indeed 3.5 so it appears to be working.

The expected result should look like this:

myCalculatedMatrix %>% arrange_all()
# A tibble: 18 x 4
     `1`   `2`   `3`      Val
   <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0 1.000000
 2     0     0     1 1.000000
 3     0     1     0 5.000000
 4     0     1     1 3.000000
 5     0     2     0 6.000000
 6     0     2     1 3.500000
 7     1     0     0 2.000000
 8     1     0     1 1.500000
 9     1     1     0 3.500000
10     1     1     1 2.666667
11     1     2     0 4.000000
12     1     2     1 3.000000
13     2     0     0 2.000000
14     2     0     1 1.500000
15     2     1     0 3.500000
16     2     1     1 2.666667
17     2     2     0 4.000000
18     2     2     1 3.000000

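Although this question is explicitly tagged dplyr, I started with a data.table solution, which I hope is more "elegant". At least it avoids the nested for loops.

Edit: I have added a dplyr/tidyr version of the data.table approach.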


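The OP has a dataset with a number of "key columns" and a value that is associated with the combinations in which only one of the key columns is set. Then there is a second dataset in which multiple key columns are set and the value is missing. The task is to calculate the missing values from the "one key column set" rows of the first dataset.

Unfortunately, the given data myMatrix contains a mix of both datasets, which adds to the complexity of the problem.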
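To make the "two datasets in one" point concrete, here is a minimal base R sketch that separates the two kinds of rows by counting how many key columns are non-zero. It uses a small stand-in for myMatrix with only a few of its rows, and the names keys, n_set, primary and composite are made up for illustration.

```r
# Small stand-in for myMatrix (only a few rows of the example data)
keys <- data.frame(
  `1` = c(0, 1, 0, 1, 1),
  `2` = c(0, 0, 1, 1, 1),
  `3` = c(0, 0, 0, 0, 1),
  Val = c(1, 2, 5, NA, NA),
  check.names = FALSE
)
key_cols <- c("1", "2", "3")

# Number of key columns that are set (non-zero) in each row
n_set <- rowSums(keys[key_cols] != 0)

# "First dataset": at most one key set, Val is given
primary <- keys[n_set <= 1, ]
# "Second dataset": several keys set, Val still has to be calculated
composite <- keys[n_set > 1, ]

print(primary)
print(composite)
```

The solutions below do not split the data like this. They filter on !is.na(Val) and key > 0 instead, which amounts to the same thing for this data.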

data.table solution

library(data.table)
# convert to data.table, add column with row numbers for subsequent join
DT <- data.table(myMatrix)[, rn := .I]
# reshape from wide to long format, 
# rename column using a self-explanatory name
DT_long <- melt(DT, id.vars = c("rn", "Val"), na.rm  = TRUE, value.name = "key")
# extract primary keys
primary_keys <- DT_long[!is.na(Val) & key > 0]
primary_keys
   rn Val variable key
1:  2   2        1   1
2:  3   2        1   2
3:  4   5        2   1
4:  7   6        2   2
5: 10   1        3   1
# right join to keep all rows in DT_long
# (the join column is "key", as named by value.name in melt() above)
result <- primary_keys[DT_long, on = c("variable", "key")][
  # calculate new Val by aggregating row-wise
  , .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .(rn = i.rn)]
result
    rn  calcVal
 1:  1 1.000000
 2:  2 2.000000
 3:  3 2.000000
 4:  4 5.000000
 5:  5 3.500000
 6:  6 3.500000
 7:  7 6.000000
 8:  8 4.000000
 9:  9 4.000000
10: 10 1.000000
11: 11 1.500000
12: 12 1.500000
13: 13 3.000000
14: 14 2.666667
15: 15 2.666667
16: 16 3.500000
17: 17 3.000000
18: 18 3.000000
# join calculated values with original table, remove row numbers as no longer needed
result <- result[DT, on = "rn"][, rn := NULL][]
# beautify result for easier comparison:
# reorder the columns by reference, then sort the rows by all columns
setcolorder(result, c(names(myMatrix), "calcVal"))
setorderv(result, names(result))
result
    1 2 3 Val  calcVal
 1: 0 0 0   1 1.000000
 2: 0 0 1   1 1.000000
 3: 0 1 0   5 5.000000
 4: 0 1 1  NA 3.000000
 5: 0 2 0   6 6.000000
 6: 0 2 1  NA 3.500000
 7: 1 0 0   2 2.000000
 8: 1 0 1  NA 1.500000
 9: 1 1 0  NA 3.500000
10: 1 1 1  NA 2.666667
11: 1 2 0  NA 4.000000
12: 1 2 1  NA 3.000000
13: 2 0 0   2 2.000000
14: 2 0 1  NA 1.500000
15: 2 1 0  NA 3.500000
16: 2 1 1  NA 2.666667
17: 2 2 0  NA 4.000000
18: 2 2 1  NA 3.000000

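Please note that the data.table code above was written to explain the processing steps. Rewriting it with more chaining would make it more concise, because some of the intermediate results could be skipped.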
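As a rough sketch of what a more chained version might look like: the example data is rebuilt inline so that the snippet runs on its own, and the names long, calc and out are illustrative and not part of the code above. The logic is the same (melt, join on variable and key, average per row number).

```r
library(data.table)

# Rebuild the example data: all 18 key combinations, with Val given only for
# the (0,0,0) row and the rows that have exactly one key set
DT <- data.table(
  `1` = rep(0:2, times = 6),
  `2` = rep(rep(0:2, each = 3), times = 2),
  `3` = rep(0:1, each = 9),
  Val = c(1, 2, 2, 5, NA, NA, 6, NA, NA, 1, rep(NA, 8))
)
DT[, rn := .I]

# Wide to long: one row per (rn, key column)
long <- melt(DT, id.vars = c("rn", "Val"), value.name = "key")

# Primary keys, right join and row-wise aggregation in a single chain
calc <- long[!is.na(Val) & key > 0][long, on = c("variable", "key")][
  , .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .(rn = i.rn)]

# Attach the calculated values to the original rows and drop the helper column
out <- calc[DT, on = "rn"][, rn := NULL][]
print(out)
```

The rows of out are in the original order of the data, so row 5 is the (1,1,0) combination.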

dplyr/tidyr solution

The code below is a "translation" of the data.table solution:

library(dplyr)
library(tidyr)
# add row numbers for the subsequent join
tmpMatrix <- myMatrix %>%
  mutate(rn = row_number())
# reshape from wide to long format
tmpLong <- tmpMatrix %>%
  gather(Col, Keys, -Val, -rn) %>%
  print()
# extract primary keys
tmpPrimKeys <- tmpLong %>%
  filter(!is.na(Val) & Keys > 0) %>%
  select(-rn) %>%
  print()
# join, aggregate row-wise, and attach the result to the original table
tmpLong %>%
  left_join(tmpPrimKeys, by = c("Col", "Keys")) %>%
  group_by(rn) %>%
  summarise(calcVal = mean(c(Val.x, Val.y), na.rm = TRUE)) %>%
  inner_join(tmpMatrix, by = "rn") %>%
  select(num_range("", 1:3), Val, calcVal) %>%
  arrange_all()
# A tibble: 18 x 5
     `1`   `2`   `3`   Val  calcVal
   <dbl> <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0     1 1.000000
 2     0     0     1     1 1.000000
 3     0     1     0     5 5.000000
 4     0     1     1    NA 3.000000
 5     0     2     0     6 6.000000
 6     0     2     1    NA 3.500000
 7     1     0     0     2 2.000000
 8     1     0     1    NA 1.500000
 9     1     1     0    NA 3.500000
10     1     1     1    NA 2.666667
11     1     2     0    NA 4.000000
12     1     2     1    NA 3.000000
13     2     0     0     2 2.000000
14     2     0     1    NA 1.500000
15     2     1     0    NA 3.500000
16     2     1     1    NA 2.666667
17     2     2     0    NA 4.000000
18     2     2     1    NA 3.000000
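In current tidyr (1.0.0 and later) gather() is superseded by pivot_longer(). The following is a sketch of the same pipeline with pivot_longer() and an explicit arrange(), assuming a recent dplyr and tidyr. The example data is rebuilt inline so that the snippet is self-contained, and res is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data (same 18 rows as myMatrix above)
myMatrix <- tibble(
  `1` = rep(0:2, times = 6),
  `2` = rep(rep(0:2, each = 3), times = 2),
  `3` = rep(0:1, each = 9),
  Val = c(1, 2, 2, 5, NA, NA, 6, NA, NA, 1, rep(NA, 8))
)

tmpMatrix <- myMatrix %>%
  mutate(rn = row_number())

# Wide to long with pivot_longer() instead of gather()
tmpLong <- tmpMatrix %>%
  pivot_longer(c(`1`, `2`, `3`), names_to = "Col", values_to = "Keys")

# Rows that carry a known value for a single non-zero key
tmpPrimKeys <- tmpLong %>%
  filter(!is.na(Val), Keys > 0) %>%
  select(-rn)

res <- tmpLong %>%
  left_join(tmpPrimKeys, by = c("Col", "Keys")) %>%
  group_by(rn) %>%
  summarise(calcVal = mean(c(Val.x, Val.y), na.rm = TRUE)) %>%
  inner_join(tmpMatrix, by = "rn") %>%
  select(`1`, `2`, `3`, Val, calcVal) %>%
  arrange(`1`, `2`, `3`)

print(res, n = 18)
```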
