Calculating the impact of a single observation on the overall result in R



I have an overview table: a list of item counts, actual costs, and predicted costs.

library(data.table)
myData <- data.table("itemCount" = c(3000, 20, 50, 9),
"cost" = c(120, 118, 165, 93), 
"prediction" = c(120, 100, 150, 120))

Then I calculate the individual and overall profit:

myData[, "profit" := cost/prediction]
total <- myData[, .(itemsTotal = sum(itemCount),
costTotal  = sum(cost), 
predictionTotal = sum(prediction))][
, "profit" := costTotal/predictionTotal 
]

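Now, for each row, I want to calculate what the overall profit would be if that row were excluded from the analysis. For example, if the second row were missing: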

myData$diffinProfit <- NA
myDataEx <- myData[- 2, ]
totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
costTotal  = sum(cost), 
predictionTotal = sum(prediction))][
, "profit" := costTotal/predictionTotal 
]

So I wrote a for loop to do this:

myData$diffinProfit <- NA
for(observation in seq_along(length(myData)-1)){

myDataEx <- myData[- observation, ]
totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
costTotal  = sum(cost), 
predictionTotal = sum(prediction))][
, "profit" := costTotal/predictionTotal 
]

myData$diffinProfit[[observation]] <- totalEx$profit

}

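However, I only get a result for the first observation. How can I fix the for loop? Is there a way to use an apply function instead? I was thinking of mapply, or possibly a purrr function?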

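The first problem you're running into is that length(myData) reports the number of columns, not the number of rows. On top of that, seq_along() applied to that single number returns just 1, so the loop body runs only once. But I think we can do this without a for loop (although sapply is much the same thing under the hood).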

myData[, otherProfit := sapply(seq_len(.N), function(z) sum(cost[-z])/sum(prediction[-z]))]
myData
#    itemCount  cost prediction profit otherProfit
#        <num> <num>      <num>  <num>       <num>
# 1:      3000   120        120  1.000   1.0162162
# 2:        20   118        100  1.180   0.9692308
# 3:        50   165        150  1.100   0.9735294
# 4:         9    93        120  0.775   1.0891892
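If you would rather keep the explicit loop from the question, a minimal sketch of a repaired version might look like this. It assumes the same myData as above; iterating over seq_len(nrow(myData)) and writing each cell with data.table's set() are my choices for illustration, not the only way to fix it.

```r
library(data.table)

myData <- data.table(itemCount  = c(3000, 20, 50, 9),
                     cost       = c(120, 118, 165, 93),
                     prediction = c(120, 100, 150, 120))

# Pre-allocate the result column as numeric
myData[, diffinProfit := NA_real_]

# seq_len(nrow(myData)) yields 1, 2, ..., up to the number of rows
for (observation in seq_len(nrow(myData))) {
  # Drop the current row, then total the remaining rows
  myDataEx <- myData[-observation]
  totalEx  <- myDataEx[, .(costTotal = sum(cost),
                           predictionTotal = sum(prediction))]
  # set() assigns a single cell by reference
  set(myData, i = observation, j = "diffinProfit",
      value = totalEx$costTotal / totalEx$predictionTotal)
}
myData
```

The diffinProfit column should then match the otherProfit column computed with sapply above.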

Mathematically, though, this can be done without any loop at all:

sumcost <- sum(myData$cost)
sumpred <- sum(myData$prediction)
myData[, profit2 := (sumcost-cost)/(sumpred-prediction)]
myData
#    itemCount  cost prediction profit otherProfit   profit2
#        <num> <num>      <num>  <num>       <num>     <num>
# 1:      3000   120        120  1.000   1.0162162 1.0162162
# 2:        20   118        100  1.180   0.9692308 0.9692308
# 3:        50   165        150  1.100   0.9735294 0.9735294
# 4:         9    93        120  0.775   1.0891892 1.0891892

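I'm not going to benchmark this on 4 rows, but I'd be surprised if this second "vectorized" method weren't more efficient than the sapply for-loop replacement above.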
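To check that expectation on something larger than 4 rows, a rough sketch along these lines could be used. The table bigData, its size n, the seed, and the column names viaSapply and viaTotals are all made up for illustration, and the timings will vary from machine to machine.

```r
library(data.table)

set.seed(1)    # arbitrary seed, only for reproducibility
n <- 5000L     # arbitrary size chosen for illustration
bigData <- data.table(cost       = runif(n, 50, 200),
                      prediction = runif(n, 50, 200))

# Loop-style: recompute both sums for every left-out row
timeLoop <- system.time(
  bigData[, viaSapply := sapply(seq_len(.N),
                                function(z) sum(cost[-z]) / sum(prediction[-z]))]
)

# Vectorized: subtract each row's values from the precomputed totals
timeVec <- system.time(
  bigData[, viaTotals := (sum(cost) - cost) / (sum(prediction) - prediction)]
)

# Both approaches should agree up to floating-point tolerance
all.equal(bigData$viaSapply, bigData$viaTotals)

# Elapsed seconds for each approach
c(sapply = timeLoop[["elapsed"]], vectorized = timeVec[["elapsed"]])
```

The sapply version does work proportional to the square of the row count, because it re-sums every remaining row for each left-out row, while the vectorized version computes each total once.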

You can use row IDs together with data.table's native .GRP group counter:

library(data.table)
myData <- data.table("itemCount" = c(3000, 20, 50, 9),
"cost" = c(120, 118, 165, 93), 
"prediction" = c(120, 100, 150, 120))
myData[, "profit" := cost/prediction]
# assign row ids
myData[, ID := .I]
# loop over each row and take all values that are not in the current row
# .GRP is a group identifier and since you loop over all rows, there are as many groups as rows
myData[, total_profit_excl := myData[ID != .GRP, sum(cost) / sum(prediction)],
by = ID]
myData
#>    itemCount cost prediction profit ID total_profit_excl
#> 1:      3000  120        120  1.000  1         1.0162162
#> 2:        20  118        100  1.180  2         0.9692308
#> 3:        50  165        150  1.100  3         0.9735294
#> 4:         9   93        120  0.775  4         1.0891892
