r语言 - 多变量的条件合并



我想知道是否有一个更快,更优雅的数据。

假设我们有两个数据集

set.seed(1)
library(data.table)
DT1 <- data.table(income = runif(10, 0,1),
ID = 1:10)
DT2 <- data.table(height = runif(20,0,1),
weight = runif(20,0,1),
V1 = runif(20,0,1),
V2 = runif(20,0,2),
type = rep(c("Parents", "Children"),10))
DT2 <- DT2[order(type)][, ID := rep(1:10,2)]

第一个数据集是一个"家庭"。level dataset with household identifiers with ID, from 1:10.

有第二个数据集DT2,对于每个家庭ID,每个父母和孩子都有四个变量。我要做的是合并所有的变量(高度,权值,V1, V2)父结点和子结点的每一行/DT1的观测值。因此,我们将有8个变量要合并,4个用于父节点,4个用于子节点。

要做到这一点,我可以简单地这样写:

DT1[DT2[type == "Parents"], c("height_parents", "weight_parents",
"V1_parents", "V2_parents") := list(i.height, i.weight,
i.V1, i.V2), on = c(ID = "ID")]
DT1[DT2[type == "Children"], c("height_children", "weight_children",
"V1_children", "V2_children") := list(i.height, i.weight,
i.V1, i.V2), on = c(ID = "ID")]

输出如下:

income ID height_parents weight_parents V1_parents V2_parents height_children
1: 0.70647001  1     0.49163534    0.164385214  0.6806198 1.72937701      0.04655907
2: 0.07658058  2     0.06776809    0.234182275  0.4456820 1.76814822      0.45665042
3: 0.49770601  3     0.23255515    0.709256017  0.3514867 1.83387012      0.24395311
4: 0.51944306  4     0.30555999    0.974742471  0.0529102 0.06094086      0.22356168
5: 0.23075737  5     0.51104028    0.007269433  0.4157508 1.00207079      0.98915308
6: 0.86449990  6     0.75420198    0.211342425  0.9837331 0.03520897      0.86080818
weight_children V1_children V2_Children
1:       0.3398447   0.7454582   0.6761706
2:       0.8475106   0.4716267   1.5231691
3:       0.8895790   0.1561395   1.8721056
4:       0.3503219   0.2663775   0.2408758
5:       0.3902352   0.9332958   1.3532260
6:       0.7648748   0.6969372   1.3289579

注意,如果我有许多不同的"类型"(这里只有两个)和/或许多不同的变量要合并(这里只有4个),上面的内容很快就会变得费力和冗长。我使用的数据集有许多不同的变量和类型。因此,我希望以更有效的方式做到这一点。特别地,我希望能够定义一个向量:

merge_variables = c("height", "weight", "V1", "V2") 

和在数据中。表的样式,将父节点和子节点的所有变量合并。我希望新的变量有一个下划线在他们的类型名称(例如height_parentsheight_children)。

我希望我已经把我的要求说清楚了。

谢谢!

我们可以使用mget来做到这一点,即使用paste创建对象名称作为字符串,然后在subsetting 'DT2'后加入on的'ID',并分配列的输出(即mget返回向量的list)以在'DT1'中创建新列

merge_variables = c("height", "weight", "V1", "V2") 
parent_nm <- paste0(merge_variables, "_parents")
child_nm <- paste0(merge_variables, "_children")
DT1[DT2[type == "Parents"], (parent_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
DT1[DT2[type == "Children"], (child_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]

与产出

> DT1
income ID height_parents weight_parents V1_parents V2_parents height_children weight_children V1_children V2_children
1: 0.26550866  1     0.20597457      0.4820801 0.47761962  0.6781459       0.1765568       0.5995658  0.86120948   1.6788807
2: 0.37212390  2     0.68702285      0.4935413 0.43809711  0.6933670       0.3841037       0.1862176  0.24479728   0.6675499
3: 0.57285336  3     0.76984142      0.8273733 0.07067905  0.9527025       0.4976992       0.6684667  0.09946616   1.7843967
4: 0.90820779  4     0.71761851      0.7942399 0.31627171  1.7286789       0.9919061       0.1079436  0.51863426   0.7799791
5: 0.20168193  5     0.38003518      0.7237109 0.66200508  1.5546414       0.7774452       0.4112744  0.40683019   1.9212360
6: 0.89838968  6     0.93470523      0.8209463 0.91287592  0.8693190       0.2121425       0.6470602  0.29360337   1.4250294
7: 0.94467527  7     0.65167377      0.7829328 0.45906573  0.7999887       0.1255551       0.5530363  0.33239467   0.6507043
8: 0.66079779  8     0.26722067      0.5297196 0.65087047  1.5141743       0.3861141       0.7893562  0.25801678   0.4053845
9: 0.62911404  9     0.01339033      0.0233312 0.47854525  1.4222424       0.3823880       0.4772301  0.76631067   0.2433838
10: 0.06178627 10     0.86969085      0.7323137 0.08424691  0.4909770       0.3403490       0.6927316  0.87532133   0.2866088

或者另一种选择是dcast'DT2'到'wide'格式并进行连接(但是,这里我们没有分配(:=),因此它不会更新原始DT1)

DT1[dcast(DT2, ID ~ type, value.var = merge_variables), on = .(ID)]
income ID height_Children height_Parents weight_Children weight_Parents V1_Children V1_Parents V2_Children V2_Parents
1: 0.26550866  1       0.1765568     0.20597457       0.5995658      0.4820801  0.86120948 0.47761962   1.6788807  0.6781459
2: 0.37212390  2       0.3841037     0.68702285       0.1862176      0.4935413  0.24479728 0.43809711   0.6675499  0.6933670
3: 0.57285336  3       0.4976992     0.76984142       0.6684667      0.8273733  0.09946616 0.07067905   1.7843967  0.9527025
4: 0.90820779  4       0.9919061     0.71761851       0.1079436      0.7942399  0.51863426 0.31627171   0.7799791  1.7286789
5: 0.20168193  5       0.7774452     0.38003518       0.4112744      0.7237109  0.40683019 0.66200508   1.9212360  1.5546414
6: 0.89838968  6       0.2121425     0.93470523       0.6470602      0.8209463  0.29360337 0.91287592   1.4250294  0.8693190
7: 0.94467527  7       0.1255551     0.65167377       0.5530363      0.7829328  0.33239467 0.45906573   0.6507043  0.7999887
8: 0.66079779  8       0.3861141     0.26722067       0.7893562      0.5297196  0.25801678 0.65087047   0.4053845  1.5141743
9: 0.62911404  9       0.3823880     0.01339033       0.4772301      0.0233312  0.76631067 0.47854525   0.2433838  1.4222424
10: 0.06178627 10       0.3403490     0.86969085       0.6927316      0.7323137  0.87532133 0.08424691   0.2866088  0.4909770

最新更新