I have a data frame like this:
structure(list(one = structure(1:4, .Label = c("a", "b", "c",
"d"), class = "factor"), two = c(2, 4, 7, 3), x.1 = c("x1a",
"x1b", "x1c", "x1d"), x.2 = c("x2a", "x2b", "x2c", "x2d"), x.3 = c("x3a",
"x3b", "x3c", "x3d"), y.1 = c(NA, "y1b", "y1c", NA), y.2 = c(NA,
"y2b", "y2c", NA), y.3 = c(NA, "y3b", "y3c", NA)), .Names = c("one",
"two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3"), row.names = c(NA,
-4L), class = "data.frame")
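As you can see, the observations for events a, b, c and d (variable "one") are stored in columns, where x and y identify separate observations and 1, 2 and 3 identify the variables. The variable "two" is not relevant here.
I want to reshape this data frame so that it is tidy: each observation gets its own row and each variable gets its own column.
The final data frame should look like this: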
structure(list(one = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), two = c(2, 4, 4, 7, 7, 3),
var1 = c("x1a", "x1b", "y1b", "x1c", "y1c", "x1d"), var2 = c("x2a",
"x2b", "y2b", "x2c", "y2c", "x2d"), var3 = c("x3a", "x3b",
"y3b", "x3c", "y3c", "x3d")), .Names = c("one", "two", "var1",
"var2", "var3"), row.names = c(1L, 2L, 5L, 3L, 6L, 4L), class = "data.frame")
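I am somewhat familiar with the cast and melt functions from the reshape package, but I have not been able to come up with a clever way to reshape the data frame. Here is what I have so far: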
df.between <- melt(df.in, id.vars=c("one", "two"))
df.between$variable <- gsub("x.|y.", "", df.between$variable)
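Now the "variable" column correctly identifies the variable (1, 2 or 3). However, I cannot cast it into the desired form, and because it relies on gsub, this solution does not seem practical for larger data sets.
I would be glad to get a hint in the right direction.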
We can use melt from the development version of data.table, i.e. v1.9.5, which can handle multiple patterns for the measure variables.
library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
# one two var1 var2 var3
#1: a 2 x1a x2a x3a
#2: b 4 x1b x2b x3b
#3: b 4 y1b y2b y3b
#4: c 7 x1c x2c x3c
#5: c 7 y1c y2c y3c
#6: d 3 x1d x2d x3d
Edit: We do not need c inside patterns, and it will also give exact matches (from @Jaap's comment).
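To see which columns each pattern picks up, the grouping can be sketched in base R with grep. This is only an illustration of regular-expression matching on the column names from the question, not a reproduction of data.table's internals. The anchored patterns in `strict` are an assumption added here for comparison and are not part of the answer.

```r
# Column names of the example data frame from the question.
cols <- c("one", "two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3")

# Unanchored patterns as used above: "." matches any single character.
loose <- lapply(c(".1", ".2", ".3"), function(p) grep(p, cols, value = TRUE))

# Stricter variants (an assumption, not from the answer): a literal dot
# followed by the digit at the very end of the name.
strict <- lapply(c("\\.1$", "\\.2$", "\\.3$"),
                 function(p) grep(p, cols, value = TRUE))

# Each list element is one group of columns feeding one value column.
names(loose)  <- paste0("var", 1:3)
names(strict) <- paste0("var", 1:3)
print(loose)

# A hypothetical column called "x11" would match the loose pattern
# but not the anchored one.
grepl(".1", "x11")       # loose pattern
grepl("\\.1$", "x11")    # anchored pattern
```

For the column names in this question both variants select the same groups, so the difference only matters for data with less regular names.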
melt from "data.table" will be much faster than the following, but you can also consider merged.stack from my "splitstackshape" package:
library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
sep = "var.stubs", atStart = FALSE))
# one two .time_1 .1 .2 .3
# 1: a 2 x x1a x2a x3a
# 2: b 4 x x1b x2b x3b
# 3: b 4 y y1b y2b y3b
# 4: c 7 x x1c x2c x3c
# 5: c 7 y y1c y2c y3c
# 6: d 3 x x1d x2d x3d
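If neither package is at hand, a similar result can be sketched with base R's reshape(). This is a different technique from merged.stack, shown only for comparison. The data frame is retyped by hand from the question, and the column name `obs` for the x/y indicator is a hypothetical choice.

```r
# Example data from the question, rebuilt by hand.
df.in <- data.frame(
  one = factor(c("a", "b", "c", "d")),
  two = c(2, 4, 7, 3),
  x.1 = c("x1a", "x1b", "x1c", "x1d"),
  x.2 = c("x2a", "x2b", "x2c", "x2d"),
  x.3 = c("x3a", "x3b", "x3c", "x3d"),
  y.1 = c(NA, "y1b", "y1c", NA),
  y.2 = c(NA, "y2b", "y2c", NA),
  y.3 = c(NA, "y3b", "y3c", NA),
  stringsAsFactors = FALSE
)

# One list element per output column; within each element there is one
# source column per observation type (x first, then y).
long <- reshape(
  df.in,
  direction = "long",
  varying   = list(c("x.1", "y.1"), c("x.2", "y.2"), c("x.3", "y.3")),
  v.names   = c("var1", "var2", "var3"),
  timevar   = "obs",          # hypothetical name for the x/y indicator
  times     = c("x", "y"),
  idvar     = "one"
)

# Drop the rows that carry no values, then sort by event and observation.
long <- long[!is.na(long$var1), ]
long <- long[order(long$one, long$obs), ]
rownames(long) <- NULL
print(long)
```

Passing `varying` as a list makes the mapping from source columns to output columns explicit, so nothing depends on how reshape() would guess it from the names.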
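You were almost there with the reshape route, so I finished it for you. You just need to derive a variable for x and y as well. (If you do not want or need it, it is easy to drop later.) I kept the missing data, because they are easy to remove afterwards, and this guards against silently dropping missing data.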
library(reshape2)  # dcast() comes from reshape2
df.between <- melt(df.in, id.vars=c("one", "two"))
#replace with 'var' so no numeric column names.
df.between$variable_n <- gsub("x.|y.", "var", df.between$variable)
#keep the x/y part as its own column so each observation gets its own row.
df.between$variable_xy <- gsub(".[0-9]","",df.between$variable)
res <- dcast(df.between, one+two+variable_xy~variable_n, value.var="value")
> res
one two variable_xy var1 var2 var3
1 a 2 x x1a x2a x3a
2 a 2 y <NA> <NA> <NA>
3 b 4 x x1b x2b x3b
4 b 4 y y1b y2b y3b
5 c 7 x x1c x2c x3c
6 c 7 y y1c y2c y3c
7 d 3 x x1d x2d x3d
8 d 3 y <NA> <NA> <NA>
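Removing the empty rows and the helper column afterwards could look like the sketch below. To keep it self-contained, `res` is retyped by hand from the printed output above rather than recomputed, and the names `value_cols`, `keep` and `tidy` are hypothetical.

```r
# Stand-in for `res`, copied from the printed result above.
res <- data.frame(
  one         = rep(c("a", "b", "c", "d"), each = 2),
  two         = rep(c(2, 4, 7, 3), each = 2),
  variable_xy = rep(c("x", "y"), times = 4),
  var1 = c("x1a", NA, "x1b", "y1b", "x1c", "y1c", "x1d", NA),
  var2 = c("x2a", NA, "x2b", "y2b", "x2c", "y2c", "x2d", NA),
  var3 = c("x3a", NA, "x3b", "y3b", "x3c", "y3c", "x3d", NA),
  stringsAsFactors = FALSE
)

value_cols <- c("var1", "var2", "var3")

# Keep a row if at least one of its value columns is present, so only
# rows that are missing everywhere are removed.
keep <- rowSums(!is.na(res[value_cols])) > 0

# Drop the x/y helper column and reset the row names.
tidy <- res[keep, setdiff(names(res), "variable_xy")]
rownames(tidy) <- NULL
print(tidy)
```

Filtering on "all values missing" rather than using na.omit() keeps partially observed rows, which matches the point about not dropping data silently.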