In R, forcing unique values before casting (pivoting)

I have a data frame as follows:

Identifier  V1  Location    V2
1   12  A   21
1   12  B   24
2   20  B   15
2   20  C   18
2   20  B   23
3   43  A   10
3   43  B   17
3   43  A   18
3   43  B   20
3   43  C   25
3   43  A   30

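I want to recast it so that there is one row per Identifier and one column for each value currently in the Location column. I don't care about the data in V1, but I do need the data in V2; these will become the values in the new columns.

Note that Identifiers 2 and 3 have repeated values in the Location column.

I assume the first task is to make the values in the Location column unique within each Identifier.

I used the following code (the data frame is called "Test"):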

L<-length(Test$Identifier)
for (i in 1:L) 
{
temp<-Test$Location[Test$Identifier==i]
temp1<-make.unique(as.character(temp), sep="-")
levels(Test$Location)=c(levels(Test$Location),temp1)
Test$Location[Test$Identifier==i]=temp1
}

This produces

Identifier  V1  Location    V2
1   12  A   21
1   12  B   24
2   20  B   15
2   20  C   18
2   20  B-1 23
3   43  A   10
3   43  B   17
3   43  A-1 18
3   43  B-1 20
3   43  C   25
3   43  A-2 30

Then using

cast(Test, Identifier ~ Location)

gives

Identifier  A   B   C   B-1 A-1 A-2
1   21  24  NA  NA  NA  NA
2   NA  15  18  23  NA  NA
3   10  17  25  20  18  30

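This is pretty much what I want.

My questions are:

Is this the right way to handle the problem?

I know R people don't use the "for" construct, so is there a more elegant (more idiomatic?) way to do this? I should mention that the real data set has over 160,000 rows and more than 50 unique values in the Location vector, and the function takes just over an hour to run. Anything quicker would be good. I should also mention that, despite increasing the memory limit, the cast function has to be run on 20-30k rows at a time, and all the cast outputs are then merged.

Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?

Please be gentle in your replies!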

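Usually your original (long) format is much better to work with than your desired result. However, you can do this easily with a split-apply-combine approach, e.g., using package plyr: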

DF <- read.table(text="Identifier  V1  Location    V2
1   12  A   21
1   12  B   24
2   20  B   15
2   20  C   18
2   20  B   23
3   43  A   10
3   43  B   17
3   43  A   18
3   43  B   20
3   43  C   25
3   43  A   30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
#  Identifier  A  B B-1  C A-1 A-2
#1          1 21 24  NA NA  NA  NA
#2          2 NA 15  23 18  NA  NA
#3          3 10 17  20 25  18  30
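
As a possible alternative for the renaming step above, the suffix can be derived by counting occurrences within each Identifier/Location pair with base R's `ave`, which avoids calling `make.unique` once per group. This is only a sketch and has not been benchmarked on the 160,000-row data from the question. The names `occ`, `Loc2` and `ref` are illustrative. Unlike `make.unique`, it does not guard against a generated name such as "B-1" colliding with a Location that is already called "B-1", so it assumes no such names exist.

```r
# Same example data as in the question (V1 omitted, since it is not needed here).
DF <- data.frame(
  Identifier = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Location   = c("A", "B", "B", "C", "B", "A", "B", "A", "B", "C", "A"),
  V2         = c(21, 24, 15, 18, 23, 10, 17, 18, 20, 25, 30),
  stringsAsFactors = FALSE
)

# occ[i] = running count of this Location within this Identifier (1, 2, 3, ...).
occ <- ave(seq_along(DF$Location), DF$Identifier, DF$Location, FUN = seq_along)

# The first occurrence keeps its name; later ones get the suffix "-1", "-2", ...
Loc2 <- ifelse(occ == 1, DF$Location, paste(DF$Location, occ - 1, sep = "-"))

# Reference result: make.unique applied separately within each Identifier.
ref <- ave(DF$Location, DF$Identifier,
           FUN = function(x) make.unique(x, sep = "-"))
```

On the example data `Loc2` and `ref` agree, so `Loc2` could be passed to `dcast` in the same way as the column created with `ddply`.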

If the order of the columns matters to you (usually it doesn't):

DFwide[, c(1, order(names(DFwide)[-1])+1)]
#  Identifier  A A-1 A-2  B B-1  C
#1          1 21  NA  NA 24  NA NA
#2          2 NA  NA  NA 15  23 18
#3          3 10  18  30 17  20 25
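
One caveat with plain alphabetical ordering: once a Location repeats more than ten times within an Identifier, a name like "A-10" sorts before "A-2" as a string. The sketch below orders by the base name first and the numeric suffix second. It assumes every suffix was produced by `make.unique(..., sep = "-")` and that no original Location name already ends in a hyphen followed by digits. The vector `cols` is made-up example input, not data from the question.

```r
# Hypothetical column names, including a two-digit suffix.
cols <- c("A", "B", "B-1", "C", "A-1", "A-2", "A-10")

# Base name: everything before a trailing "-<digits>".
base <- sub("-[0-9]+$", "", cols)

# Numeric suffix: 0 for names without a suffix, otherwise the trailing number.
has_suffix <- grepl("-[0-9]+$", cols)
suffix <- integer(length(cols))
suffix[has_suffix] <- as.integer(sub("^.*-", "", cols[has_suffix]))

# Order by base name, then by suffix as a number rather than as text.
ord <- order(base, suffix)
sorted_cols <- cols[ord]
```

Applied to the wide data frame, the same idea would be `DFwide[, c(1, ord + 1)]`, with `cols` taken from `names(DFwide)[-1]`.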

For reference, here is a base R equivalent of @Roland's answer.

Use ave to create a unique "Location" column...

DF$Location <- with(DF, ave(Location, Identifier, 
                    FUN = function(x) make.unique(x, sep = "-")))

...and reshape to change the structure of your data.

## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you 
##    wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier", 
        timevar = "Location", drop = "V1")
#   Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1          1   21   24   NA     NA     NA     NA
# 3          2   NA   15   18     23     NA     NA
# 6          3   10   17   25     20     18     30

The columns can be reordered in the same way @Roland suggested.
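
A minimal sketch of that reordering applied to the `reshape` output, assuming only V2 is kept. Stripping the "V2." prefix is an optional extra step added here so the names match the `dcast` result; `wide` is an illustrative name.

```r
# Example data from the question.
DF <- data.frame(
  Identifier = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  V1         = c(12, 12, 20, 20, 20, 43, 43, 43, 43, 43, 43),
  Location   = c("A", "B", "B", "C", "B", "A", "B", "A", "B", "C", "A"),
  V2         = c(21, 24, 15, 18, 23, 10, 17, 18, 20, 25, 30),
  stringsAsFactors = FALSE
)

# Make Location unique within each Identifier, then reshape to wide format.
DF$Location <- with(DF, ave(Location, Identifier,
                            FUN = function(x) make.unique(x, sep = "-")))
wide <- reshape(DF, direction = "wide", idvar = "Identifier",
                timevar = "Location", drop = "V1")

# Optional: drop the "V2." prefix that reshape adds to the value columns.
names(wide) <- sub("^V2\\.", "", names(wide))

# Keep Identifier first and sort the remaining columns by name.
wide <- wide[, c(1, order(names(wide)[-1]) + 1)]
rownames(wide) <- NULL
```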
