The title may look a bit confusing, so let me try to clarify with a small example:
I have a data frame with 3 columns, like this:
col1 col2 col3
1 A,D,C sd,dg,ds 5,26,1
2 D,F fh,we 85,41
3 H hr 27
4 C,A,D ds,sd,dg 235,65,3
5 Q,G,J rt,gh,we 34,98,65
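I want to sort the elements within each col1 entry alphabetically, and then reorder the elements of col2 and col3 so that they follow the new order of col1, giving this: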
col1 col2 col3
1 A,C,D sd,ds,dg 5,1,26
2 D,F fh,we 85,41
3 H hr 27
4 A,C,D sd,ds,dg 65,235,3
5 G,J,Q gh,we,rt 98,65,34
This matters because I later want to aggregate by col1, and I need rows 1 and 4 in the example to be equal (A,C,D).
This is as far as I have got:
MWE:
my.df <- data.frame(col1=c('A,D,C','D,F','H','C,A,D','Q,G,J'), col2=c('sd,dg,ds','fh,we','hr','ds,sd,dg','rt,gh,we'), col3=c('5,26,1','85,41','27','235,65,3','34,98,65'))
my.df
my.df$col1 <- sapply(sapply(strsplit(as.character(my.df$col1), ','), sort), paste, collapse=',')
my.df
Any help is appreciated. Thanks!
You can turn each row into a data frame, reorder that data frame by its first column, and then paste everything back together:
# split the entries by commas and
# turn each row of my.df into a data frame
# storing each data frame in a list element
dfList <- lapply(
  apply(my.df, 1, strsplit, ","),
  function(x) data.frame(x))
# sort each data frame by col1
dfSortedList <- lapply(dfList, function(x) x[with(x, order(col1)), ])
# paste columns back together and arrange as desired
t(sapply(dfSortedList, function(x) apply(x, 2, paste, collapse = ",")))
# col1 col2 col3
#[1,] "A,C,D" "sd,ds,dg" "5,1,26"
#[2,] "D,F" "fh,we" "85,41"
#[3,] "H" "hr" "27"
#[4,] "A,C,D" "sd,ds,dg" "65,235,3"
#[5,] "G,J,Q" "gh,we,rt" "98,65,34"
You can convert the result back to a data frame if necessary.
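As a rough illustration of that last step, the character matrix can be passed to `as.data.frame`. The matrix `res` below is only a stand-in built from the first two rows of the output above (an assumption for the sake of a self-contained example); in practice it would be the value returned by the `t(sapply(...))` call.

```r
# Stand-in for the character matrix returned by t(sapply(...));
# the values are copied from the first two rows of the output above.
res <- matrix(
  c("A,C,D", "sd,ds,dg", "5,1,26",
    "D,F",   "fh,we",    "85,41"),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("col1", "col2", "col3"))
)

# Convert to a data frame, keeping the columns as character rather than factor
res.df <- as.data.frame(res, stringsAsFactors = FALSE)
str(res.df)
```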
Here you go:
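my.df <- data.frame(col1=c('A,D,C','D,F','H','C,A,D','Q,G,J'), col2=c('sd,dg,ds','fh,we','hr','ds,sd,dg','rt,gh,we'), col3=c('5,26,1','85,41','27','235,65,3','34,98,65'), stringsAsFactors = FALSE)
for (k in seq_len(nrow(my.df))) {
  # split row k on commas into a temporary data frame, one column per field
  tempdf <- data.frame(strsplit(my.df[k,1],","), strsplit(my.df[k,2],","), strsplit(my.df[k,3],","), stringsAsFactors = FALSE)
  # sort the temporary data frame by its first column
  tempdf <- tempdf[order(tempdf[,1]),]
  # collapse each column back into one string and write it into row k
  my.df[k,] <- sapply(tempdf, paste, collapse=",")
}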
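As you can see, I turn each row into a temporary data frame by splitting its strings on commas. Then you just sort the temporary data frame by its first column. From there, you collapse each column of tempdf back into a single string and write it back into the original my.df.
Result: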
> my.df
col1 col2 col3
1 A,C,D sd,ds,dg 5,1,26
2 D,F fh,we 85,41
3 H hr 27
4 A,C,D sd,ds,dg 65,235,3
5 G,J,Q gh,we,rt 98,65,34
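The same split, order, and paste steps can also be written without an explicit loop. What follows is a minimal base R sketch rather than part of the answer above: the helper name `reorder_row` is made up for illustration, and it assumes that all three columns of a row hold the same number of comma-separated items.

```r
my.df <- data.frame(
  col1 = c("A,D,C", "D,F", "H", "C,A,D", "Q,G,J"),
  col2 = c("sd,dg,ds", "fh,we", "hr", "ds,sd,dg", "rt,gh,we"),
  col3 = c("5,26,1", "85,41", "27", "235,65,3", "34,98,65"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reorder the fields of one row by the sorted order
# of the first field (col1).
reorder_row <- function(...) {
  parts <- strsplit(c(...), ",", fixed = TRUE)  # one character vector per column
  idx <- order(parts[[1]])                      # ordering taken from col1
  vapply(parts, function(p) paste(p[idx], collapse = ","), character(1))
}

# mapply returns one column per input row, so transpose to get rows back
sorted <- t(mapply(reorder_row, my.df$col1, my.df$col2, my.df$col3,
                   USE.NAMES = FALSE))
sorted.df <- as.data.frame(sorted, stringsAsFactors = FALSE)
names(sorted.df) <- names(my.df)
sorted.df
```

After this, rows 1 and 4 share the same col1 value, which is what a later aggregation by col1 needs.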
We can do this with cSplit from splitstackshape, together with data.table.
library(data.table)       # provides setDT
library(splitstackshape)  # provides cSplit
na.omit(cSplit(setDT(my.df, keep.rownames=TRUE), 2:4, ",","long"))[
, {i1 <- order(col1)
lapply(.SD, function(x) paste(x[i1], collapse=","))
}, rn][, rn:= NULL][]
# col1 col2 col3
#1: A,C,D sd,ds,dg 5,1,26
#2: D,F fh,we 85,41
#3: H hr 27
#4: A,C,D sd,ds,dg 65,235,3
#5: G,J,Q gh,we,rt 98,65,34
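A slightly longer option is to split "col1" and convert the dataset to "long" format with cSplit. Then, grouped by "col2" and "col3", we create an order column ("i1") and the sorted "col1". Next, with .SDcols set to "col2" and "col3", we loop over those columns with lapply, split each one on ",", reorder the pieces according to the "i1" column with Map, paste them together and assign (:=) the output back to the original columns. If needed, assign "i1" to NULL.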
d1 <- cSplit(my.df, "col1", ",", "long")[,
.(i1 = list(order(col1)), col1 = toString(sort(col1))) ,.(col2, col3)]
d1[, c('col2', 'col3') := lapply(.SD, function(x)
Map(function(x, y) x[y], strsplit(as.character(x), ","), d1$i1)), .SDcols = col2:col3]
d1[, i1:= NULL]
d1[, names(my.df), with = FALSE]
# col1 col2 col3
#1: A, C, D sd,ds,dg 5,1,26
#2: D, F fh,we 85,41
#3: H hr 27
#4: A, C, D sd,ds,dg 65,235,3
#5: G, J, Q gh,we,rt 98,65,34
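Note that col1 in this last output reads "A, C, D" rather than "A,C,D". That comes from `toString`, which joins elements with a comma followed by a space. If the exact format from the question is needed, `paste` with `collapse = ","` is one way to get it; a small sketch of the difference:

```r
x <- sort(c("C", "A", "D"))

with_space <- toString(x)               # joins with ", "
no_space   <- paste(x, collapse = ",")  # joins with "," only

with_space
no_space
```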