R: split and sort data frame columns according to another column



The title may be a little confusing, so let me try to clarify with a small example:

I have a data frame with 3 columns, like this:

   col1     col2     col3
1 A,D,C sd,dg,ds   5,26,1
2   D,F    fh,we    85,41
3     H       hr       27
4 C,A,D ds,sd,dg 235,65,3
5 Q,G,J rt,gh,we 34,98,65
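
I want to sort the elements of col1 alphabetically within each row, and then reorder the elements of col2 and col3 so that they follow the new order of col1, giving this: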

   col1     col2     col3
1 A,C,D sd,ds,dg   5,1,26
2   D,F    fh,we    85,41
3     H       hr       27
4 A,C,D sd,ds,dg 65,235,3
5 G,J,Q gh,we,rt 98,65,34

This matters because I later want to aggregate by col1, and I need rows 1 and 4 of the example to be equal (A,C,D).

So far, this is where I am stuck:

MWE:

my.df <- data.frame(col1=c('A,D,C','D,F','H','C,A,D','Q,G,J'), col2=c('sd,dg,ds','fh,we','hr','ds,sd,dg','rt,gh,we'), col3=c('5,26,1','85,41','27','235,65,3','34,98,65'))
my.df
my.df$col1 <- sapply(sapply(strsplit(as.character(my.df$col1), ','), sort), paste, collapse=',')
my.df

Any help is appreciated. Thanks!

You can turn each row into a data frame, reorder that data frame by its first column, and then paste everything back together:

# split the entries by commas and
# turn each row of my.df into a data frame
# storing each data frame in a list element
dfList <- lapply(
  apply(my.df, 1, strsplit, ","),
  function(x) data.frame(x))
# sort each data frame by col1
dfSortedList <- lapply(dfList, function(x) x[with(x, order(col1)), ])
# paste columns back together and arrange as desired
t(sapply(dfSortedList, function(x) apply(x, 2, paste, collapse = ",")))
#     col1    col2       col3      
#[1,] "A,C,D" "sd,ds,dg" "5,1,26"  
#[2,] "D,F"   "fh,we"    "85,41"   
#[3,] "H"     "hr"       "27"      
#[4,] "A,C,D" "sd,ds,dg" "65,235,3"
#[5,] "G,J,Q" "gh,we,rt" "98,65,34"

The result is a character matrix; you can convert it back to a data frame if necessary.
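As a minimal sketch of that conversion, the matrix can be passed to `as.data.frame()`. The names `res` and `res.df` are introduced here for illustration only, and the input is rebuilt so that the snippet runs on its own:

```r
# Rebuild the example input so this sketch is self-contained.
my.df <- data.frame(
  col1 = c("A,D,C", "D,F", "H", "C,A,D", "Q,G,J"),
  col2 = c("sd,dg,ds", "fh,we", "hr", "ds,sd,dg", "rt,gh,we"),
  col3 = c("5,26,1", "85,41", "27", "235,65,3", "34,98,65"),
  stringsAsFactors = FALSE
)

# Same steps as above: split each row, sort by col1, paste back together.
dfList <- lapply(apply(my.df, 1, strsplit, ","), function(x) data.frame(x))
dfSortedList <- lapply(dfList, function(x) x[order(x$col1), ])
res <- t(sapply(dfSortedList, function(x) apply(x, 2, paste, collapse = ",")))

# res is a character matrix; as.data.frame() turns it into a data frame
# whose columns keep the names col1, col2 and col3.
res.df <- as.data.frame(res, stringsAsFactors = FALSE)
str(res.df)
```

Passing `stringsAsFactors = FALSE` keeps the columns as character vectors on R versions older than 4.0, where the default was to create factors.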

Here you go:

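my.df <- data.frame(col1=c('A,D,C','D,F','H','C,A,D','Q,G,J'), col2=c('sd,dg,ds','fh,we','hr','ds,sd,dg','rt,gh,we'), col3=c('5,26,1','85,41','27','235,65,3','34,98,65'),stringsAsFactors = FALSE)
for (k in seq_len(nrow(my.df))){
    # split row k on commas; each original column becomes one column of a temporary data frame
    tempdf <- data.frame(strsplit(my.df[k,1],","),strsplit(my.df[k,2],","),strsplit(my.df[k,3],","),stringsAsFactors = FALSE)
    # reorder the temporary data frame by its first column (the col1 values)
    tempdf <- tempdf[order(tempdf[,1]),]
    # collapse each column back into one comma-separated string and write it into row k
    my.df[k,] <- sapply(tempdf,paste,collapse=",")
}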

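As you can see, I turn each row into a temporary data frame by splitting the strings on commas. The temporary data frame is then sorted by its first column. Finally, each column of tempdf is collapsed back into a single string, which replaces the corresponding row of the original my.df.

Result: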

> my.df
   col1     col2     col3
1 A,C,D sd,ds,dg   5,1,26
2   D,F    fh,we    85,41
3     H       hr       27
4 A,C,D sd,ds,dg 65,235,3
5 G,J,Q gh,we,rt 98,65,34
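
The same idea can be wrapped in a small function that handles any number of columns. This is only a sketch: `sort_by_key` and `ex.df` are hypothetical names introduced here, not part of the answer above, and the function assumes every column of a row holds the same number of comma-separated elements.

```r
# Hypothetical helper: reorder every comma-separated column of `df`
# by the sorted order of the elements in the `key` column.
sort_by_key <- function(df, key = "col1", sep = ",") {
  # split every cell of every column into its elements
  parts <- lapply(df, function(col) strsplit(as.character(col), sep, fixed = TRUE))
  # one permutation per row, taken from the key column
  ord <- lapply(parts[[key]], order)
  # apply that permutation to each column and collapse back to strings
  df[] <- lapply(parts, function(p) {
    mapply(function(v, o) paste(v[o], collapse = sep), p, ord)
  })
  df
}

ex.df <- data.frame(
  col1 = c("A,D,C", "D,F", "H", "C,A,D", "Q,G,J"),
  col2 = c("sd,dg,ds", "fh,we", "hr", "ds,sd,dg", "rt,gh,we"),
  col3 = c("5,26,1", "85,41", "27", "235,65,3", "34,98,65"),
  stringsAsFactors = FALSE
)

sorted <- sort_by_key(ex.df)
# Rows 1 and 4 now carry the same key, so they fall into one group
# when aggregating by col1.
table(sorted$col1)
```

Computing the permutation once with `order()` and reusing it for every column is what keeps col2 and col3 aligned with col1.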

We can do this with cSplit from splitstackshape, together with data.table.

library(splitstackshape)
na.omit(cSplit(setDT(my.df, keep.rownames=TRUE), 2:4, ",","long"))[
        , {i1 <- order(col1)
         lapply(.SD, function(x) paste(x[i1], collapse=","))
     }, rn][, rn:= NULL][]
#   col1     col2     col3
#1: A,C,D sd,ds,dg   5,1,26
#2:   D,F    fh,we    85,41
#3:     H       hr       27
#4: A,C,D sd,ds,dg 65,235,3
#5: G,J,Q gh,we,rt 98,65,34

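A slightly longer alternative is to split "col1" and reshape the dataset to "long" format with cSplit. Then, grouping by "col2" and "col3", we create an order column ("i1") and the sorted "col1". Next, we specify "col2" and "col3" in .SDcols, loop over them with lapply, split each column on ",", reorder the pieces according to the "i1" column with Map, paste them together, and assign (:=) the output back to the original columns. If needed, assign "i1" to NULL.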

d1 <- cSplit(my.df, "col1", ",", "long")[, 
 .(i1 = list(order(col1)), col1 = toString(sort(col1))) ,.(col2, col3)]
d1[,  c('col2', 'col3') := lapply(.SD, function(x) 
  Map(function(x, y) x[y], strsplit(as.character(x), ","), d1$i1)), .SDcols = col2:col3]
d1[, i1:= NULL]
d1[, names(my.df), with = FALSE]
#     col1     col2     col3
#1: A, C, D sd,ds,dg   5,1,26
#2:    D, F    fh,we    85,41
#3:       H       hr       27
#4: A, C, D sd,ds,dg 65,235,3
#5: G, J, Q gh,we,rt 98,65,34
