R - How to subset a data frame by the top quartile of each column


#let's make some sample data first
names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric1_set1 <- c(2.5,3.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric1_set2 <- c(2.1,3.1,4.15,2.5,10,7.1,8.5,10,10.1,17.1,12.3,17.3,8,11,1.5) 
metric1_set3 <- c(12.1,13.1,4.15,2.5,10.5,7.1,2.5,10,7.1,11.1,12.3,17.3,8,1.45,1.5) 
dataset1 <- data.frame(names,metric1_set1,metric1_set2,metric1_set3)

names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric2_set1 <- c(21.5,13.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric2_set2 <- c(12.1,3.1,4.15,2.5,10,7.1,8.5,10,8.1,17.1,12.3,17.3,8,1.1,1.5) 
metric2_set3 <- c(2.1,13.1,4.15,2.5,10.5,7.1,21.5,10,7.1,11.1,12.3,12.3,8,1.45,1.5) 
dataset2 <- data.frame(names,metric2_set1,metric2_set2,metric2_set3)

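The task now is to compute the top quartile of each column of dataset1 and then extract the corresponding names from dataset2. The idea is to get the correlation between these subset values.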

quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)

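This gives the quartiles, but the real problem is how to keep the names associated with the top quartile of one dataset and drop all the other rows from both datasets.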

Following @sconfluentus's suggestion, this can be changed to:

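 topQuartile <- function(x) {
   # quantile() returns the 0%, 25%, 50%, 75% and 100% quantiles by default
   y <- quantile(x, na.rm = TRUE)
   # y[3] is the 50% quantile (the median); the 75% cut-off would be y[4]
   z <- y[3]
   return(z)
 }
 quartile_daatset1 <- apply(dataset1[2:4], 2, topQuartile)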

This works fine, but I also need something like the following:

 topquartile_set1 <- subset(dataset1$metric1_set1, subset = (dataset1$metric1_set1 <= quartile_daatset1[1]))

I need similar code that works for every column and puts all of the subsets into one final data frame.

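The simplest way is to write a function that wraps quantile, extracts the fifth element of its result (the 100% quantile, i.e. the column maximum), and returns it to apply, like so: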

fifthQuantile<-function(x){
  y=quantile(x,  na.rm = TRUE )
  z=y[5]
  return(z)
}
 quantiles<- apply( dataset1[2:4] , 2 , fifthQuantile )

This returns a named numeric vector, with the old column names as its names. If you would like it shaped the other way, try:

quantiles<- t(apply( dataset1[2:4] , 2 , fifthQuantile ))

This gives you a one-row matrix with the values in the positions of the original columns!
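To get from per-column cut-offs to the single combined data frame the question asks for, something along these lines might work. It is only a sketch under a few assumptions that the thread does not confirm: "top quartile" is taken to mean values at or above the 75% quantile, rows in dataset1 and dataset2 are matched by position, and the names `cutoffs`, `pieces`, `top_quartile`, `set`, `metric1` and `metric2` are made up for illustration.

```r
# Sample data copied from the question
names <- rep(c("t1", "t2", "t3", "t4", "t5"), times = 3)
dataset1 <- data.frame(
  names,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)
dataset2 <- data.frame(
  names,
  metric2_set1 = c(21.5, 13.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric2_set2 = c(12.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 8.1, 17.1, 12.3, 17.3, 8, 1.1, 1.5),
  metric2_set3 = c(2.1, 13.1, 4.15, 2.5, 10.5, 7.1, 21.5, 10, 7.1, 11.1, 12.3, 12.3, 8, 1.45, 1.5)
)

# Assumption: the "top quartile" cut-off is the 75% quantile of each column
cutoffs <- sapply(dataset1[2:4], quantile, probs = 0.75, na.rm = TRUE)

# For each set, keep the rows of dataset1 at or above its cut-off and take
# the same row positions from dataset2 (assumes both are in the same order)
pieces <- lapply(seq_along(cutoffs), function(i) {
  keep <- which(dataset1[[i + 1]] >= cutoffs[[i]])
  data.frame(
    set     = paste0("set", i),
    names   = dataset1$names[keep],
    metric1 = dataset1[[i + 1]][keep],
    metric2 = dataset2[[i + 1]][keep]
  )
})

# Stack the per-set subsets into one final data frame
top_quartile <- do.call(rbind, pieces)

# Correlation between the two metrics within each set
sapply(split(top_quartile, top_quartile$set),
       function(d) cor(d$metric1, d$metric2))
```

Keeping the subsets in long form (one `set` column instead of one column per set) avoids padding with NA if the sets ever keep different numbers of rows.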

I would first gather the data using the tidyr package:

library(tidyr)
df.gathered = gather(dataset1, key = "category", value = "value", -names)

Result:

names  category    value
--------------------------
 t1 metric1_set1  2.50
 t2 metric1_set1  3.10
 t3 metric1_set1  4.50
 t4 metric1_set1  2.50
 t5 metric1_set1 12.00
 t1 metric1_set1  7.10
 t2 metric1_set1  8.50
 t3 metric1_set1 10.00
 t4 metric1_set1 10.10
 t5 metric1_set1 17.80 
 ...  # and similar rows for metric1_set2 and metric1_set3 ...

You can then use group_by from dplyr to get the top quantile (quantile(value, 1), i.e. the maximum) for each name and category:

library(dplyr)
df.gathered %>% group_by(names, category) %>% summarise(Q1 = quantile(value, 1))
names   category    Q1
----------------------------
  t1 metric1_set1  12.3
  t1 metric1_set2  12.3
  t1 metric1_set3  12.3
  t2 metric1_set1  11.0
  t2 metric1_set2  17.3
  t2 metric1_set3  17.3
  ...
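If the goal is to keep the rows in the top quartile rather than a single summary value, the same gathered layout could be filtered per category instead of summarised. This is a sketch rather than the answerer's method: it assumes the cut-off is the 75% quantile within each category, and `top_rows` is a made-up name.

```r
library(tidyr)
library(dplyr)

# dataset1 copied from the question
names <- rep(c("t1", "t2", "t3", "t4", "t5"), times = 3)
dataset1 <- data.frame(
  names,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

top_rows <- dataset1 %>%
  gather(key = "category", value = "value", -names) %>%
  group_by(category) %>%
  # Assumption: keep values at or above the 75% quantile of their category
  filter(value >= quantile(value, 0.75, na.rm = TRUE)) %>%
  ungroup()
```

The names that survive the filter for each category can then be used to look up the matching rows in dataset2.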
