如何按组在 R 中的数据帧中执行 KS.test

我想在 R 中对下面的示例数据帧"数据"执行双样本 Kolmogorov-Smirnov (KS( 测试：

    Protein1    Protein2    Protein3    Protein4    Protein5    Protein6    Protein7    Protein8    Group
Sample1 0.56    1.46    0.64    2.53    1.96    305.29  428.41  113.22  Control
Sample2 0.75    2.29    0.38    4.31    1.72    307.95  492.58  82.75   Control
Sample3 2.05    1.73    2.42    14.75   2.92    523.92  426.94  131.51  Control
Sample4 1.71    5.37    0.68    6.39    2.02    343.05  435.16  123.12  Control
Sample5 13.31   0.94    1.21    3.83    2.83    313.71  327.84  66.8    Control
Sample6 0.36    1.81    0.42    2.25    1.48    335.23  352.55  93.81   Control
Sample7 0.28    3.26    0.49    2.62    1.96    251.49  468.19  80.27   Control
Sample8 1.62    17.01   0.49    2.87    1.7 254.79  402.9   86.8    Control
Sample9 0.59    2.64    0.18    2.93    1.23    388.87  384.53  109.52  Control
Sample10    1.67    3.75    0.43    3.89    1.83    306.28  440.86  97.55   Control
Sample11    15.53   12.02   0.57    1.81    2.31    328.56  352.98  118.18  Control
Sample12    0.94    7.06    35.77   4.98    2.44    389.14  376.18  119.75  Control
Sample13    2.07    1.73    0.38    3.89    2.56    233.81  377.21  72.1    Control
Sample14    3.2 1.38    0.5 4.05    2.51    406.57  538.57  209.16  Patient
Sample15    1.33    2.17    0.46    3.31    1.72    278.04  276.37  79.4    Patient
Sample16    1.48    2.9 0.84    9.27    1.94    332.76  413.66  99.09   Patient
Sample17    2.02    1.96    0.25    4.16    1.96    358.73  383.63  107.46  Patient
Sample18    2.94    2   0.27    3.99    2.55    354.78  493.02  145.36  Patient
Sample19    1.01    8.1 0.35    3.65    1.62    335.18  264.74  145.15  Patient
Sample20    6.95    15.48   2.94    3.64    2.3 307.23  484.38  119.61  Patient
Sample21    0.52    1.38    0.56    3.08    1.86    244.3   304.74  76.87   Patient
Sample22    0.35    2.17    0.38    4.51    2.09    304.68  369.98  151.76  Patient
Sample23    2.26    2.9 0.3 4.44    2.43    302.51  367.51  150.69  Patient
Sample24    3.19    1.96    0.81    2.94    2.15    309.59  362.18  133.49  Patient
Sample25    1.12    2   0.71    3.77    2.42    334.36  358.9   131.35  Patient
Sample26    5.28    8.1 0.81    22.84   2.35    422.18  369.71  76.35   Patient

要在 2 个单独的列之间执行 KS 测试，代码如下：

> ks.test(data$Protein1, data$Protein2, data=data)
    Two-sample Kolmogorov-Smirnov test
data:  data$Protein1 and data$Protein2
D = 0.42308, p-value = 0.01905
alternative hypothesis: two-sided
Warning message:
In ks.test(data$Protein1, data$Protein2, data = data) :
  cannot compute exact p-value with ties

但是，我想按列和组执行此操作。例如，对于 t.test 或 wilcox.test 来说，这样做很容易，因为你可以编码为t.test(y1，y2( 或 t.test(y~x( # 其中 y 是数字，x 是二进制因子但是当涉及到二进制因子时，没有ks.test的代码。谁能帮忙？

最终，我想为所有蛋白质的整个数据帧执行此操作，因为我可以成功地进行t检验，但想为ks.test做如下：

t_tests <- lapply(
  data[, -1], # apply the function to every variable *other than* the first one (group)
  function(x) { t.test(x ~ HealthGroups, data = data) }
)

任何帮助表示赞赏。提前谢谢你。

这是一个非常简单的方法。这使用循环，这在 R 圈中通常不受欢迎。但是，它非常简单且不言自明，这对于新用户来说可能是一个加分项，并且在这种情况下循环太慢没有问题。 (请注意，如果您愿意，可以将其用于使用lapply()，但这仍然是一个循环，只是在外面看起来不同。

只需创建两个具有相同变量的新子集数据框。然后遍历调用 ks.test 的数据帧。输出不是很用户友好 - 它只会说j - 所以我添加了对？writeLines 的调用来打印正在测试的变量的名称。

# I am assuming the original data frame is called d
dc <- d[d$Group=="Control",]
dp <- d[d$Group=="Patient",]
for(j in 1:8){  
  writeLines(names(dc)[j])
  print(ks.test(dc[,j], dp[,j]))  
}
# Protein1
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.30769, p-value = 0.5882
# alternative hypothesis: two-sided
# 
# Protein2
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein3
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein4
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.46154, p-value = 0.1254
# alternative hypothesis: two-sided
# 
# Protein5
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein6
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.15385, p-value = 0.9992
# alternative hypothesis: two-sided
# 
# Protein7
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.38462, p-value = 0.2999
# alternative hypothesis: two-sided
# 
# Protein8
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.46154, p-value = 0.1265
# alternative hypothesis: two-sided
# 
# Warning messages:
# 1: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 2: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 3: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 4: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties

这将创建一个数据框来容纳结果：

dk <- data.frame(Protein=character(8), D=numeric(8), p=numeric(8), stringsAsFactors=F)
for(j in 1:8){  
  k <- ks.test(dc[,j], dp[,j])
  dk$Protein[j] <- names(dc)[j]
  dk$D[j]       <- k$statistic
  dk$p[j]       <- k$p.value
}
dk
#    Protein         D         p
# 1 Protein1 0.3076923 0.5881961
# 2 Protein2 0.2307692 0.8793244
# 3 Protein3 0.2307692 0.8793244
# 4 Protein4 0.4615385 0.1253895
# 5 Protein5 0.2307692 0.8793244
# 6 Protein6 0.1538462 0.9992124
# 7 Protein7 0.3846154 0.2999202
# 8 Protein8 0.4615385 0.1264877

相关内容

最新更新

热门标签：