I have a data frame as follows:
> sampledput
V1 V2 V3
1 GSM1010983 adipose Bisulfite-Seq
2 GSM1120330 adipose Bisulfite-Seq
3 GSM1120331 adipose Bisulfite-Seq
4 GSM1282348 adipose Bisulfite-Seq
5 GSM1282357 adipose Bisulfite-Seq
6 GSM906416 adipose ChIP-Seq input
7 GSM906394 adipose H3K27ac
8 GSM1010958 adipose mRNA-Seq
9 GSM1120304 adipose mRNA-Seq
10 GSM1120305 adipose mRNA-Seq
11 GSM621443 adipose derived mesenchymal stem cells ChIP-Seq input
12 GSM621420 adipose derived mesenchymal stem cells H3K27me3
13 GSM621446 adipose derived mesenchymal stem cells H3K36me3
14 GSM621418 adipose derived mesenchymal stem cells H3K4me1
15 GSM621458 adipose derived mesenchymal stem cells H3K4me3
16 GSM670020 adipose derived mesenchymal stem cells H3K9ac
17 GSM621398 adipose derived mesenchymal stem cells H3K9me3
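I want to keep the rows that share the same value in column V2 (for example, adipose), where the values in column V3 for that group include Bisulfite-Seq, H3K27ac, ChIP-Seq input and mRNA-Seq. If a value is duplicated in V3, taking just one of the rows is enough; as you can see, I selected only one row each for mRNA-Seq and Bisulfite-Seq.
So in this case, the output I would get is: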
5 GSM1282357 adipose Bisulfite-Seq
6 GSM906416 adipose ChIP-Seq input
7 GSM906394 adipose H3K27ac
8 GSM1010958 adipose mRNA-Seq
Here is the dput:
structure(list(V1 = structure(c(2L, 5L, 6L, 7L, 8L, 17L, 16L,
1L, 3L, 4L, 12L, 11L, 13L, 10L, 14L, 15L, 9L), .Label = c("GSM1010958",
"GSM1010983", "GSM1120304", "GSM1120305", "GSM1120330", "GSM1120331",
"GSM1282348", "GSM1282357", "GSM621398", "GSM621418", "GSM621420",
"GSM621443", "GSM621446", "GSM621458", "GSM670020", "GSM906394",
"GSM906416"), class = "factor"), V2 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("adipose",
"adipose derived mesenchymal stem cells"), class = "factor"),
V3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 10L, 10L, 10L,
2L, 4L, 5L, 6L, 7L, 8L, 9L), .Label = c("Bisulfite-Seq",
"ChIP-Seq input", "H3K27ac", "H3K27me3", "H3K36me3", "H3K4me1",
"H3K4me3", "H3K9ac", "H3K9me3", "mRNA-Seq"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -17L))
EDIT: A "better" solution
I actually prefer this one, because I think the code is more logical:
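library(dplyr)
sampledput %>% group_by(V2) %>%
  # keep only the V2 groups whose V3 contains all four required values
  filter(all(c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq") %in% V3)) %>%
  # first row per (V2, V3) pair; .keep_all = TRUE (dplyr >= 0.5.0) retains V1
  distinct(V2, V3, .keep_all = TRUE)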
Source: local data frame [4 x 3]
Groups: V2 [1]
V1 V2 V3
(fctr) (fctr) (fctr)
1 GSM1010983 adipose Bisulfite-Seq
2 GSM906416 adipose ChIP-Seq input
3 GSM906394 adipose H3K27ac
4 GSM1010958 adipose mRNA-Seq
This tests, for each value of V2, whether all of the required V3 values are present in that group. It then removes any duplicate (V2, V3) rows.
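The group-level test described above can also be sketched in base R. This is only an illustration on a hypothetical toy data frame (the names toy, required, has_all and kept are made up here, as are the values), not the code from the answer:

```r
# Toy data (hypothetical values, much smaller than sampledput)
toy <- data.frame(
  V1 = c("a1", "a2", "a3", "b1"),
  V2 = c("x", "x", "x", "y"),
  V3 = c("p", "p", "q", "p"),
  stringsAsFactors = FALSE
)
required <- c("p", "q")

# For each V2 group: does its V3 contain every required value?
has_all <- tapply(toy$V3, toy$V2, function(v) all(required %in% v))

# Keep rows from complete groups, then drop repeated (V2, V3) pairs,
# keeping the first occurrence of each pair
kept <- toy[toy$V2 %in% names(has_all)[has_all], ]
kept <- kept[!duplicated(kept[c("V2", "V3")]), ]
print(kept)
```

Group y is dropped because it lacks q, and only the first of the two (x, p) rows survives.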
Original solution
A dplyr solution:
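library(dplyr)
sampledput %>% group_by(V2) %>%
  # keep only rows whose V3 is one of the four required values
  filter(V3 %in% c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq")) %>%
  # first row per (V2, V3) pair; .keep_all = TRUE (dplyr >= 0.5.0) retains V1
  distinct(V2, V3, .keep_all = TRUE) %>%
  # keep only groups in which all four values are present
  filter(length(unique(V3)) == 4)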
Source: local data frame [4 x 3]
Groups: V2 [2]
V1 V2 V3
(fctr) (fctr) (fctr)
1 GSM1010983 adipose Bisulfite-Seq
2 GSM906416 adipose ChIP-Seq input
3 GSM906394 adipose H3K27ac
4 GSM1010958 adipose mRNA-Seq
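Note, however, that calling distinct() on V2 and V3 keeps the first occurrence of each duplicate. In your desired output you listed GSM1282357, whereas my solution returns GSM1010983. I'm not sure whether that matters to you.
You will have to test whether this generalizes to your whole dataset, but it does produce your desired output.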
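If the last occurrence is preferred over the first, one option in base R is duplicated() with fromLast = TRUE. The sketch below is an assumption about what might be wanted, run on a hand-copied subset of the adipose rows from the question (the names adipose and last_rows are hypothetical):

```r
# Subset of the adipose rows from the question, typed in by hand
adipose <- data.frame(
  V1 = c("GSM1282348", "GSM1282357", "GSM906416",
         "GSM906394", "GSM1010958", "GSM1120304"),
  V2 = "adipose",
  V3 = c("Bisulfite-Seq", "Bisulfite-Seq", "ChIP-Seq input",
         "H3K27ac", "mRNA-Seq", "mRNA-Seq"),
  stringsAsFactors = FALSE
)

# fromLast = TRUE marks earlier repeats as duplicates,
# so the last row of each (V2, V3) pair is the one kept
last_rows <- adipose[!duplicated(adipose[c("V2", "V3")], fromLast = TRUE), ]
print(last_rows)
```

On this subset the Bisulfite-Seq row becomes GSM1282357, as in the desired output, but the mRNA-Seq row becomes GSM1120304 rather than GSM1010958, since the last occurrence is taken for every value.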
Maybe a bit too simple, but...
library(dplyr)
result <- sampledput %>% group_by(V2, V3) %>% summarise(V1 = V1[length(V1)])
This returns the last GSM of each (V2, V3) group, as in your ideal output.
We can also use data.table:
library(data.table)
setDT(sampledput)[, .(V1 = last(V1)), .(V2, V3)]